I really enjoy your LinkedIn content! One note - the Kimball methodology pushes to denormalize data in the dimension tables. He acknowledges this is a hot topic, but says modelers should almost always resist the urge to normalize. He says the gain in simplicity is worth the duplicated space. He permits outrigger dimensions, but says they should be the exception, not the rule.
Agree, disagree? Thanks!
- Noah
Depends on how you're modeling your data. The whole purpose of this article is to illustrate cases when you should and shouldn't denormalize, and where I've found it to be effective in my career.
I love your content and don't consider myself qualified to question you, but I am very confused by this article. It seems that you're contrasting a more normalized approach (which you associate with Kimball) with a less normalized approach (OBT). As Noah also pointed out, that's the complete opposite of what I'm finding in multiple other sources. I.e. Kimball is generally associated with a star schema approach - a denormalized model which accepts the cost of higher data redundancy for the benefit of faster, simpler queries. To summarize, my other sources generally associate Kimball with denormalization, so can you explain why you associated him with the first, more normalized approach in this article?
Kimball is denormalized. OBT is more denormalized
Got it. Thanks for the quick reply!
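For anyone following the thread, here is a minimal PySpark sketch of that distinction (the tables and column names are made up purely for illustration): in a star schema the dimension is already denormalized but kept as its own table and joined at query time, while OBT folds those attributes straight into one wide table.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical fact and dimension tables, for illustration only.
fact_sales = spark.createDataFrame(
    [(1, 101, 2), (2, 102, 5)], ["sale_id", "customer_id", "quantity"]
)
dim_customer = spark.createDataFrame(
    [(101, "US", "retail"), (102, "DE", "wholesale")],
    ["customer_id", "country", "segment"],
)

# Kimball / star schema: the (denormalized) dimension stays a separate table
# and is joined to the fact at query time.
star_query = fact_sales.join(dim_customer, "customer_id").groupBy("country").sum("quantity")

# One Big Table: the dimension attributes are folded into one wide table up front,
# so downstream queries skip the join at the cost of repeating country/segment on every row.
obt = fact_sales.join(dim_customer, "customer_id")  # materialized once
obt_query = obt.groupBy("country").sum("quantity")
```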
What if we get 10k facts per dimension set? Would it still be optimal to pack them into an array of structs?
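In case it helps frame the question, a rough sketch of what that packing could look like in PySpark (the column names are hypothetical); whether 10k structs per array stays practical likely depends on row size and how often the array has to be exploded back out:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical fact rows: many events per dimension key (customer_id).
facts = spark.createDataFrame(
    [(101, "2024-01-01", 2.0), (101, "2024-01-02", 5.0), (102, "2024-01-01", 1.0)],
    ["customer_id", "event_date", "amount"],
)

# Pack all facts for a dimension key into an array of structs,
# turning the grain into one row per customer.
packed = facts.groupBy("customer_id").agg(
    F.collect_list(F.struct("event_date", "amount")).alias("events")
)
packed.show(truncate=False)
```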
You mentioned you need to take care of data quality in the One Big Table approach. Do you have any article to share about that? Thanks.
That's a great suggestion for me to write another article. There doesn't seem to be much literature on One Big Table yet.
Nice! Looking forward to that!
You mention that the constraint for broadcast joins in Spark, and therefore the general size threshold for considering the switch from Kimball to OBT, is 10 GB. But the default value for spark.sql.autoBroadcastJoinThreshold is 10 MB, not 10 GB. Was this a mistake on your part or is the default value for that config too small in your view?
The broadcast join threshold can be bumped up a lot higher than 10 MB. It struggles after about 10 GB though.
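For reference, a small sketch of both ways to get a broadcast join past the 10 MB default (the 1 GB figure and the table sizes here are just illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, lit

spark = SparkSession.builder.getOrCreate()

# Raise the automatic broadcast threshold (value is in bytes; the default is ~10 MB).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(1024 * 1024 * 1024))  # 1 GB

# Or hint the broadcast explicitly for a specific join, regardless of the threshold.
fact = spark.range(1_000_000).withColumnRenamed("id", "customer_id")
dim = (
    spark.range(1_000)
    .withColumnRenamed("id", "customer_id")
    .withColumn("country", lit("US"))
)
joined = fact.join(broadcast(dim), "customer_id")
```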