11 Comments

I really enjoy your LinkedIn content! One note: the Kimball methodology pushes to denormalize data in the dimension tables. He acknowledges this is a hot topic, but says modelers should almost always resist the urge to normalize. He says the gain in simplicity is worth the duplicated space. He permits outrigger dimensions, but says they should be the exception, not the rule.

Agree, disagree? Thanks!

- Noah

It depends on how you're modeling your data. The whole purpose of this article is to illustrate when you should and shouldn't denormalize, and where I've found it effective in my career.

I love your content and don't consider myself qualified to question you, but I am very confused by this article. It seems that you're contrasting a more normalized approach (which you associate with Kimball) with a less normalized approach (OBT). As Noah also pointed out, that's the complete opposite of what I'm finding in multiple other sources. Kimball is generally associated with the star schema approach: a denormalized model that accepts the cost of higher data redundancy in exchange for faster, simpler queries. Can you explain why you associated him with the first, more normalized approach in this article?

Kimball is denormalized. OBT is even more denormalized.

Got it. Thanks for the quick reply!

What if we get 10k facts per dimension set? Would it still be optimal to pack them into an array of structs?

You mentioned that you need to take care with data quality in the One Big Table approach. Do you have an article to share about that? Thanks.

That's a great suggestion for another article. There doesn't seem to be much literature on One Big Table yet.

Nice! Looking forward to that!

You mention that the constraint for broadcast joins in Spark, and therefore the general size threshold for considering the switch from Kimball to OBT, is 10 GB. But the default value for spark.sql.autoBroadcastJoinThreshold is 10 MB, not 10 GB. Was this a mistake on your part, or is the default value for that config too small in your view?

The broadcast join threshold can be bumped up a lot higher than 10 MB. It struggles past 10 GB, though.
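
For anyone who wants to try this, here is a rough sketch of raising the threshold. The config key spark.sql.autoBroadcastJoinThreshold and its 10 MB default are the ones mentioned above; the helper names and the 512 MB value are made up for illustration and are not a recommendation.

```python
# Sketch: building a higher automatic broadcast-join threshold for Spark.
# The 512 MB figure and the helper names are illustrative assumptions.

# Default for spark.sql.autoBroadcastJoinThreshold, expressed in bytes (10 MB).
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024


def mb_to_bytes(mb):
    """Convert megabytes to bytes, the unit used for the threshold here."""
    return mb * 1024 * 1024


def broadcast_settings(threshold_mb):
    """Return the config entry for a threshold given in megabytes."""
    return {"spark.sql.autoBroadcastJoinThreshold": str(mb_to_bytes(threshold_mb))}


def apply_settings(spark, settings):
    """Apply the settings to an existing SparkSession (needs pyspark installed)."""
    for key, value in settings.items():
        spark.conf.set(key, value)


settings = broadcast_settings(512)  # hypothetical value, tune for your cluster
print(settings)
```

With a live session you would call apply_settings(spark, settings) before running the join. An explicit broadcast hint from pyspark.sql.functions is another way to request a broadcast for one specific join without changing the session-wide threshold.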
