Conformed Dimensions Vs Dependency Explosion

Alright everyone, lower your voices. Bring it in. Let’s talk about the thing no one in data engineering is talking about right now. JUST KIDDING! Not another slop post.

For years I built independent data marts, and the Kimball strategy was clear: associate each fact with as many dimensions as possible, and reuse dimensions across facts as much as possible. Fill in the squares in the bus matrix. But with a modern 3-layer data lakehouse in the cloud at a big enterprise, it can't all end up as one big star, because the dependency explosion will slow down changes. So you separate the stars by business case, or by groups of closely related business cases. But then every star needs the employee dimension, for example. We don't want separate employee dimensions across our org, but if every report uses the same employee dimension, that model is tough to change when required.
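To make the "dependency explosion" concrete, here is a rough sketch that counts everything downstream of a dimension. The model names and the graph are made up for illustration; in practice the edges would come from whatever lineage metadata your platform exposes.

```python
from collections import deque

# Hypothetical lineage: each model maps to the models that read from it.
DEPENDENTS = {
    "dim_employee": ["fact_sales", "fact_timesheets", "fact_support_tickets"],
    "dim_product": ["fact_sales"],
    "fact_sales": ["report_revenue", "report_commissions"],
    "fact_timesheets": ["report_utilization"],
    "fact_support_tickets": ["report_sla"],
}


def downstream(model, dependents):
    """Return every model reachable from `model` (breadth-first walk)."""
    seen = set()
    queue = deque([model])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# A change to dim_employee touches far more models than a change to dim_product.
print(len(downstream("dim_employee", DEPENDENTS)))
print(len(downstream("dim_product", DEPENDENTS)))
```

The size of that downstream set is roughly what has to be retested whenever the shared dimension changes, which is why a dimension used by every star is the expensive one to touch.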

Would love to hear how others have handled this and what the benefits and tradeoffs were.

Ideas

  1. If a fact shares most dimensions with a given star, it goes in that star. Benefit - efficiency of development. Drawback - some dependency explosion, leading to just a few very large stars.

OR

  1. Only a select few "key" dimensions are truly shared across stars. The rest are created distinct within a given star, even if we steal code from existing models. It's ok because we use natural keys or hashes of them, so cross-domain analysis across stars is still technically possible. Benefit - individual use cases can select from a small number of "shared" dimensions and build the rest according to requirements. Drawback - similar models are loaded twice, resulting in extra compute, and multiple versions of the truth become possible.
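A minimal sketch of why hashed natural keys keep the second option joinable, assuming every star agrees on the same normalization and hash function. The `hash_key` helper, the `|` separator, the trim-and-uppercase rule, and the sample employee data are all assumptions for illustration, not an established standard.

```python
import hashlib


def hash_key(*natural_key_parts):
    """Deterministic key from natural key parts (hypothetical convention).

    Parts are trimmed and uppercased, then joined with '|', so two stars
    that load the same source row independently produce the same key.
    """
    normalized = "|".join(str(part).strip().upper() for part in natural_key_parts)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Two stars build their own employee dimension, with different attributes.
sales_dim_employee = {
    hash_key("hr", "E1001"): {"name": "Ada", "region": "West"},
}
support_dim_employee = {
    hash_key("HR", " e1001 "): {"name": "Ada", "team": "Tier 2"},
}

# The keys still line up, so cross-star analysis can join on them.
shared_keys = sales_dim_employee.keys() & support_dim_employee.keys()
print(len(shared_keys))
```

This only holds if the normalization rule is itself treated as shared: if one star trims whitespace and another doesn't, the keys silently diverge, which is one way the "multiple versions of the truth" drawback shows up.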

Thank you!

submitted by /u/Truth-and-Power