

How to pass the data modeling round in big tech data engineering interviews
The data modeling interview separates data engineers who can solve business problems efficiently from those who can’t. Being able to deconstruct business requirements into efficient data sets that solve problems is the key skill that you want to demonstrate throughout this entire interview.
Knowledge of the following concepts will serve you very well:
Asking the right questions!
Remember, the interviewer will often have more requirements in mind than are given in the problem statement. It’s your job to discover these requirements! The data modeling interview is 50% about technicals and 50% about communication!
Cumulative dimensions versus daily dimensions (Maxime Beauchemin covers this in-depth on his Medium)
Fact data modeling
Aggregate data modeling
Daily metrics vs. cumulative metrics
Enable rapid slice-and-dice of metrics with cube aggregates
OLAP cubes are fundamental to rapid analytics.
Instead of keeping one row per raw event, your data gets crunched down to one row per combination of dimension values.
You can model the dimensions by time, product, location, etc. This enables you to “slice” the aggregates you care about.
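As a quick illustration, “slicing” usually just means filtering or grouping a pre-aggregated table instead of scanning raw events. The table and column names below are hypothetical, and the syntax assumes a Spark SQL-style dialect:
-- A minimal sketch of slicing a cube aggregate by one dimension
-- (hypothetical table and column names)
SELECT
    dim_country,
    SUM(m_num_events) AS num_events
FROM some_daily_cube    -- hypothetical pre-aggregated cube table
WHERE partition_date BETWEEN '2023-01-01' AND '2023-01-31'
GROUP BY dim_country;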
Example - modeling user growth
Oftentimes in interviews you’ll be given a vague problem like, “Give me an efficient data model that models user connection growth on a social media site. Here are the upstream schemas.”
connection_events
event_time TIMESTAMP
sending_user_id BIGINT
receiving_user_id BIGINT
event_type STRING (values [“sent”, “reject”, “accept”])
event_date DATE PARTITION
active_user_snapshot (contains one row for every active user on snapshot_date)
user_id BIGINT
country STRING
age INTEGER
username STRING
snapshot_date DATE PARTITION
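To make the partitioning concrete, here is a sketch of the upstream tables as Hive/Spark-style DDL (the types come from the schemas above; the exact syntax is an assumption about the warehouse):
-- A sketch of the upstream schemas as Hive/Spark-style DDL
CREATE TABLE connection_events (
    event_time        TIMESTAMP,
    sending_user_id   BIGINT,
    receiving_user_id BIGINT,
    event_type        STRING    -- 'sent', 'reject', or 'accept'
)
PARTITIONED BY (event_date DATE);

CREATE TABLE active_user_snapshot (
    user_id  BIGINT,
    country  STRING,
    age      INT,
    username STRING
)
PARTITIONED BY (snapshot_date DATE);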
So how do you take these two schemas and create a data model that efficiently tracks how many connections are sent, accepted, and rejected?
One of the first indicators here that you should lean into cumulative table design is that active_user_snapshot only contains the users who were active on each snapshot_date, not every user.
So to start off, you’d want to cumulate active_user_snapshot into a table called users_cumulated.
The schema of this table might look something like:
users_cumulated
user_id BIGINT
dim_is_active_today BOOLEAN
l7 INTEGER (how many days were they active in the last 7 days)
active_datelist_int INTEGER (a binary integer whose bits track the last month of activity history, e.g. a value of binary 101 means active today, inactive yesterday, and active the day before; see this article on how to leverage powerful data structures like this)
dim_country STRING
dim_age INTEGER
partition_date DATE PARTITION
This table is populated by taking active_user_snapshot WHERE snapshot_date = ‘today’ and FULL OUTER JOINing it with users_cumulated WHERE partition_date = ‘yesterday’.
This table will have one row for every user each day regardless of whether they are active or not.
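A minimal sketch of one day of that cumulation, assuming a Spark SQL-style dialect (where shiftleft, bit_count, and the bitwise | and & operators are available) and using placeholder dates:
-- Build the '2023-01-02' partition of users_cumulated from today's snapshot
-- and yesterday's cumulative partition
INSERT OVERWRITE TABLE users_cumulated PARTITION (partition_date = '2023-01-02')
SELECT
    user_id,
    dim_is_active_today,
    -- l7 = number of set bits among the 7 most recent days of the datelist
    bit_count(active_datelist_int & 127) AS l7,
    active_datelist_int,
    dim_country,
    dim_age
FROM (
    SELECT
        COALESCE(t.user_id, y.user_id)        AS user_id,
        t.user_id IS NOT NULL                 AS dim_is_active_today,
        -- shift yesterday's history left one bit and set today's bit if active;
        -- a real pipeline would also mask this to ~31 bits or use a BIGINT
        shiftleft(COALESCE(y.active_datelist_int, 0), 1)
            | IF(t.user_id IS NOT NULL, 1, 0) AS active_datelist_int,
        COALESCE(t.country, y.dim_country)    AS dim_country,
        COALESCE(t.age, y.dim_age)            AS dim_age
    FROM (SELECT * FROM active_user_snapshot WHERE snapshot_date  = '2023-01-02') t
    FULL OUTER JOIN
         (SELECT * FROM users_cumulated      WHERE partition_date = '2023-01-01') y
        ON t.user_id = y.user_id
) combined;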
In the interview, you’ll probably need to come up with the schema above and a diagram that looks something like this.
Great, now we have a good user dimension table to use further downstream.
The next things we need to build are two tables: one daily dimension table called daily_user_connections and one cumulative table called user_connections_cumulated.
We create daily_user_connections for the purpose of simpler backfills in the future and fast daily-level analytics.
You can imagine a schema like
daily_user_connections
sender_user_id BIGINT
receiver_user_id BIGINT
sent_event_time TIMESTAMP
response_event_time TIMESTAMP (this is NULL if they have not accepted or rejected)
connection_status STRING [“accepted”, “rejected”, “unanswered”]
partition_date DATE PARTITION
user_connections_cumulated
(the same schema, except it contains all historical connections)
The schemas above and a diagram that looks something like this would be expected in the interview.
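A sketch of building one day of daily_user_connections from connection_events might look like the following, assuming a Spark SQL-style dialect and assuming that accept/reject events keep the same sending_user_id/receiving_user_id orientation as the original request (worth confirming with the interviewer); a real pipeline would also need to match responses to requests sent on earlier days:
INSERT OVERWRITE TABLE daily_user_connections PARTITION (partition_date = '2023-01-02')
SELECT
    sending_user_id                                                 AS sender_user_id,
    receiving_user_id                                               AS receiver_user_id,
    MIN(IF(event_type = 'sent', event_time, NULL))                  AS sent_event_time,
    -- MIN ignores NULLs, so this picks up the accept/reject time if one exists
    MIN(IF(event_type IN ('accept', 'reject'), event_time, NULL))   AS response_event_time,
    CASE
        WHEN MAX(IF(event_type = 'accept', 1, 0)) = 1 THEN 'accepted'
        WHEN MAX(IF(event_type = 'reject', 1, 0)) = 1 THEN 'rejected'
        ELSE 'unanswered'
    END                                                             AS connection_status
FROM connection_events
WHERE event_date = '2023-01-02'
GROUP BY sending_user_id, receiving_user_id;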
Now that we have daily_user_connections, user_connections_cumulated, and users_cumulated, we are ready to JOIN them together to create aggregate tables and analytical cubes.
What type of aggregates do we care about? You can use the upstream schemas as hints. (They probably care about age and country). If you’re unsure what aggregates matter, make sure to ask questions in the interview.
If we join user_connections_cumulated to users_cumulated twice, once on sender_user_id and once on receiver_user_id, we can get a new schema. Let’s call it user_connections_aggregated.
user_connections_aggregated
dim_sender_country STRING
dim_receiver_country STRING
dim_sender_age_bucket STRING
dim_receiver_age_bucket STRING
m_num_users BIGINT
m_num_requests BIGINT
m_num_accepts BIGINT
m_num_rejects BIGINT
m_num_unanswered BIGINT
aggregation_level STRING PARTITION KEY
partition_date DATE PARTITION KEY
In this case we can generate this type of table by running a GROUPING SETS query on top of a JOIN between user_connections_cumulated and users_cumulated. We also want to bucketize age into categories like <18, 18-30, 30-50, etc. This lowers the cardinality and makes the dashboards more performant.
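The bucketing itself is just a CASE expression; the exact boundary values below are an assumption:
-- A sketch of age bucketing (the boundary values are illustrative)
CASE
    WHEN dim_age < 18              THEN 'under_18'
    WHEN dim_age BETWEEN 18 AND 29 THEN '18_to_30'
    WHEN dim_age BETWEEN 30 AND 49 THEN '30_to_50'
    ELSE '50_and_over'
END AS dim_sender_age_bucket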
The GROUPING SETS statement would probably look something like this:
GROUP BY GROUPING SETS (
(),
(dim_sender_country),
(dim_sender_country, dim_receiver_country),
(dim_sender_country, dim_sender_age_bucket),
(dim_receiver_country, dim_receiver_age_bucket),
(dim_sender_country, dim_receiver_country, dim_sender_age_bucket, dim_receiver_age_bucket)
)
The aggregation_level partition is determined by which columns we are grouping on. If it’s just dim_sender_country, then the aggregation_level would be the string literal ‘dim_sender_country’. At Facebook, they concatenated the aggregation_levels with ‘__’, so if you grouped on dim_sender_country and dim_receiver_country, the aggregation_level would be ‘dim_sender_country__dim_receiver_country’.
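If you did want to spell it out, a dialect with GROUPING() and CONCAT_WS (Spark SQL, for example) lets you derive aggregation_level inside the same query; joined_connections below is a hypothetical name for the JOIN described above, and only two grouping sets are shown for brevity:
SELECT
    dim_sender_country,
    dim_receiver_country,
    COUNT(1)                                              AS m_num_requests,
    -- GROUPING(col) = 0 when col is part of the current grouping set;
    -- CONCAT_WS skips the NULLs for columns that are not
    CONCAT_WS('__',
        IF(GROUPING(dim_sender_country)   = 0, 'dim_sender_country',   NULL),
        IF(GROUPING(dim_receiver_country) = 0, 'dim_receiver_country', NULL)
    )                                                     AS aggregation_level
FROM joined_connections
GROUP BY GROUPING SETS (
    (dim_sender_country),
    (dim_sender_country, dim_receiver_country)
);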
A lot of the details here around GROUPING SETS aren’t needed in the interview though. You just need to talk about the different grains and aggregates you need to produce, not the nitty-gritty details I’m going over here.
Aside from the schema for user_connections_aggregated, you’ll need to produce a diagram that looks something like this:
You’ll see that user_connections_aggregated is an OLAP cube that is ready to be sliced and diced by analysts!
The last piece of the puzzle is coming up with metrics based on this aggregate table. There are easy ones like:
rejection_rate which is defined as m_num_rejects / m_num_requests
acceptance_rate which is defined as m_num_accepts / m_num_requests
connections_per_user which is defined as m_num_accepts / m_num_users
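A sketch of pulling those metrics out of the country-level slice of the cube (table and column names come from the schema above; the date is a placeholder):
-- Compute the metrics from the cube's country-level slice
SELECT
    dim_sender_country,
    m_num_rejects / m_num_requests AS rejection_rate,
    m_num_accepts / m_num_requests AS acceptance_rate,
    m_num_accepts / m_num_users    AS connections_per_user
FROM user_connections_aggregated
WHERE aggregation_level = 'dim_sender_country'
  AND partition_date = '2023-01-02';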
You don’t have to come up with a ton of metrics during the interview, but a few are a good way to wrap up the modeling exercise!
After defining a few important business metrics, you’ll end up with a diagram that looks something like this:
Conclusion
If you can produce diagrams and schemas like the ones above and talk intelligently about the tradeoffs, you’ll pass this interview round with ease.
I’ve found this interview round to be fun and engaging since the correct answer is often ambiguous and requires a lot of back and forth with the interviewer!
If you want to learn more about data modeling and other critical data engineering concepts, join my six week intensive course that covers everything from data modeling to Kafka to Flink to Spark to Airflow and more!