Data models describe businesses and reveal patterns that would otherwise go unnoticed. The power here is the promise of not flying in the dark. The problem is that many people build planes with one wing, planes without fuel, or planes without a pilot.
Data modeling breaks down in many places:
Logging is an afterthought
Third normal form isn’t achieved in production
Mismatch of schemas between production and analytical systems
You deliver based on requirements and not needs
You build a point solution for your analytics
Logging is an afterthought
Software engineers focus on software. They care much more about functionality than measurements. Think of software engineers as the construction workers in the digital space.
Construction workers need to focus on building strong walls, doors, and fireplaces. Logging is like installing a home security system, which is often a separate skill.
Just like with home security, setting up logging is fraught with problems.
The security system isn’t looking at critical areas of the home and a burglar gets away.
Logging gaps are very common in online systems. Capturing all the metadata needed for every event will dramatically reduce the complexity of your data pipelines.
When I worked at Netflix, there was a logging gap in the network logs: they didn’t include which microservice app generated the traffic. This forced us to initially do a two-petabyte join on IP address and build a pipeline that cost over $2m per year.
After working with service owners to include that field in their network logs, we were able to reduce this pipeline to under $1m per year! This shows the power of logging data ahead of time instead of always solving things in the data pipeline layer.
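To make the fix concrete, here’s a minimal SQL sketch of the before and after. The table and column names (network_logs, ip_app_registry, microservice_app, bytes_sent) are hypothetical stand-ins, not Netflix’s actual schema:

```sql
-- BEFORE the logging fix: attributing traffic to an app requires joining
-- petabytes of network logs against an IP-ownership table (names are made up).
SELECT r.microservice_app,
       SUM(l.bytes_sent) AS total_bytes
FROM network_logs AS l
JOIN ip_app_registry AS r
  ON l.source_ip = r.ip_address
GROUP BY r.microservice_app;

-- AFTER the logging fix: the app name arrives on every event,
-- so the same question is a cheap aggregation with no join at all.
SELECT microservice_app,
       SUM(bytes_sent) AS total_bytes
FROM network_logs
GROUP BY microservice_app;
```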
The security system has too much latency for the footage to be valuable
This happens mostly when you’re looking at the difference between client-side logging and server-side logging.
You want to log your data as close to the event as possible, which means the data generation should almost always happen client-side!
That being said, server-side logging can be a powerful way to start generating data.
Key points to remember here:
Do client-side logging for accuracy and completeness
Do server-side logging because it’s better than nothing and very easy to implement
The security system is focused on the wrong things
Logging the wrong data can drastically increase cloud budgets without much gain
For example, when I worked at Facebook, an analyst added every experiment group that a user was in to every single notification event (50 billion events per day).
This array column accounted for 60% of the size of the entire table. Removing it and asking analysts to answer that question with a join allowed the notifications pipelines to run dramatically more efficiently, and all we had to do was remove a column!
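Here’s a hedged sketch of the join-based alternative, with made-up table names (notification_events, experiment_exposures) rather than Facebook’s real schema:

```sql
-- Instead of carrying an experiment-groups array on all 50 billion daily
-- events, keep exposures in their own table and join only when asked.
SELECT e.experiment_group,
       COUNT(*) AS notifications_sent
FROM notification_events AS n
JOIN experiment_exposures AS e
  ON n.user_id = e.user_id
 AND n.event_date = e.event_date
GROUP BY e.experiment_group;
```

The rare experiment question pays for the join; every other query stops paying for a column that made up 60% of the table.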
Third normal form (3NF) isn’t achieved in production
When I was in college, data modeling meant taking some data tables and minimizing the duplication while maximizing the referential integrity of the data.
Without 3NF modeling in production, YOUR ANALYTICS WILL ALWAYS SUFFER.
Many data engineers get confused about this since third normal form has minimal value in the data lake.
Why it still has value in production:
Deduplication and data minimization
In production, you want to send the minimum amount of data to the end user that you can. 3NF stores each fact exactly once, so there is no duplicated data to ship or keep in sync.
Consistency and referential integrity
Dimensions do not exist in isolation; they always exist in relationship to other dimensions. Data lakes do not have foreign keys that enforce these relationships, but RDBMS systems leverage foreign keys to make sure your dimension relationships stay consistent. This is a HUGE DEAL for analytics.
Simplified transactions and updates
3NF modeling makes adding new high-quality data simple: generally just adding rows of data with INSERT INTO queries, as in the sketch below.
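Here’s a minimal sketch of all three benefits in one place, using hypothetical tables in generic ANSI SQL:

```sql
-- Each fact is stored exactly once (deduplication and data minimization),
-- and the foreign key enforces referential integrity between dimensions.
CREATE TABLE teams (
    team_id   INT PRIMARY KEY,
    team_name VARCHAR(100) NOT NULL
);

CREATE TABLE players (
    player_id   INT PRIMARY KEY,
    player_name VARCHAR(100) NOT NULL,
    team_id     INT NOT NULL REFERENCES teams (team_id)
);

-- Simplified transactions and updates: new high-quality data is just an INSERT.
INSERT INTO teams (team_id, team_name) VALUES (7, 'Example Team');
INSERT INTO players (player_id, player_name, team_id) VALUES (42, 'Example Player', 7);
```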
Don’t undervalue this! If the upstream software engineers make mistakes in this step, you’ll find painful symptoms in your pipeline such as:
Needing to dedupe your dimensional snapshots
Dimensional snapshots should be deduped by default. You shouldn’t need to do anything to manage dimension duplicates.
Orphan dimensions that don’t JOIN
Bad 3NF modeling will make it so there’s a big difference between LEFT JOIN and INNER JOIN in your downstream analytical pipelines.
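Both symptoms are easy to spot in the warehouse. A quick diagnostic sketch, with illustrative table names:

```sql
-- Symptom 1: duplicates in a dimensional snapshot that should never need deduping.
SELECT player_id, COUNT(*) AS copies
FROM player_snapshot
WHERE snapshot_date = DATE '2024-01-01'   -- illustrative partition
GROUP BY player_id
HAVING COUNT(*) > 1;

-- Symptom 2: orphan facts that survive a LEFT JOIN but vanish under an
-- INNER JOIN, which points at broken referential integrity upstream.
SELECT COUNT(*) AS orphan_rows
FROM fact_events AS f
LEFT JOIN player_snapshot AS d
  ON f.player_id = d.player_id
WHERE d.player_id IS NULL;
```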
The solution here is not to patch these things in the pipeline! You need to work with the software engineers upstream and fix the problem at the source. Remember: the further upstream you solve a data quality issue, the more impact you land as a data engineer.
Mismatch of schemas between production and analytical systems
Production tables can be messy. Software engineers need to run ALTER TABLE commands that rename columns, change data types, and sometimes introduce breaking changes.
There should always be a contract between your analytical data and your production data. The most common options for contracts are Thrift and Avro. Compiling Thrift schemas is how I maintained consistency between production tables and my data pipelines at Airbnb!
If a breaking change is introduced, that should bump the version of the snapshot table so a new schema can be created. It’s then the data engineer’s responsibility to figure out how to marry the old and new snapshot schemas in the master data!
Remember, production database snapshots should be treated as immutable to preserve exactly what the values in the database were at that moment in time!
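As a sketch of what marrying the schemas can look like, assume a hypothetical breaking change where v2 of a snapshot renamed user_name to full_name:

```sql
-- Both snapshot versions stay immutable; the master view reconciles them.
CREATE VIEW users_master AS
SELECT user_id,
       user_name AS full_name,   -- map the old v1 column onto the new name
       snapshot_date
FROM user_snapshot_v1
UNION ALL
SELECT user_id,
       full_name,
       snapshot_date
FROM user_snapshot_v2;
```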
You deliver based on requirements and not needs
Data engineers often get asked to deliver “everything” data models. These models are supposed to answer every question analytics has about the business area.
Often these requirements aren’t as real as they appear! As a data engineer, you need to validate every single column and what value it provides to the business.
Analytics often wants you to add columns “just in case.” 80% of the time these requests aren’t worth it; they bloat the data model for queries that run less than once per quarter.
Assess the technical feasibility of every column and data point you bring in! When it comes to data modeling, remember, less is more. Even if a column is easy to add, ask the following questions:
Does this make the table less approachable?
What percentage of our users will use this column?
Designing your tables with usability in mind is what separates great data engineers from good ones.
Remember, for very infrequently used columns, consider using a Map data type in a column named other_properties, like they do at Netflix. It’s better to do this than to have a table with 100+ columns that is intimidating to query!
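A minimal sketch of this pattern in Spark-style SQL (the table, keys, and values are made up for illustration):

```sql
-- Rarely used attributes live in one map column instead of 100+ dedicated columns.
CREATE TABLE user_events (
    user_id          BIGINT,
    event_type       STRING,
    event_timestamp  TIMESTAMP,
    other_properties MAP<STRING, STRING>
) USING PARQUET;

-- The once-a-quarter query can still reach in and grab what it needs:
SELECT user_id,
       other_properties['referral_source'] AS referral_source
FROM user_events
WHERE other_properties['referral_source'] IS NOT NULL;
```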
You build a point solution for your analytics
The opposite of delivering a data model that answers all the questions is one that answers a single question. Both of these patterns are bad.
Analytics often operates with a sense of urgency and wants you to build a pipeline to get one metric. As a data engineer, you should push back on these types of pipelines as much as possible. There are a few reasons for this:
Point pipelines are the most likely to be thrown away later
Point pipelines are more likely to use incorrect definitions of things because they lack conformity and proper data modeling
Robust data modeling that answers a myriad of questions makes many point pipelines obsolete, which lets analytics move faster because they don’t even need you to build a pipeline!
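To illustrate the difference, here’s a hedged sketch using a hypothetical events table. The first query is a point pipeline; the second is a modest reusable model that makes many such requests self-serve:

```sql
-- Point pipeline: answers exactly one question (daily active users) and nothing else.
SELECT event_date,
       COUNT(DISTINCT user_id) AS daily_active_users
FROM events
GROUP BY event_date;

-- Robust daily aggregate: the same DAU metric falls out of it, along with
-- engagement, session, and purchase questions nobody has asked yet.
CREATE TABLE daily_user_activity AS
SELECT event_date,
       user_id,
       COUNT(*)                   AS event_count,
       COUNT(DISTINCT session_id) AS session_count,
       SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) AS purchase_count
FROM events
GROUP BY event_date, user_id;
```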
When should you not push back on these requests?
The information is truly time-sensitive, such as in the lead-up to a big product release or during a genuine business crisis.
You see the business value and have a path forward that converts the point solution into a more robust data model later. Be careful here, though, because later often never comes!
What else would you consider when doing data modeling?
We’re covering all the critical data modeling concepts like the modern data stack, One Big Table, change data capture, and complex data types in the upcoming DataExpert.io Analytics Engineering boot camp that starts on October 14th! You can get 25% off by using code DATAMODEL25 before 10/14!