How to avoid pipeline backfill nightmares
Tired of the backfill process? Check out these best practices to help ease the pain of backfilling data.
Backfilling is a curse in data engineering; it's a tedious and expensive process. If you'd like to relieve yourself of the backfilling pain, start following these best practices:
Automate When Needed
Manual backfills are the most painful
Treat as Production-grade Pipeline
Cutting corners when backfilling will just create more suffering over the long term
Divide And Conquer
Don’t try to backfill all of history at the same time. It’s like eating an elephant in one bite
Staging Area
Backfilled data should be loaded into a staging area first, tested, then promoted to production
Effective Communication
Don’t backfill in silence and wait for your analytics stakeholders to find surprising results
Limit Backfill Frequency
Backfills shouldn’t be happening often. Plan ahead to minimize the number of times you do them
Avoid Being a Resource Hog
Be a conscientious engineer and remember your backfill might slow down other engineers' pipelines
Automate When Needed
Consider automating your backfill process, either by building a dedicated pipeline or with some scripting. Automation is only worth the effort when you're working at scale, where performance and cost come into play; as SeattleDataGuy mentions in his article, it can be overkill for smaller jobs.
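If lightweight scripting is the right fit, the basic shape is just a loop over run dates that calls whatever job your pipeline already exposes. The sketch below is a minimal, hypothetical example; run_pipeline_for_date stands in for your actual trigger (an API call, a job submission, or a parameterized query).

```python
from datetime import date, timedelta

def run_pipeline_for_date(run_date: date) -> None:
    # Hypothetical placeholder: trigger your existing job here,
    # e.g. submit a Spark job or run a parameterized SQL statement.
    print(f"Backfilling partition for {run_date.isoformat()}")

def backfill(start: date, end: date) -> None:
    # Re-run the pipeline once per day between start and end (inclusive).
    current = start
    while current <= end:
        run_pipeline_for_date(current)
        current += timedelta(days=1)

backfill(date(2024, 1, 1), date(2024, 1, 7))
```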
Treat as Production-grade Pipeline
Dedicated backfill pipelines should be treated as production-grade pipelines to avoid unnecessary maintenance costs and future technical debt.
Backfill pipelines run infrequently, but they should be held to the same quality standards as production pipelines.
A backfill pipeline should also leverage the same code as the incremental pipeline (sketched below).
💡 A dedicated backfill pipeline will not be required for most use cases. With the right tooling and approach, routine pipelines can also be leveraged for backfilling.
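One way to achieve that reuse, sketched below with hypothetical function and parameter names, is to keep the transformation logic behind a single date-ranged entry point so the routine incremental run and a backfill differ only in the window they pass in.

```python
from datetime import date

def transform_and_load(start: date, end: date) -> None:
    # Single source of truth for the transformation logic; both the
    # incremental run and any backfill go through this function.
    print(f"Transforming and loading rows from {start} to {end}")

def daily_incremental_run(run_date: date) -> None:
    # The routine pipeline processes only the latest day.
    transform_and_load(run_date, run_date)

def backfill_run(start: date, end: date) -> None:
    # A backfill reuses the exact same logic over a wider window.
    transform_and_load(start, end)

daily_incremental_run(date(2024, 6, 1))
backfill_run(date(2023, 1, 1), date(2023, 12, 31))
```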
Divide And Conquer
When dealing with large volumes of backfill data, it's worth considering a divide-and-conquer approach: it can improve efficiency, reduce errors, and make the process more manageable. Split the backfill into smaller periods based on size, requirements, or other factors (a sketch of this chunking follows the list below). For example:
Running a backfill for the last ten years could be done per year
Running a backfill for the last year could be done per quarter
Profiling your upstream data before backfilling is critical to making the right call on how to divide and conquer. This approach has a few other benefits as well:
You get the opportunity to make data available to end users on a per-period basis instead of waiting for the whole backfill to finish.
Running in pieces makes it easier to recover from failures, since one big long-running job can fail for many reasons.
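As an illustration of the chunking itself, here is a small sketch that splits a multi-year range into per-year pieces. The dates and the per-year granularity are just examples; each chunk would be handed to whatever actually runs your backfill.

```python
from datetime import date

def yearly_chunks(start: date, end: date):
    # Split [start, end] into per-year (start, end) pairs so each piece
    # can be backfilled, validated, and published independently.
    chunks = []
    for year in range(start.year, end.year + 1):
        chunk_start = max(start, date(year, 1, 1))
        chunk_end = min(end, date(year, 12, 31))
        chunks.append((chunk_start, chunk_end))
    return chunks

# Ten years of history becomes ten smaller, independently recoverable runs.
for chunk_start, chunk_end in yearly_chunks(date(2015, 1, 1), date(2024, 12, 31)):
    print(f"Backfill {chunk_start} -> {chunk_end}")
```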
Staging Area
Write-Audit-Publish (WAP) is a common pattern used across the industry for auditing data before publishing it. A staging area makes it possible to apply this pattern. This is how it goes (a sketch follows at the end of this section):
Write to staging area
Audit the staging table. Validate the entire data set rather than one partition at a time, and skip or reduce checks that add too much friction from historical data quality failures; surfacing every historical anomaly is often an ineffective use of your time when backfilling
Move data from staging to the production table, depending on the use case:
For an incremental or partial backfill:
Upsert into the production table in a data warehouse or lakehouse
Swap partitions in a data lake
For a full-refresh backfill, swap the whole table (blue-green deployment)
The whole process should be part of the pipeline. In Airflow, for example, a production DAG can be reused for backfills through the built-in airflow dags backfill command, which creates multiple runs of the same DAG over a historical date range.
⭐ This (WAP) one is critical for data quality. This pattern treats publishing to production as a contract. Write to a staging table, run your quality checks, if they pass, move the data from staging to production. Zach Wilson (source: WAP)
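As a rough sketch of the pattern described above (assuming a warehouse client that exposes an execute(sql) method returning rows, and with placeholder table and column names), a backfill run might look like this:

```python
def write_audit_publish(client, start_date: str, end_date: str) -> None:
    # 1. Write: load the backfilled range into a staging table.
    client.execute(f"""
        INSERT INTO analytics.orders_staging
        SELECT * FROM raw.orders
        WHERE order_date BETWEEN '{start_date}' AND '{end_date}'
    """)

    # 2. Audit: validate the whole staged range before it touches production.
    rows = client.execute(
        "SELECT COUNT(*) FROM analytics.orders_staging WHERE order_id IS NULL"
    )
    if rows[0][0] > 0:
        raise ValueError("Audit failed: staging table contains NULL order_id values")

    # 3. Publish: promote the audited data into the production table.
    client.execute(f"""
        DELETE FROM analytics.orders
        WHERE order_date BETWEEN '{start_date}' AND '{end_date}'
    """)
    client.execute("INSERT INTO analytics.orders SELECT * FROM analytics.orders_staging")
```

In practice the publish step would be whichever option fits your use case from the list above: an upsert, a partition swap, or a full table swap.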
Effective Communication
Communication is crucial when changing data with backfills. If you just start changing metrics, downstream consumers will get concerned and flood you with pings.
Each step of the backfill should be communicated:
Share the reasoning behind the backfill and when it should be completed
Complex backfills often run behind schedule. Communicate these delays as well.
Once the backfill is complete, communicate the expected metric shifts from the changes you introduced.
Talk about when the old data sets will be deprecated.
Limit Backfill Frequency
Backfills on large-scale data are expensive in terms of both compute and time. To make the most of each one, plan your backfills ahead and limit their frequency.
Revisit the data model thoroughly to catch relevant changes beyond the immediate scope. This minimizes post-backfill requests like "add a new X column".
Combine multiple changes into one backfill to save cost and time. For example, you could set a backfill schedule of once every N periods.
In some critical business cases, on-demand backfill is acceptable, but it should be prioritized based on its impact and urgency to ensure it addresses key issues without disrupting regular operations.
Avoid Being a Resource Hog
Large backfills can be disruptive when performed on a shared cluster, since the backfill can end up hogging all the resources at the expense of other production pipelines. It is important to find the right balance between not being a resource hog and having your backfill complete in time. To tackle this, consider:
Scheduling backfills during low-traffic periods to minimize the impact on other tasks and speed up the backfill process.
A common high-traffic period to avoid is around UTC midnight (5 PM Pacific) each day.
Monitoring resource usage closely and adjusting the process and compute if the backfill starts affecting other workloads.
Optimizing your job to avoid unnecessary memory consumption when using frameworks like Spark (a configuration sketch follows below).
In big tech, they use dedicated backfill resource queues to avoid resource conflicts with production pipelines.
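For Spark-based backfills, one way to keep the job from starving other workloads is to cap its resources up front. The settings below are real Spark/YARN configuration keys, but the queue name and limits are illustrative; tune them for your own cluster and check which options your cluster manager supports.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("orders_backfill_2015_2024")
    # Run in a separate YARN queue, if your cluster provides one for backfills.
    .config("spark.yarn.queue", "backfill")
    # Cap how much of the shared cluster the backfill can grab.
    .config("spark.executor.instances", "20")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)
```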
Conclusion
Backfills are a necessary evil in data engineering. Doing them correctly will save you tons of money in cloud costs and more importantly, save you precious engineering time that would’ve been spent screaming at Airflow.
The DataExpert.io academy helps you learn best practices like backfilling, data modeling, and more. We’re launching a live Analytics Engineering boot camp from October 14th to November 18th!
The first ten people to use code BACKFILL30 at DataExpert.io/backfill30 get 30% off!
What other things should you consider when backfilling? Make sure to share this with your friends and on social media if you found it helpful!