Speed is all the rage in data engineering nowadays. Stakeholders want their data sooner, with higher quality, and more efficiently!
All this demand for speed can push data engineers into bad decisions when they don't ask the follow-up questions needed to separate the true requirements from the false ones!
The latency / complexity / quality tradeoff
The lower your data's latency, the more effort it takes to maintain. Pipelines that process data every single minute of every single day behave more like REST APIs and servers than like data pipelines. That difference is not insignificant! Most data engineers feel very comfortable in the space of CRON jobs that fire daily or hourly.
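To make the contrast concrete, here's a minimal sketch of that comfortable space: a daily batch DAG in Airflow (assuming Airflow 2.x; the DAG id and the load logic are hypothetical placeholders).

```python
# A minimal daily batch DAG -- the comfortable CRON space described above.
# Assumes Airflow 2.x; the DAG id and load logic are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_daily_events():
    """Placeholder for the actual extract/transform/load logic."""
    ...


with DAG(
    dag_id="daily_events_batch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # fires once a day; "@hourly" works the same way
    catchup=False,
) as dag:
    PythonOperator(
        task_id="load_daily_events",
        python_callable=load_daily_events,
    )
```

If this job breaks at 2am, it can usually wait until morning. A streaming pipeline cannot.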
Streaming pipelines have the following downsides:
Breakages need to be addressed as soon as possible, just like with servers. This makes on-call rotations much more aggressive and the maintenance burden much higher
Data quality checks are harder to build into streaming pipelines. This is why Lambda vs Kappa architecture is such a debate.
You use Lambda architecture when you want to leverage batch for quality and completeness AND streaming for low latency (see the sketch after this list)
You use Kappa architecture when you want to index on low latency and simplicity
Far fewer data engineers know how to work with streaming data. <10% of all data pipelines in the world are streaming pipelines
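Here's a rough PySpark sketch of the Lambda pattern from the list above: one shared transform feeding a batch layer for completeness and a speed layer for low latency. The paths, Kafka topic, and event schema are all hypothetical.

```python
# Lambda architecture sketch in PySpark: shared logic, two layers.
# Paths, the Kafka topic, and the event schema are hypothetical.
# The streaming read requires the spark-sql-kafka connector on the classpath.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lambda_sketch").getOrCreate()


def enrich(df: DataFrame) -> DataFrame:
    """Shared business logic so the two layers cannot drift apart."""
    return df.withColumn("event_date", F.to_date("event_ts"))


# Batch layer: complete, easy to re-run and backfill, easy to quality-check.
batch = enrich(spark.read.parquet("s3://lake/events/"))
batch.write.mode("overwrite").parquet("s3://lake/events_enriched/")

# Speed layer: the same logic over a Kafka stream for low latency.
stream = enrich(
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    .selectExpr("CAST(value AS STRING) AS raw", "timestamp AS event_ts")
)
(
    stream.writeStream.format("parquet")
    .option("path", "s3://lake/events_enriched_rt/")
    .option("checkpointLocation", "s3://lake/checkpoints/events_rt/")
    .start()
)
```

Kappa drops the batch layer entirely and keeps only the streaming path, trading some quality and backfill tooling for simplicity.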
So promising always-up-to-date, real-time data to stakeholders isn’t free! You need to consider the tradeoffs. Satisfying what stakeholders actually need is much more important than satisfying what they ask for!
Have clarity when talking with stakeholders
Stakeholders will always say they want the data as quickly as possible, with the highest quality possible, providing as much value as possible.
The problem with this line of thinking is that it lacks constraints, and you may ultimately end up building them a Ferrari when a Honda Civic would have been enough!
Here are some questions to ask to get to the root of the ACTUAL latency requirements of your data pipelines.
Are there any other data engineers on my team that know streaming?
Being the only data engineer who knows how to maintain a streaming pipeline is NOT where you want to be on your team. It results in 24/7 on-call and burnout really quickly!
What is the impact of a data delay?
If the answer is that the data analyst has to work with one-day-old data, this is a strong indicator you SHOULD use a batch pipeline
At Facebook, I worked on notifications and notification settings. The analysts really wanted low-latency data. After diving into the actual impact of low latency, we determined that hourly batch was the more suitable solution, since we could keep it as a CRON job and maintain a higher level of pipeline homogeneity
Is there any automated response that happens from this data that is extremely time sensitive?
If the answer is yes, that is a strong indicator you HAVE to use streaming
At Netflix, I worked on threat detection. Waiting a day for the batch pipeline to catch the bad guy makes the product not work. This is a great example of where streaming is a MUST.
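As a rough illustration of that kind of pipeline, here is a minimal consumer that reacts the moment an event arrives. The topic, field names, and response hook are hypothetical, and it assumes the kafka-python package.

```python
# Minimal time-sensitive automated response: react per event, not per day.
# Topic, field names, and the response hook are hypothetical; assumes kafka-python.
import json

from kafka import KafkaConsumer


def block_account(user_id: str) -> None:
    """Placeholder for the real automated response."""
    print(f"blocking {user_id}")


consumer = KafkaConsumer(
    "auth_events",
    bootstrap_servers="broker:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Waiting for tomorrow's batch run would make this response useless.
    if event.get("failed_logins_last_minute", 0) > 100:
        block_account(event["user_id"])
```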
Are there any latency-sensitive predictions that need up-to-date features?
This is where you need to understand the details more. Streaming, batch, and microbatch are all on the table here.
At Facebook, I had a pipeline that deduped notification events. Moving it to microbatch reduced the landing time from 9 hours after midnight to 1 hour after midnight. The 100+ downstream pipelines were able to start running 8 hours sooner! That made a ton of data delay in notifications disappear overnight!
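For flavor, here is what a microbatch dedupe can look like in Spark Structured Streaming. This is not the actual Facebook pipeline; the topic, columns, paths, and trigger interval are hypothetical.

```python
# Microbatch dedupe sketch in Spark Structured Streaming.
# Topic, columns, paths, and trigger interval are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("notif_dedupe").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "notification_events")
    .load()
    .selectExpr("CAST(key AS STRING) AS event_id", "timestamp AS event_ts")
)

deduped = (
    events
    # The watermark bounds how much dedup state Spark has to keep around.
    .withWatermark("event_ts", "1 hour")
    .dropDuplicates(["event_id", "event_ts"])
)

(
    deduped.writeStream.format("parquet")
    .option("path", "s3://lake/notifications_deduped/")
    .option("checkpointLocation", "s3://lake/checkpoints/notif_dedupe/")
    # Microbatch trigger: land data all day instead of once after midnight.
    .trigger(processingTime="10 minutes")
    .start()
)
```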
Picking the right technology for your use case
Every technology has its strengths and weaknesses. Understand that there are a million ways to build a pipeline!
Things to consider:
Pipeline homogeneity matters a lot! Picking technologies your team is already using will help a lot with maintenance and on-call!
On the flip side, learning shiny new tech can keep your engineers motivated. Data engineers often feel like the SQL + Airflow grind will be the death of them!
For batch, pick Spark, BigQuery, or Snowflake. Other options are likely to serve you less well
Apache Beam can be a great choice for streaming since it also has a batch mode that lets one pipeline definition be both (see the sketch after this list)!
Spark Structured Streaming has an experimental continuous processing feature. In my experience, though, Flink is still more robust for true continuous processing, and Spark Structured Streaming should stick with microbatch.
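To illustrate the Beam point above, here is a sketch where the same transform chain runs in batch or streaming depending on one flag. The project, topic, paths, and table are hypothetical placeholders, and the BigQuery table is assumed to already exist.

```python
# Beam's batch/streaming duality: one pipeline definition, two modes.
# Project, topic, paths, and table are hypothetical; the table must exist.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


def run(streaming: bool) -> None:
    opts = PipelineOptions()
    opts.view_as(StandardOptions).streaming = streaming
    with beam.Pipeline(options=opts) as p:
        if streaming:
            raw = (
                p
                | "ReadStream" >> beam.io.ReadFromPubSub(
                    topic="projects/my-proj/topics/events")
                | "Decode" >> beam.Map(lambda b: b.decode("utf-8"))
            )
        else:
            raw = p | "ReadBatch" >> beam.io.ReadFromText("gs://lake/events/*.json")
        (
            raw
            | "Parse" >> beam.Map(json.loads)  # identical logic in both modes
            | "Write" >> beam.io.WriteToBigQuery(
                "my-proj:analytics.events",
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )
```

Which engine actually executes it (Dataflow, Flink, or Spark) is a runner choice, which is exactly where the robustness tradeoffs from the last bullet show up.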
What other considerations would you think about when deciding how to architect your pipeline? If you liked this content please share with your friends! We cover all these topics in even more detail in the DataExpert.io academy! The first 7 people to use code STREAMING20 at checkout can get 20% off the self-paced course!