DataEngineer.io Newsletter

How to save millions by optimizing data pipeline shuffling

Zach Wilson
Mar 11, 2024

Shuffling isn’t just a fancy dance move! It typically happens when you aggregate or join datasets in distributed environments like Spark or BigQuery, because rows that share a key have to be moved onto the same machine. When I was working at Facebook, we had a join between a 50 TB table and a 150 TB table. The shuffle caused by that join took up 30% of all of our compute! I eliminated that shuffle by bucketing th…
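To make the idea concrete, here is a minimal pure-Python sketch of why a join forces a shuffle and how bucketing both sides on the join key can avoid it. The preview is cut off before the actual fix is described, so this illustrates the general technique only, not the approach used at Facebook. Everything in it is an assumption made for illustration: the `users` and `events` toy tables, the helper names, the bucket count of 4, and the modulo "hash", which stands in for whatever partitioner a real engine uses.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # illustrative; both tables must use the same bucket count


def bucket_of(key, num_buckets=NUM_BUCKETS):
    # Toy stand-in for an engine's hash partitioner (assumption, not Spark's hash).
    return key % num_buckets


def chunk(rows, n):
    # Split rows into n contiguous chunks without looking at keys.
    # This models an "unbucketed" layout on disk.
    size = -(-len(rows) // n)  # ceiling division
    return [rows[i * size:(i + 1) * size] for i in range(n)]


def rows_to_move(partitions):
    # Count rows whose key maps to a different partition than the one
    # currently holding them. These are the rows a shuffle would move.
    return sum(
        1
        for pid, rows in enumerate(partitions)
        for key, _ in rows
        if bucket_of(key) != pid
    )


def bucketize(rows):
    # Write each row into the bucket chosen by its key.
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for key, value in rows:
        buckets[bucket_of(key)].append((key, value))
    return buckets


def local_join(left_parts, right_parts):
    # Join bucket i of the left side with bucket i of the right side only.
    # No row ever crosses a bucket boundary.
    joined = []
    for left, right in zip(left_parts, right_parts):
        index = defaultdict(list)
        for key, value in right:
            index[key].append(value)
        for key, value in left:
            for other in index.get(key, []):
                joined.append((key, value, other))
    return joined


# Hypothetical toy data: 8 users and 20 events keyed by user id.
users = [(uid, f"user_{uid}") for uid in range(8)]
events = [(n % 8, f"event_{n}") for n in range(20)]

# Unbucketed layout: matching keys are scattered, so rows must move before joining.
unbucketed_moves = (
    rows_to_move(chunk(users, NUM_BUCKETS)) + rows_to_move(chunk(events, NUM_BUCKETS))
)

# Bucketed layout: matching keys already share a bucket, so nothing moves.
bucketed_users, bucketed_events = bucketize(users), bucketize(events)
bucketed_moves = rows_to_move(bucketed_users) + rows_to_move(bucketed_events)

joined = local_join(bucketed_events, bucketed_users)

print("rows to move without bucketing:", unbucketed_moves)
print("rows to move with bucketing:", bucketed_moves)
print("joined rows:", len(joined))
```

The trade-off is that the cost of placing rows is paid once, when the tables are written, instead of on every join. In a real engine this only works if both tables are bucketed on the join key with compatible bucket counts.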
