I first worked with >100 TB pipelines when I joined the Core Growth team at Facebook in 2016, back when it was the best place to work. The first three months were filled with ice cream, bike rides, and lots of fun! Then suddenly my dream paradise turned intense when my boss,
Very appreciative of all the DEs continuously finding ways to make accessing the data for analysis easier.
Those were such hairy tables to deal with. I ran so many different reachability analyses that hit these tables, joining them to other equally hairy advertiser tables.
How to process extremely large (>100 TB) data sets without burning millions
I really found it useful 😁
Great post. The dos and don'ts are my favorite part.
This was super cool to learn about, even as someone mostly focused in frontend land. Thanks, Zach
It seems to me that at 100 TB, Spark will take a lot of time reading the data source before in-memory processing begins. How is this startup time accounted for?
Thanks for the great post. It's not every day we get to work at that scale, so it's nice to get the distilled findings.