Shuffling isn’t just a fancy dance move! Shuffling happens when you aggregate or join datasets in distributed environments like Spark or BigQuery. Once, when I was working at Facebook, we were joining a 50 TB table with a 150 TB table. The shuffle caused by that join took up 30% of all of our compute! I eliminated that shuffle by bucketing both tables ahead of time and then joining!
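A minimal pure-Python sketch of why pre-bucketing eliminates the shuffle (toy table and column names are made up for illustration; in Spark you'd use `bucketBy` on both tables with the same key and bucket count): when both sides are hashed into buckets on the join key ahead of time, matching rows are guaranteed to land in the same bucket, so each bucket pair can be joined locally with no cross-bucket data movement.

```python
# Sketch: hash both tables into buckets on the join key ahead of time.
# Rows with the same key land in the same bucket on both sides, so the
# join runs bucket-by-bucket with no shuffle across buckets.

NUM_BUCKETS = 4

def bucket(rows, key):
    """Partition rows into NUM_BUCKETS lists by hashing the join key."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[hash(row[key]) % NUM_BUCKETS].append(row)
    return buckets

# Hypothetical tables standing in for the two sides of the big join.
users = [{"user_id": i, "name": f"user{i}"} for i in range(10)]
events = [{"user_id": i % 10, "event": "click"} for i in range(30)]

user_buckets = bucket(users, "user_id")
event_buckets = bucket(events, "user_id")

# Each pair of co-bucketed partitions joins independently and locally.
joined = []
for ub, eb in zip(user_buckets, event_buckets):
    names = {u["user_id"]: u["name"] for u in ub}
    for e in eb:
        if e["user_id"] in names:
            joined.append((names[e["user_id"]], e["event"]))

print(len(joined))  # every event finds its user: 30
```

The key property is that both tables use the same hash function and bucket count, which is exactly what bucketed tables give a distributed engine at join time.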
This was great. I do lots of complex joins across big data tables in our data lake and data warehouses to aggregate real-time data. I’d like to add that distributing your dataflow so you can aggregate parts of the data separately into intermediate tables, then rejoin them later, lets you minimize disk spill. As mentioned, it’s compute and transfer costs that are expensive, not storage.
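A tiny sketch of that partial-aggregation pattern (hypothetical data; the two lists stand in for partitions written to intermediate tables): aggregate each partition on its own, then merge the small partial results instead of moving all the raw rows.

```python
# Sketch: aggregate each partition separately into a small intermediate
# result, then combine the intermediates -- only the partials move, not
# the raw rows.

from collections import Counter

# Two "partitions" of raw events, e.g. two date ranges.
part_a = ["us", "eu", "us", "apac"]
part_b = ["eu", "us", "eu"]

# Aggregate each part independently (the "intermediate tables").
agg_a = Counter(part_a)  # us=2, eu=1, apac=1
agg_b = Counter(part_b)  # eu=2, us=1

# Merge the compact intermediates instead of shuffling raw rows.
total = agg_a + agg_b
print(total["us"])  # 3
```

This works because counting (like sums and counts generally) is associative, so partial aggregates can be combined in any order.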
Hi Zach, I work at Amazon, and we recently hit the exact same query performance problem. We’re using an RDS PostgreSQL database and bucketing our table by month. But our query pattern is: a user can select a time frame of up to 2 years and group by seller id and product id to get total sales over that window. Queries for big sellers with tons of product ids over a long time frame are really slow; one took 4 minutes to run. Based on your suggestion, I think instead of bucketing by date we should bucket by seller id to put all data for a seller on one machine. But my questions are: 1. How do we handle skewed data? Some sellers have millions of products over 2 years while others have only a few, so some buckets will still be really big and perform badly. 2. Is it good practice to have thousands of buckets in RDS?
Postgres doesn’t shuffle, so this post isn’t applicable to RDS.
Then what kind of database are you referring to here? And do you have any suggestions for my query issue above?
Oh, do you mean that since RDS isn’t really a distributed system, that’s why there’s no shuffle between machines?
Correct
ARRAY<STRUCT<post_id, timestamp>> => is this supported in any RDBMS?
Wow! The word “shuffling” gave me goosebumps!
I once had to build a pipeline with a Spark job and a data lake that handled terabytes of regular telemetry plus its master data.
Deduplicating, joining, and ordering at the end caused the job to run for more than a day.
Exactly! Bucketing first simplified the rest of the operations, and the job finished in around 1 hour!