8 Comments
Mar 12 · Liked by Zach Wilson

This was great. I do lots of complex joining across big tables in our data lake and data warehouses in order to aggregate real-time data. I'd like to add that distributing your dataflow so that you can aggregate parts of the data separately into intermediate tables and rejoin them later allows you to minimize disk spillage. As mentioned, it's compute and transfer costs that are expensive, not storage.
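A minimal sketch of that pattern in PySpark, with made-up table and column names, just to show the shape: aggregate each large input separately into a small intermediate table, then join the intermediates.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pre_aggregate_then_join").getOrCreate()

# Hypothetical large fact tables in the lake.
events = spark.table("lake.events")
orders = spark.table("lake.orders")

# Aggregate each side on its own first, so the later join only has to
# shuffle the already-reduced intermediates instead of the raw facts.
(events.groupBy("user_id")
       .agg(F.count("*").alias("event_count"))
       .write.mode("overwrite").saveAsTable("staging.event_agg"))

(orders.groupBy("user_id")
       .agg(F.sum("amount").alias("total_amount"))
       .write.mode("overwrite").saveAsTable("staging.order_agg"))

# Rejoin the small intermediates; far less data moves, so far less
# spills to disk.
result = spark.table("staging.event_agg").join(
    spark.table("staging.order_agg"), "user_id")
```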


Hi Zach, I work at Amazon and we recently hit the exact same query performance problem. We are using an RDS PostgreSQL database and we bucket our table by month, but our query pattern is that a user can select a time frame of up to 2 years and group by seller id and product id to get total sales over that window. The queries for big sellers, which have tons of product ids over a long time frame, are really slow; one took 4 minutes to run. Based on your suggestion, I think that instead of bucketing by date we should bucket by seller id to put all data related to a seller on one machine. But my questions are:

1. How do we handle skewed data? Some sellers can have a million products over 2 years while others have only a few, so some buckets will still be really big and perform badly.

2. Is it good practice to have thousands of buckets in RDS?
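For what it's worth, in a distributed engine like Spark (not RDS Postgres, as Zach notes below), one common way to handle a skewed seller is salting: split the hot key across several salt values, pre-aggregate per salt, then fold the salts back together. A rough sketch with made-up table and column names:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salted_seller_agg").getOrCreate()

sales = spark.table("warehouse.sales")   # hypothetical fact table
NUM_SALTS = 16                           # tuning knob, picked arbitrarily

# Stage 1: spread each seller across NUM_SALTS groups so no single task
# has to process a giant seller on its own.
partial = (sales
    .withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))
    .groupBy("seller_id", "product_id", "salt")
    .agg(F.sum("sale_amount").alias("partial_sales")))

# Stage 2: fold the salted partials back into one row per seller/product.
totals = (partial
    .groupBy("seller_id", "product_id")
    .agg(F.sum("partial_sales").alias("total_sales")))
```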

author

Postgres doesn’t shuffle, so this post isn’t applicable to RDS.


Then what kind of DB are you referring to here? And do you have any suggestions on my query issue above?


Oh, do you mean that since RDS isn’t really a distributed system, that’s why there is no shuffle between machines?

author

Correct


ARRAY<STRUCT<post_id, timestamp>> => is this supported in any RDBMS?
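For context, that nested type generally lives in warehouse/lake engines (Spark SQL, BigQuery, etc.) rather than classic row-store RDBMSs, though Postgres can approximate it with arrays of composite types. A sketch of the equivalent Spark schema, with the surrounding field names assumed for illustration:

```python
from pyspark.sql.types import (ArrayType, LongType, StructField,
                               StructType, TimestampType)

# ARRAY<STRUCT<post_id, timestamp>> expressed as a Spark schema;
# the user_id column and field types are assumptions.
schema = StructType([
    StructField("user_id", LongType()),
    StructField("posts", ArrayType(StructType([
        StructField("post_id", LongType()),
        StructField("timestamp", TimestampType()),
    ]))),
])
```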


Wow! The word "shuffling" gave me goosebumps!

I once had to create a pipeline with a Spark job and a data lake that dealt with terabytes of regular telemetry and its master data.

Deduplicating, joining and ordering at the end caused the job to run for more than a day.

Exactly! Bucketing first simplified the rest of the operations and the job finished in around 1 hour!
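A minimal sketch of that setup in PySpark, with hypothetical paths, keys, and bucket counts: write both sides bucketed and sorted on the join key once, so the later join can run bucket-by-bucket without reshuffling the big telemetry table.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucketed_telemetry_join").getOrCreate()

telemetry = spark.read.parquet("s3://lake/telemetry/")   # hypothetical path
master = spark.read.parquet("s3://lake/master_data/")    # hypothetical path

# Pay the shuffle cost once at write time: same bucket count and key on
# both tables.
(telemetry.write.bucketBy(512, "device_id").sortBy("device_id")
          .mode("overwrite").saveAsTable("lake.telemetry_bucketed"))
(master.write.bucketBy(512, "device_id").sortBy("device_id")
       .mode("overwrite").saveAsTable("lake.master_bucketed"))

# Matching bucket specs let Spark join the tables bucket-by-bucket
# instead of shuffling either side again.
joined = (spark.table("lake.telemetry_bucketed")
          .join(spark.table("lake.master_bucketed"), "device_id"))
```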
