If you go back ten years, you’ll see that daily dimensional snapshotting ruled (and continues to rule) the world.
There are tons of drawbacks to daily snapshots, such as:
What about intraday changes?
They get squashed and forgotten about.
What about pipelines that require the most up-to-date dimensions in a low-latency way?
Low-latency use cases either accepted slightly inaccurate dimensions or relied on separate “low-latency” stores of dimensions.
Generating slowly changing dimensions (SCDs) from snapshots requires a lot of additional work and isn’t functional (more on this here).
In this article, we will talk about:
What is change data capture and why is it so awesome?
What are the three best ways to do CDC?
In the production database
In the API layer
In the warehouse layer (e.g. using change streams in something like Snowflake)
What is change data capture and why is it so awesome?
Change data capture (CDC) is logging all the changes that happen to your dimensions: every mutation is journaled as it happens.
For example, say I have an app where my favorite food is lasagna and I go to my profile page and update it to curry. CDC would capture this fact and log it somewhere.
These types of immutable logs are extremely powerful!
We can recreate slowly changing dimensions with simple window functions (sketched below)
We can know EXACTLY what the value of a dimension was at any moment in time
These types of advantages make the old world of daily dimensional snapshotting look archaic and so 1990s.
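To make that concrete, here’s a minimal sketch of rebuilding an SCD from a change log with a window function. The table and column names (favorite_food_changes, user_id, favorite_food, changed_at) are hypothetical, just mirroring the lasagna-to-curry example above:

-- each row in the hypothetical favorite_food_changes log is one mutation;
-- LEAD() turns the log into SCD-style validity intervals
SELECT
    user_id,
    favorite_food,
    changed_at AS valid_from,
    LEAD(changed_at) OVER (
        PARTITION BY user_id
        ORDER BY changed_at
    ) AS valid_to   -- NULL means "still the current value"
FROM favorite_food_changes;

A point-in-time lookup is then just a filter: valid_from <= some_timestamp AND (valid_to IS NULL OR valid_to > some_timestamp).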
Implementing CDC in a production database
The simplest and most complete way to implement CDC is using database triggers in Postgres.
You first create a “journaling” table like this:
CREATE TABLE IF NOT EXISTS user_cdc (
    cdc_id BIGSERIAL PRIMARY KEY,
    operation TEXT NOT NULL,   -- 'INSERT', 'UPDATE', or 'DELETE'
    user_id INTEGER,
    old_data JSONB,            -- full row image before the change (NULL on INSERT)
    new_data JSONB,            -- full row image after the change (NULL on DELETE)
    changed_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP
);
Then you need to create the trigger function that writes changes into the CDC table:
CREATE OR REPLACE FUNCTION cdc_users()
RETURNS TRIGGER
LANGUAGE plpgsql
AS $$
BEGIN
    -- TG_OP tells us which statement fired the trigger
    IF TG_OP = 'INSERT' THEN
        INSERT INTO user_cdc (operation, user_id, new_data)
        VALUES ('INSERT', NEW.id, to_jsonb(NEW));
        RETURN NEW;
    ELSIF TG_OP = 'UPDATE' THEN
        -- journal both the before and after row images
        INSERT INTO user_cdc (operation, user_id, old_data, new_data)
        VALUES ('UPDATE', OLD.id, to_jsonb(OLD), to_jsonb(NEW));
        RETURN NEW;
    ELSIF TG_OP = 'DELETE' THEN
        INSERT INTO user_cdc (operation, user_id, old_data)
        VALUES ('DELETE', OLD.id, to_jsonb(OLD));
        RETURN OLD;
    END IF;
    RETURN NULL;  -- unreachable for this trigger, but keeps the function well-defined
END;
$$;
Then finally you need to add the database trigger that wires up the change logging:
CREATE TRIGGER cdc_users_trigger
AFTER INSERT OR UPDATE OR DELETE
ON users
FOR EACH ROW
EXECUTE FUNCTION cdc_users();
It’s really this simple!
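To sanity-check the wiring, here’s a hedged example run. The users table and its favorite_food column are assumptions for illustration; the article never spells out the source table’s schema:

-- assumed source table, for illustration only
INSERT INTO users (id, favorite_food) VALUES (1, 'lasagna');
UPDATE users SET favorite_food = 'curry' WHERE id = 1;

-- both mutations are now journaled, with before/after row images
SELECT operation, user_id, old_data, new_data, changed_at
FROM user_cdc
ORDER BY cdc_id;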
This flavor of CDC has a lot of benefits:
It’s really close to production and will capture 100% of the dimensional changes
You get very fine-grained control over how you manage the changes
It has some drawbacks too:
Database triggers can get expensive and hurt production performance
You need a separate job that pulls the data out of Postgres into the data lake, which puts further pressure on the production database! (A sketch of such an incremental pull is shown below.)
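A sketch of that extraction job’s core query, assuming it keeps a watermark (the highest cdc_id it has already copied) somewhere on the lake side; the :last_cdc_id parameter is illustrative:

-- pull only the journal rows we haven't copied yet
SELECT cdc_id, operation, user_id, old_data, new_data, changed_at
FROM user_cdc
WHERE cdc_id > :last_cdc_id   -- watermark stored by the previous run
ORDER BY cdc_id;

Because it scans the journal by its primary key, the read itself is cheap, but it still runs against the production instance (or a read replica).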
Implementing CDC in a production API
The second way to do CDC is by capturing API requests! This sits one layer above the database, which gives you more flexibility over where the change data ends up!
This is typically done by having the REST API that handles the mutation also log the change data to Kafka.
This flavor of CDC is much more scalable because:
It puts no additional pressure on the production database (no triggers or extra production scraping jobs)
This version of CDC is less-than-ideal because:
What about mutations that don’t happen via the API (e.g. backfills or manual edits in the database)? They get missed entirely
Setting up Kafka to capture CDC changes is a pain (a webhook or something similar could also work)
Implementing CDC in Snowflake
You can also create a change stream directly in Snowflake! It’s very easy!
CREATE STREAM changes ON TABLE bootcamp.users;
Change streams in Snowflake have a bunch of caveats you need to be aware of though!
If two mutations happen to the same row between reads of the change stream, only the net change is returned, so intraday changes get squashed! Depending on your use case, that could be a good thing or a bad thing!
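Here’s a hedged sketch of consuming the stream. The users_history target table and the id / favorite_food columns are assumptions to match the running example; METADATA$ACTION and METADATA$ISUPDATE are Snowflake’s standard stream metadata columns:

-- peeking at the stream does NOT advance its offset
SELECT id, favorite_food, METADATA$ACTION, METADATA$ISUPDATE
FROM changes;

-- consuming the stream in a DML statement DOES advance the offset,
-- which is the usual pattern for landing changes in a history table
INSERT INTO bootcamp.users_history (user_id, favorite_food, change_type, is_update, recorded_at)
SELECT id, favorite_food, METADATA$ACTION, METADATA$ISUPDATE, CURRENT_TIMESTAMP()
FROM changes;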
This version of CDC is EXTREMELY easy to set up and requires zero software engineering or touching production. You may opt for this if you’re a data engineer who is stuck working with data that is already in the data lake!
Comparing the three CDC methods
Which version of CDC would you use in your company? I personally use database triggers for DataExpert.io! You can use the code CDC to get 30% off the DataExpert.io academy subscription!
Also for Lunar New Year, I’m offering 50% off an annual subscription to this newsletter! For only $37.50 you can get delicious updates like these!
Here’s the link to that offer: https://blog.dataengineer.io/LUNAR
What other types of content would you want to see me share? Are you going to implement CDC at your company and wow your engineering managers?
I would not recommend any of these options.
- Database triggers: poor performance, hard to manage. And, as you mentioned, doesn't actually solve the problem of delivering data to the data lake.
- Dual writes with Kafka: it's an anti-pattern! Don't do it. It's really hard to get consistency right without implementing a 2PC protocol (what if your Kafka write fails but the user action succeeds? Or vice versa? And if you actually wait for a Kafka write ACK on every request, you're adding a huge system dependency; again, don't do it).
- Snowflake change streams: I wouldn't consider it a Change Data Capture technique in the context of data integration. It's just a feature of Snowflake.
What you want to look into: tools that implement log-based CDC by leveraging logical replication in Postgres, MySQL, etc. For example: Debezium, Flink CDC, AWS DMS.
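For reference, the Postgres side of log-based CDC boils down to built-in logical replication; a minimal sketch, with illustrative object names (tools like Debezium typically create the replication slot themselves):

-- requires wal_level = logical (ALTER SYSTEM followed by a server restart)
ALTER SYSTEM SET wal_level = logical;

-- a publication scopes which tables the CDC tool receives changes for
CREATE PUBLICATION users_cdc_pub FOR TABLE users;

-- a logical replication slot using the built-in pgoutput plugin
SELECT pg_create_logical_replication_slot('users_cdc_slot', 'pgoutput');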
Check out the many CDC posts and presentations from Gunnar Morling (former Debezium lead), e.g.:
- https://www.infoq.com/presentations/cdc-microservices/
- https://www.decodable.co/blog/cdc-use-cases
Also, a great overview: https://www.confluent.io/events/kafka-summit-london-2024/debezium-vs-the-world-an-overview-of-the-cdc-ecosystem/
Great article. As a software engineer who in several roles has had to dig deep into data changes and integrity, I can say a lack of CDC definitely brings a lot of pain, especially when you need to prove that something was exactly one way at a given point in time in the past!
Coming from a SWE background, I tend to love the Kafka-based version of CDC. One trick I’ve used quite a bit to avoid sacrificing latency is the fire-and-forget pattern. When an endpoint is converted to this method (and you can still have async retries after the server has responded to the HTTP request), it only adds 2-3 ms of latency (max), which is nearly unnoticeable to a human user.