Change Data Capture (CDC) can simplify and improve both your application and data architectures. The trick is to figure out the most effective use cases where employing CDC will have the desired impact. In this blog, I’m going to unpack two useful CDC use cases: The first is streaming data to your data warehouse and the second is event-driven architectures. By no means are these the only two use cases for change data capture, but they are excellent examples for demonstrating the ways that CDC can simplify application and data architecture.
Change data capture is a set of technologies that allow you to identify and capture data that has changed in your database, so that you can take action using the data at a later stage.
Streaming data from your database into your data warehouse goes through a process called ETL or ELT. CDC can make this process more efficient.
Traditional ETL is based on the batch loading of data. You would achieve this by either doing a nightly job where you do one big query to extract all the data from your database to then refresh your data warehouse, or you poll your database on some periodic cadence, for instance every half hour or an hour, to get the new data and just load that new data into your data warehouse. Either way there are three big downsides to this process:
Using change data capture to stream data from your primary database to your data warehouse solves these three problems for the following reasons:
In event-driven architectures, one of the hardest things to accomplish is to safely and consistently deliver data between service boundaries. Typically, an individual service within an event-driven architecture needs to commit changes to both that service’s local database, as well as to a messaging queue, so that any messages or pieces of data that need to be sent to another service can do so. But this is challenging. What happens if your message commits to your database but not to the messaging queue? What happens if the message gets sent to the services but it doesn’t actually commit in your database?
To make this more concrete, let’s think about an imaginary (and currently impossible) social event app called MeetWithMeTomorrow. Users can go into the application, create an event, invite their friends, and then confirm that event. And when that event is confirmed, a push notification is sent to your friends so that they know where to meet you. In this mock architecture the data moves like this:
The problem with this architecture is that sometimes the kafka queue does not receive your message. When this happens, push notifications don’t get sent, but the event creator thinks they have been. That’s a confusing user experience.
Conversely, if the event makes it to the kafka queue and the push notifications get sent to that user’s friends but it doesn’t get committed to the database, the user doesn’t know that those push notifications were sent. So the users friends will show up, but the user won’t. Another bad experience.
There are workarounds that address this issue. You can add application logic to only consider an event created if it detects both that the message was committed to the event tracking database and that it was successfully published to your kafka queue. In this case, for instance, your application can be a subscriber to the kafka topic that you’re writing to, that way, it can know that the message properly propagated. The problem here is that you run into scenarios where you’re continually trying to write to your event tracking database, and the message successfully gets to the kafka queue. Eventually, that event will either be created or rolled back - but a notification has already been sent to that user’s friends even though the user sees this transaction as pending. Additionally, this adds a lot of complicated application logic that you really don’t need to achieve this goal.
A cleaner solution is to use the Outbox Pattern in conjunction with change data capture technologies. The general concept of the Outbox Pattern is that in addition to writing to the other tables within your event tracking database in each transaction, you also write to a special outbox table. And instead of synchronously writing to your kafka queue, you wait for the database transaction to commit. You point change data capture to the outbox table and emit the changes to your outbox table and to your kafka queue asynchronously.
In the MeeWithMeTomorrow example, instead of just writing to your events and your attendees tables in that same write transaction, you also will write to your outbox table. When that transaction commits, change data capture is listening to changes on that table and will emit this new event to your kafka queue which is then emitted to your notification service.
Using change data capture in event-driven architectures or to stream data to a data warehouse are just two examples of use cases in which CDC can simplify and improve your application and data architectures. There are plenty of other use cases in which CDC delivers similar value. A few that come to mind are:
Hopefully, the change data capture examples that I’ve provided here are helpful. If you have any further questions please reach out in the CockroachDB Community Slack. If you’d like to see an example of an existing CockroachDB user with an event-driven architecture check out this case study.