When (& Why) You Should Use Change Data Capture

When (& Why) You Should Use Change Data Capture

Change Data Capture (CDC) can simplify and improve both your application and data architectures. The trick is to figure out the most effective use cases where employing CDC will have the desired impact. In this blog, I’m going to unpack two useful CDC use cases: The first is streaming data to your data warehouse and the second is event-driven architectures. By no means are these the only two use cases for change data capture, but they are excellent examples for demonstrating the ways that CDC can simplify application and data architecture. 

What is Change Data Capture (CDC)?

Change data capture is a set of technologies that allow you to identify and capture data that has changed in your database, so that you can take action using the data at a later stage. 

Use CDC For Streaming Data to Your Data Warehouse

Streaming data from your database into your data warehouse goes through a process called ETL or ELT. CDC can make this process more efficient. 

What are ETL & ELT?

  • ETL stands for Extract Transform Load whereby you take the data from your primary database, extract it, do some data transformations on it (aggregations or joins) and then put those into your data warehouse for the purposes of analytics queries. 
  • ELT is a more common concept these days, where instead of transforming before you load, you actually load the raw data into your data warehouse -then do those aggregations and joins later. 

Batch Data vs Streaming Data

Traditional ETL is based on the batch loading of data. You would achieve this by either doing a nightly job where you do one big query to extract all the data from your database to then refresh your data warehouse, or you poll your database on some periodic cadence, for instance every half hour or an hour, to get the new data and just load that new data into your data warehouse. Either way there are three big downsides to this process:

  1. Periodic spikes in load: These large queries impact the latency and ultimately the user experience, which is why a lot of companies tend to schedule spikes in low traffic periods. 
  2. Network provisioning: Sending all that data puts a lot of strain on your network. And because you have big spikes in network costs and bytes that you’re sending over the network, you have to provision your network to be able to handle peak traffic and peak batch sending of data.
  3. Delayed business decisions: Business decisions based on the data are delayed by your polling frequency. So if you update your data every night that means you can’t query what happened yesterday until the next day. 

Using change data capture to stream data from your primary database to your data warehouse solves these three problems for the following reasons:

  1. CDC does not require that you execute high load queries on a periodic basis, so you don’t get really spiky behaviors in load. While changefeeds are not free, they are cheaper and they are spread out evenly throughout the day. 
  2. Because the data is sent continuously and in much smaller batches, you don’t need to provision as much network in order to make that work, and you can save money on network costs. 
  3. Because you’re continuously streaming data from your database to your data warehouse, the data in your warehouse is up-to-date, allowing you to create real-time insights, giving you a leg up on your competitors because you’re making business decisions on fresher data. 

Use CDC for Event-Driven Architectures

In event-driven architectures, one of the hardest things to accomplish is to safely and consistently deliver data between service boundaries. Typically, an individual service within an event-driven architecture needs to commit changes to both that service’s local database, as well as to a messaging queue, so that any messages or pieces of data that need to be sent to another service can do so. But this is challenging. What happens if your message commits to your database but not to the messaging queue? What happens if the message gets sent to the services but it doesn’t actually commit in your database? 

To make this more concrete, let’s think about an imaginary (and currently impossible) social event app called MeetWithMeTomorrow. Users can go into the application, create an event, invite their friends, and then confirm that event. And when that event is confirmed, a push notification is sent to your friends so that they know where to meet you. In this mock architecture the data moves like this:

  1. The user creating the event will send the event to both the event tracking database as well as to a kafka messaging queue, 
  2. That would then propagate that over to the notification service,  
  3. Which would then send out the push notifications to each of the individuals that you invited. 

The problem with this architecture is that sometimes the kafka queue does not receive your message. When this happens, push notifications don’t get sent, but the event creator thinks they have been. That’s a confusing user experience. 

Conversely, if the event makes it to the kafka queue and the push notifications get sent to that user’s friends but it doesn’t get committed to the database, the user doesn’t know that those push notifications were sent. So the users friends will show up, but the user won’t. Another bad experience. 

There are workarounds that address this issue. You can add application logic to only consider an event created if it detects both that the message was committed to the event tracking database and that it was successfully published to your kafka queue. In this case, for instance, your application can be a subscriber to the kafka topic that you’re writing to, that way, it can know that the message properly propagated. The problem here is that you run into scenarios where you’re continually trying to write to your event tracking database, and the message successfully gets to the kafka queue. Eventually, that event will either be created or rolled back - but a notification has already been sent to that user’s friends even though the user sees this transaction as pending. Additionally, this adds a lot of complicated application logic that you really don’t need to achieve this goal. 

A cleaner solution is to use the Outbox Pattern in conjunction with change data capture technologies. The general concept of the Outbox Pattern is that in addition to writing to the other tables within your event tracking database in each transaction, you also write to a special outbox table. And instead of synchronously writing to your kafka queue, you wait for the database transaction to commit. You point change data capture to the outbox table and emit the changes to your outbox table and to your kafka queue asynchronously. 

In the MeeWithMeTomorrow example, instead of just writing to your events and your attendees tables in that same write transaction, you also will write to your outbox table. When that transaction commits, change data capture is listening to changes on that table and will emit this new event to your kafka queue which is then emitted to your notification service. 

Three benefits of using the Outbox Pattern with Change Data Capture: 

  1. Life is short - avoid complicated application logic. 
  2. By using an outbox you have a history of the events that should have been emitted. This allows you to audit what your application is doing, but also, if your kafka server is down, or you run into other issues, you can always replay those events. So you never lose any messages that should have been sent to your notification service. 
  3. You can tailor the data stored in your outbox table that is ultimately sent downstream to your services in the format that is most conducive to the consumer. This also means that if you change the schema of your events or your attendees' table, you don’t necessarily have to change what is put into the outbox table - this reduces the system interface churn. 

More Use Cases for Change Data Capture

Using change data capture in event-driven architectures or to stream data to a data warehouse are just two examples of use cases in which CDC can simplify and improve your application and data architectures. There are plenty of other use cases in which CDC delivers similar value. A few that come to mind are:

  • Streaming updates of your search indexes 
  • Streaming analytics and anomaly detection 
  • Streaming updates to your online data stores for your production machine learning models 

Hopefully, the change data capture examples that I’ve provided here are helpful. If you have any further questions please reach out in the CockroachDB Community Slack. If you’d like to see an example of an existing CockroachDB user with an event-driven architecture check out this case study.