Change data capture (CDC) provides efficient, distributed, row-level changefeeds into a configurable sink for downstream processing such as reporting, caching, or full-text indexing.
## What is change data capture?
While CockroachDB is an excellent system of record, it also needs to coexist with other systems. For example, you might want to keep your data mirrored in full-text indexes, analytics engines, or big data pipelines.
The main feature of CDC is the changefeed, which targets an allowlist of tables, called the "watched rows". There are two implementations of changefeeds:
| Core changefeeds | Enterprise changefeeds |
|------------------|-------------------------|
| Useful for prototyping or quick testing. | Recommended for production use. |
| Available in all products. | Available in CockroachDB Dedicated or with an Enterprise license in CockroachDB Self-Hosted or CockroachDB Serverless. |
| Streams indefinitely until the underlying SQL connection is closed. | Maintains a connection to the configured sink (Kafka, Google Cloud Pub/Sub, Amazon S3, Google Cloud Storage, Azure Storage, HTTP, Webhook). |
| | New in v23.1: Create a scheduled changefeed with `CREATE SCHEDULE FOR CHANGEFEED`. |
| | New in v23.1: Use CDC queries to filter and transform the change data before it is emitted. |
| Watches one or multiple tables in a comma-separated list. Emits every change to a "watched" row as a record. | Watches one or multiple tables in a comma-separated list. Emits every change to a "watched" row as a record in a configurable format (JSON, CSV, Avro) to a configurable sink (e.g., Kafka). |
| | Manage changefeed with `PAUSE`, `RESUME`, `ALTER`, and `CANCEL` statements. |
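For example, a Core changefeed streams results back over the current SQL session, while an Enterprise changefeed runs as a job that emits to an external sink. In this sketch, the database, table, and Kafka address are placeholders:

```sql
-- Core changefeed: streams results back over this SQL connection
-- until the connection is closed.
EXPERIMENTAL CHANGEFEED FOR movr.rides;

-- Enterprise changefeed: runs as a job and emits messages to the
-- configured sink (here, a hypothetical local Kafka broker).
CREATE CHANGEFEED FOR TABLE movr.rides
  INTO 'kafka://localhost:9092'
  WITH updated, format = json;
```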
Refer to Ordering Guarantees for detail on CockroachDB's at-least-once delivery guarantee as well as an explanation of how rows are emitted.
The Advanced Changefeed Configuration page provides detail and recommendations for improving changefeed performance.
## How does an Enterprise changefeed work?
When an Enterprise changefeed is started on a node, that node becomes the coordinator for the changefeed job (Node 2 in the diagram). The coordinator node acts as an administrator: it keeps track of all other nodes during job execution and tracks the changefeed work as it completes. The changefeed job runs across all nodes in the cluster to access changed data in the watched table. Typically, the leaseholder for a particular range (or the range's replica) determines which node emits the changefeed data.
Each node uses its aggregator processors to send checkpoint progress back to the coordinator, which gathers this information to update the high-water mark timestamp. The high-water mark acts as a checkpoint for the changefeed's job progress and guarantees that all changes before (or at) that timestamp have been emitted. In the unlikely event that the changefeed's coordinating node fails during the job, that role moves to a different node and the changefeed restarts from the last checkpoint. If restarted, the changefeed sends duplicate messages from the high-water mark time to the current time. See Ordering Guarantees for detail on CockroachDB's at-least-once delivery guarantee as well as an explanation of how rows are emitted.
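You can observe the high-water mark for a running changefeed with `SHOW CHANGEFEED JOBS`, which reports it in the `high_water_timestamp` column:

```sql
-- List changefeed jobs along with their sink, status,
-- and checkpoint progress (high_water_timestamp).
SHOW CHANGEFEED JOBS;
```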
With `resolved` specified when a changefeed is started, the coordinator sends the resolved timestamp (i.e., the high-water mark) to each endpoint in the sink. For example, when using Kafka this is sent as a message to each partition; for cloud storage, it is emitted as a resolved timestamp file.
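As a sketch, the following starts a changefeed that emits resolved timestamps at most every 10 seconds (the table and sink are placeholders) and shows the shape of a resolved message:

```sql
CREATE CHANGEFEED FOR TABLE movr.rides
  INTO 'kafka://localhost:9092'
  WITH resolved = '10s';

-- A resolved message emitted to the sink looks like:
-- {"resolved": "1652985039842408361.0000000000"}
```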
As rows are updated, added, and deleted in the targeted table(s), the node sends the row changes through the rangefeed mechanism to the changefeed encoder, which encodes these changes into the final message format. The encoder then emits the message to the sink; it can emit to any endpoint in the sink. In the diagram example, this means that messages can emit to any Kafka broker.
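With the default JSON envelope, an update to a watched row is encoded as a message whose `after` field carries the new state of the row (the column names and values below are illustrative):

```json
{"after": {"id": 1, "city": "seattle", "status": "completed"}}
```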
If you are running changefeeds from a multi-region cluster, you may want to define which nodes take part in running the changefeed job. You can use the `execution_locality` option with key-value pairs to specify the locality requirements nodes must meet. See Job coordination using the execution locality option for detail on how a changefeed job works with this option.
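For example, to restrict execution of a changefeed job to nodes in a particular region (the locality tier, table, and sink are placeholders for your own topology):

```sql
CREATE CHANGEFEED FOR TABLE movr.rides
  INTO 'kafka://localhost:9092'
  WITH execution_locality = 'region=us-east-1';
```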
The following known limitations apply to changefeeds:
- Changefeed target options are limited to tables and column families. Tracking GitHub Issue
- Using a cloud storage sink only works with `JSON` and emits newline-delimited JSON files. Tracking GitHub Issue
- Webhook sinks only support HTTPS. Use the `insecure_tls_skip_verify` parameter when testing to disable certificate verification; however, this still requires HTTPS and certificates. Tracking GitHub Issue
- Webhook sinks only support emitting `JSON`. Tracking GitHub Issue
- There is no concurrency configurability for webhook sinks. Tracking GitHub Issue
- Using the `split_column_families` and `resolved` options on the same changefeed will cause an error when using the following sinks: Kafka and Google Cloud Pub/Sub. Instead, use the individual `FAMILY` keyword to specify column families when creating a changefeed. Tracking GitHub Issue
- If a changefeed with `on_error = 'pause'` is running when a watched table is truncated, the changefeed will pause but will not be able to resume reads from that table. Using `ALTER CHANGEFEED` to drop the table from the changefeed and then resuming the job will work, but you cannot add the same table to the changefeed again. Instead, you will need to create a new changefeed for that table. Tracking GitHub Issue
- Changefeed types are not fully integrated with user-defined composite types. Running changefeeds with user-defined composite types is in Preview. Certain changefeed types do not support user-defined composite types; refer to the change data capture Known Limitations for more detail. The following limitations apply:
  - A changefeed in Avro format will not be able to serialize user-defined composite (tuple) types. Tracking GitHub Issue
  - A changefeed emitting CSV will include `AS` labels in the message format when the changefeed serializes a user-defined composite type. Tracking GitHub Issue
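The workarounds mentioned in the limitations above can be sketched as follows; the job ID, table, sink, and column family names are placeholders:

```sql
-- Target individual column families with the FAMILY keyword instead of
-- combining split_column_families with resolved on Kafka or Pub/Sub sinks.
CREATE CHANGEFEED FOR TABLE movr.rides FAMILY f1, TABLE movr.rides FAMILY f2
  INTO 'kafka://localhost:9092';

-- Drop a truncated table from a paused changefeed, then resume the job.
ALTER CHANGEFEED 12345 DROP movr.rides;
RESUME JOB 12345;
```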