Exporting Data with Changefeeds

Exporting Data with Changefeeds

Exporting data is a crucial tool in any database user’s toolkit. In CockroachDB, the EXPORT command has long provided this essential functionality to:

  • Move data to a different data store for business analytics
  • Migrate data to a new database
  • Archive data in a platform neutral way
  • Seed an application with data

Say, for example, you need to export a sizable amount of JSON to seed a core data store for a streaming service. Sometimes you can pre-seed this data alongside your application with an EXPORT. But if you’re working on the scale of a company like Netflix, EXPORT commands come with some limitations that led us us to explore an additional way to export data.

First of all, EXPORTs have some configurability and integration limitations — but these are minor issues compared to two other problems:

  • EXPORT commands have a natural scale limit: above a certain table size (TBs of data), EXPORTs are likely to run into snags that cause them to restart. Since they must restart from the beginning, it is unlikely a very high data scale EXPORT will ever finish.
  • EXPORTs have limited observability compared to jobs like schema changes and backups, since they aren’t actually jobs themselves.

Now there is a solution: In our latest release, CockroachDB 22.1, we’re introducing a new way to export data using changefeeds (available in CockroachDB Enterprise, our fully managed offering, CockroachDB Dedicated, and CockroachDB Serverless).

(CockroachDB core users are invited to try an experimental feature: EXPERIMENTAL CHANGEFEED FOR. This statement allows users to create a new changefeed to stream row-level changes to the client indefinitely, until the underlying connection is closed or the changefeed is canceled).

Exporting data with changefeeds

Enterprise change data capture (CDC) provides row-level change subscriptions for downstream processing. Now, in CockroachDB 22.1, users can create a new changefeed to stream row-level changes in a configurable format to a configurable sink.

CREATE CHANGEFEED FOR table
INTO [kafka, webhook, cloud 
storage, gc pubsub] WITH 
format=csv, initial_scan=only;

The CREATE CHANGEFEED statement creates a new CockroachDB enterprise changefeed, which targets an allowlist of tables called “watched rows”. Every change to a watched row is emitted as a record in a configurable format (JSON or Avro) to a configurable sink (KafkaGoogle Cloud Pub/Sub, a cloud storage sink, or a webhook sink). You can createpauseresumealter, or cancel an Enterprise changefeed.

In the base case, changefeeds send a stream of continuous updates to a table via a SQL connection or external system. However, using the initial_scan=only option, a changefeed will emit a snapshot of the table and the job will complete. 

Using changefeeds allows us to take advantage of a wider set of integrations and configurability, as well as the robust-at-scale changefeed architecture. 

Automatic data scaling

The huge advantage: in CockroachDB Enterprise (and CockroachDB Dedicated), changefeeds are built to scale organically with your data. How it works:

  • Changefeeds take advantage of parallel processing of data
  • Changefeeds use the jobs system, which comes with enhanced observability and scale infrastructure
  • Changefeeds utilize checkpointing to remain robust as data scales out

Let’s compare CREATE CHANGEFEED  to a traditional EXPORT:

Changefeeds EXPORT
CSV format yes yes
Parquet format not yet yes
Avro, json format yes no
Sends results of arbitrary SELECT statements not yet yes
Sends results of a SELECT * statement yes yes
Multi table support yes no
Expect duplicate messages yes no
Send to cloud storage yes yes
Send to kafka, gc pubsub, webhook yes no
Scales to 10+ TB yes no
Requires enterprise license yes no

For more details about the differences in configurability, check out the export and enterprise changefeed docs. 

Changefeeds offer a great alternative to EXPORTs for users at a high scale of data. There are still a few limitations compared to the EXPORT command, but changefeeds are quickly approaching feature parity. Moreover, changefeeds offer many configuration and destination options beyond those offered by EXPORTs.

Keep Reading

Change Data Capture: Fine Tuning Changefeeds for Performance and Durability

NOTE: This blog requires a fairly in-depth understanding of your application and changefeeds. If you want to learn more …

Read More
An experiment in fuzzy matching, using SQL, with CockroachDB

A recent tweet inspired me to address the need for fuzzy matching by combining some existing capabilities of …

Read More
When (& Why) You Should Use Change Data Capture

Change Data Capture (CDC) can simplify and improve both your application and data architectures. The trick is to figure …

Read More
x
Developer Resources