Consistent, Distributed, Incremental: Pick Three
Almost all widely used database systems include the ability to back up and restore a snapshot of their data. The replicated nature of CockroachDB’s distributed architecture means that the cluster survives the loss of disks or nodes, and yet many users still want to make regular backups. This led us to develop distributed backup and restore, the first feature available in our CockroachDB Enterprise offering.
When we set out to work on this feature, the first thing we did was figure out why customers wanted it. The reasons we discovered included a general sense of security, “Oops I dropped a table”, finding a bug in new code only when it’s deployed, legally required data archiving, and the “extract” phase of an ETL pipeline. So as it turns out, even in a system that was built to never lose your data, backup is still a critical feature for many of our customers.
At the same time, we brainstormed whether CockroachDB’s unique architecture allowed any improvements to the status quo. In the end, we felt it was important that both backup and restore be consistent across nodes (just like our SQL), distributed (so it scales as your data scales), and incremental (to avoid wasting resources).
Additionally, we knew that backups need to keep only a single copy of each piece of data and should impact production traffic as little as possible. You can see the full list of goals and non-goals in the Backup & Restore RFC.
In this post, we’ll focus on backup and how we made it work.
Step 0: Why We Reinvented the Wheel
One strategy for implementing backup is to take a snapshot of the database’s files, which is how a number of other systems work. CockroachDB uses RocksDB as its disk format and RocksDB already has a consistent backup feature, which would let us do consistent backups without any particular filesystem support for snapshots of files. Unfortunately, because CockroachDB does such a good job of balancing and replicating your data evenly across all nodes, there’s not a good way to use RocksDB’s backup feature without saving multiple copies of every piece of data.
Step 1: Make it Consistent
Correctness is the foundation of everything we do here at Cockroach Labs. We believe that once you have correctness, then stability and performance will follow. With this in mind, when we began work on backup, we started with consistency.
Broadly speaking, CockroachDB is a SQL database built on top of a consistent, distributed key-value store. Each table is assigned a unique integer id, which is used in the mapping from table data to key-values. The table schema (which we call a TableDescriptor) is stored at a well-known key derived from the table id, and each row in the table is stored at key /<tableid>/<primarykey>. (This is a simplification; the real encoding is much more complicated and efficient than this. For full details, see the Table Data blog post.)
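As a toy illustration of this simplified scheme (the helper below is hypothetical, not CockroachDB's actual encoder):

```python
def table_data_key(table_id: int, primary_key: str) -> str:
    """Build the simplified row key /<tableid>/<primarykey>."""
    return f"/{table_id}/{primary_key}"

# A row with primary key "4" in table 51 lives at key "/51/4",
# and every key for table 51 shares the prefix "/51/".
row_key = table_data_key(51, "4")
```

That shared prefix is what makes "fetch everything in one table" a simple range operation over the key space.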
I’m a big fan of pre-RFC exploratory prototypes, so the first version of backup used the existing Scan primitive to fetch the table schema and to page through the table data (everything with a prefix of /<tableid>). This was easy, quick, and it worked!
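A paging loop of that shape can be sketched as follows; `scan` here is a stand-in callback for the real Scan primitive, not the actual CockroachDB API:

```python
def page_through(scan, start_key: str, end_key: str, page_size: int = 2):
    """Fetch all key/value pairs in [start_key, end_key) by issuing
    repeated limited scans, resuming just past the last key seen."""
    results = []
    cursor = start_key
    while True:
        page = scan(cursor, end_key, page_size)
        if not page:
            return results
        results.extend(page)
        # Resume strictly after the last key returned.
        cursor = page[-1][0] + "\x00"

# A fake in-memory "cluster" to drive the loop.
DATA = {"/51/1": "a", "/51/2": "b", "/51/3": "c", "/52/1": "d"}

def fake_scan(start, end, limit):
    keys = sorted(k for k in DATA if start <= k < end)
    return [(k, DATA[k]) for k in keys[:limit]]

rows = page_through(fake_scan, "/51/", "/52/")  # three rows of table 51
```

Note the end key "/52/" excludes table 52's data: the half-open range covers exactly one table's prefix.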
It also meant the engineering work was now separable: the SQL syntax for BACKUP, the format of the backup files (described below), and the remaining pieces could now be divvied up among the team members.
Unfortunately, the node sending all the Scans was also responsible for writing the entire backup to disk. This was sloooowwww (less than 1 MB/s), and it didn’t scale as the cluster scaled. We built a database to handle petabytes, but this could barely handle gigabytes.
With consistency in hand, the natural next step was to distribute the work.
Step 2: Make it Distributed
We decided early on that backups would output their files to the storage offered by cloud providers (Amazon, Google, Microsoft, private clouds, etc.). So what we needed was a command that was like Scan, except instead of returning the data, it would write it to cloud storage. And so we created Export.

Export is a new transactionally-consistent command that iterates over a range of data and writes it to cloud storage. Because we break up a large table and its secondary indexes into multiple pieces (called “ranges”), the request that is sent gets split up by the kv layer and sent to many nodes. The exported files use LevelDB’s SSTable as the format because it supports efficient seeking (in case we want to query the backup) and because it was already used elsewhere in CockroachDB.
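To picture the fan-out, here is a toy model of splitting a key span at range boundaries (the boundary values are invented; the real kv layer does this with its range metadata):

```python
def split_by_ranges(start_key: str, end_key: str, boundaries):
    """Split [start_key, end_key) at every range boundary that falls
    inside it, so each piece can be exported by the node that holds
    that range."""
    pieces = []
    cursor = start_key
    for boundary in sorted(boundaries):
        if cursor < boundary < end_key:
            pieces.append((cursor, boundary))
            cursor = boundary
    pieces.append((cursor, end_key))
    return pieces

# An Export over table 51 might fan out into three per-range requests,
# each served by whichever node holds that range.
pieces = split_by_ranges("/51/", "/52/", ["/51/m", "/51/t"])
```

Because each piece runs where the data already lives, export throughput grows with the cluster instead of bottlenecking on one node.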
Along with the exported data, a serialized backup descriptor is written with metadata about the backup, a copy of the schema of each included SQL table, and the locations of the exported data files.
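As an illustration only (the real descriptor is a serialized format whose field names are not shown here; these are made up), the shape of that metadata might look like:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TableMeta:
    table_id: int
    schema_sql: str  # simplified stand-in for the real TableDescriptor

@dataclass
class BackupMeta:
    end_time: int                 # timestamp the backup is consistent at
    tables: List[TableMeta] = field(default_factory=list)
    files: List[str] = field(default_factory=list)  # exported data files

backup = BackupMeta(
    end_time=1700000000,
    tables=[TableMeta(51, "CREATE TABLE customers (...)")],
    files=["s3://bucket/backup/51-0001.sst"],
)
```

A restore can then start from this one small manifest and find every schema and data file it needs.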
Once we had a backup system that could scale to clusters with many nodes and lots of data, we had to make it more efficient. It was particularly wasteful (in both CPU and storage) to export the full contents of tables that change infrequently. What we wanted was a way to write only what had changed since the last backup.
Step 3: Make it Incremental
CockroachDB uses MVCC, which means each of the keys I mentioned above actually has a timestamp suffix. Mutations to a key don’t overwrite the current version; they write the same key with a higher timestamp. The old versions of each key are then garbage collected after a configurable period (25 hours by default).
To make an incremental version of our distributed backup, all we needed to do was leverage these MVCC versions. Each backup has an associated timestamp. An incremental backup simply saves any keys that have changed between its timestamp and the timestamp of the previous backup.
We plumbed these time ranges to our new Export command and voilà! Incremental backup.
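In toy form, an incremental pass is just a timestamp filter over MVCC versions (this list-of-tuples model stands in for the real MVCC iterator):

```python
def incremental_versions(versions, prev_ts, backup_ts):
    """Keep only versions written after the previous backup's
    timestamp and at or before this backup's timestamp."""
    return [(k, ts, v) for (k, ts, v) in versions if prev_ts < ts <= backup_ts]

versions = [
    ("/51/4", 10, "v1"),  # already captured by the full backup at ts=10
    ("/51/4", 15, "v2"),  # changed since then: included
    ("/51/7", 5,  "x"),   # unchanged: skipped
]
delta = incremental_versions(versions, prev_ts=10, backup_ts=20)
```

Unchanged keys cost nothing, which is exactly the property that makes incrementals cheap for slowly changing tables.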
One small wrinkle: if a given key (say /<customers>/<4>) is deleted, then 25 hours later, when the old MVCC versions are cleaned out of RocksDB, this deletion (called a tombstone) is also collected. This means incremental backup can’t tell the difference between a key that’s never existed and one that was deleted more than 25 hours ago. As a result, an incremental backup can only run if the most recent backup was fewer than 25 hours ago (though full backups can always be run). The 25 hour period is not right for every user, so it’s configurable using replication zones.
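The resulting safety rule is a simple comparison against the GC window; a sketch using the default 25-hour TTL (the helper name is hypothetical):

```python
GC_TTL_SECONDS = 25 * 60 * 60  # default replication-zone GC TTL

def incremental_allowed(seconds_since_last_backup: float,
                        gc_ttl: float = GC_TTL_SECONDS) -> bool:
    """An incremental backup is only safe while the previous backup
    is newer than the GC window; after that, deletion tombstones may
    already have been garbage collected."""
    return seconds_since_last_backup < gc_ttl

# 24 hours since the last backup: OK. 26 hours: a full backup is needed.
ok = incremental_allowed(24 * 3600)
stale = incremental_allowed(26 * 3600)
```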
Go Forth and Backup
Backup is run via a simple BACKUP SQL command, and with our work to make it consistent first, then distributed and incremental, it turned out blazing fast. We’re getting about 30 MB/s per node, and there’s still lots of low-hanging performance fruit. It’s our first enterprise feature, so head on over to our license page to grab an evaluation license and try it out.
While CockroachDB was built to survive failures and prevent data loss, we want to make sure every team, regardless of size, has the ability to survive any type of disaster. Backup and restore were built for large clusters that absolutely need to minimize downtime, but for smaller clusters, a simpler tool will work just fine. For this, we’ve built cockroach dump, which is available in CockroachDB Core.
We have plans for a number of future projects to build on this foundation: Change Feeds for point-in-time backup and restore, read-only SQL queries over backups, an admin ui page with progress and scheduling, pause/resume/cancel control of running backups, and more.
BACKUP is worth far more with RESTORE (which turned out to be much harder and more technically interesting), and there’s a lot more that didn’t fit in this blog post, so stay tuned.