Consistent, Distributed, Incremental: Pick Three
Almost all widely used database systems include the ability to back up and restore a snapshot of their data. The replicated nature of CockroachDB’s distributed architecture means that the cluster survives the loss of disks or nodes, and yet many users still want to make regular backups. This led us to develop distributed backup and restore, the first feature available in our CockroachDB Enterprise offering.
When we set out to work on this feature, the first thing we did was figure out why customers wanted it. The reasons we discovered included a general sense of security, “Oops I dropped a table”, finding a bug in new code only when it’s deployed, legally required data archiving, and the “extract” phase of an ETL pipeline. So as it turns out, even in a system that was built to never lose your data, backup is still a critical feature for many of our customers.
At the same time, we brainstormed whether CockroachDB’s unique architecture allowed any improvements to the status quo. In the end, we felt it was important that both backup and restore be consistent across nodes (just like our SQL), distributed (so it scales as your data scales), and incremental (to avoid wasting resources).
Additionally, we knew that backups need to keep only a single copy of each piece of data and should impact production traffic as little as possible. You can see the full list of goals and non-goals in the Backup & Restore RFC.
In this post, we’ll focus on backup and how we made it work.
Step 0: Why We Reinvented the Wheel
One strategy for implementing backup is to take a snapshot of the database’s files, which is how a number of other systems work. CockroachDB uses RocksDB as its disk format and RocksDB already has a consistent backup feature, which would let us do consistent backups without any particular filesystem support for snapshots of files. Unfortunately, because CockroachDB does such a good job of balancing and replicating your data evenly across all nodes, there’s not a good way to use RocksDB’s backup feature without saving multiple copies of every piece of data.
Step 1: Make it Consistent
Correctness is the foundation of everything we do here at Cockroach Labs. We believe that once you have correctness, then stability and performance will follow. With this in mind, when we began work on backup, we started with consistency.
Broadly speaking, CockroachDB is a SQL database built on top of a consistent, distributed key-value store. Each table is assigned a unique integer id, which is used in the mapping from table data to key-values. The table schema (which we call a TableDescriptor) is stored at a well-known key derived from the table id, and each row in the table is stored at key /<tableid>/<primarykey>. (This is a simplification; the real encoding is much more complicated and efficient than this. For full details, see the Table Data blog post.)
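As a toy illustration of this simplified scheme (the helper below is hypothetical, not CockroachDB's actual encoder):

```python
def table_data_key(table_id: int, primary_key: str) -> str:
    """Build the simplified row key /<tableid>/<primarykey>."""
    return f"/{table_id}/{primary_key}"

# A row with primary key "4" in table 51 lives at key "/51/4",
# and every key for table 51 shares the prefix "/51/".
row_key = table_data_key(51, "4")
```

That shared prefix is what makes "fetch everything in one table" a simple range operation over the key space.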
I’m a big fan of pre-RFC exploratory prototypes, so the first version of backup used the existing Scan primitive to fetch the table schema and to page through the table data (everything with a prefix of /<tableid>). This was easy, quick, and it worked!
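A paging loop of that shape can be sketched as follows; `scan` here is a stand-in callback for the real Scan primitive, not the actual CockroachDB API:

```python
def page_through(scan, start_key: str, end_key: str, page_size: int = 2):
    """Fetch all key/value pairs in [start_key, end_key) by issuing
    repeated limited scans, resuming just past the last key seen."""
    results = []
    cursor = start_key
    while True:
        page = scan(cursor, end_key, page_size)
        if not page:
            return results
        results.extend(page)
        # Resume strictly after the last key returned.
        cursor = page[-1][0] + "\x00"

# A fake in-memory "cluster" to drive the loop.
DATA = {"/51/1": "a", "/51/2": "b", "/51/3": "c", "/52/1": "d"}

def fake_scan(start, end, limit):
    keys = sorted(k for k in DATA if start <= k < end)
    return [(k, DATA[k]) for k in keys[:limit]]

rows = page_through(fake_scan, "/51/", "/52/")  # three rows of table 51
```

Note the end key "/52/" excludes table 52's data: the half-open range covers exactly one table's prefix.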
It also meant the engineering work was now separable: the SQL syntax for BACKUP, the format of the backup files (described below), and the remaining pieces could now be divvied up among the team members.
Unfortunately, the node sending all the Scans was also responsible for writing the entire backup to disk. This was sloooowwww (less than 1 MB/s), and it didn’t scale as the cluster scaled. We built a database to handle petabytes, but this could barely handle gigabytes.
With consistency in hand, the natural next step was to distribute the work.
Step 2: Make it Distributed
We decided early on that backups would output their files to the storage offered by cloud providers (Amazon, Google, Microsoft, private clouds, etc.). So what we needed was a command that was like Scan, except instead of returning the data, it would write it to cloud storage. And so we created Export.

Export is a new transactionally-consistent command that iterates over a range of data and writes it to cloud storage. Because we break up a large table and its secondary indexes into multiple pieces (called “ranges”), the request that is sent gets split up by the kv layer and sent to many nodes. The exported files use LevelDB’s SSTable as the format because it supports efficient seeking (in case we want to query the backup) and because it was already used elsewhere in CockroachDB.
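To picture the fan-out, here is a toy model of splitting a key span at range boundaries (the boundary values are invented; the real kv layer does this with its range metadata):

```python
def split_by_ranges(start_key: str, end_key: str, boundaries):
    """Split [start_key, end_key) at every range boundary that falls
    inside it, so each piece can be exported by the node that holds
    that range."""
    pieces = []
    cursor = start_key
    for boundary in sorted(boundaries):
        if cursor < boundary < end_key:
            pieces.append((cursor, boundary))
            cursor = boundary
    pieces.append((cursor, end_key))
    return pieces

# An Export over table 51 might fan out into three per-range requests,
# each served by whichever node holds that range.
pieces = split_by_ranges("/51/", "/52/", ["/51/m", "/51/t"])
```

Because each piece runs where the data already lives, export throughput grows with the cluster instead of bottlenecking on one node.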
Along with the exported data, a serialized backup descriptor is written with metadata about the backup, a copy of the schema of each included SQL table, and the locations of the exported data files.
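As an illustration only (the real descriptor is a serialized format whose field names are not shown here; these are made up), the shape of that metadata might look like:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TableMeta:
    table_id: int
    schema_sql: str  # simplified stand-in for the real TableDescriptor

@dataclass
class BackupMeta:
    end_time: int                 # timestamp the backup is consistent at
    tables: List[TableMeta] = field(default_factory=list)
    files: List[str] = field(default_factory=list)  # exported data files

backup = BackupMeta(
    end_time=1700000000,
    tables=[TableMeta(51, "CREATE TABLE customers (...)")],
    files=["s3://bucket/backup/51-0001.sst"],
)
```

A restore can then start from this one small manifest and find every schema and data file it needs.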
Once we had a backup system that could scale to clusters with many nodes and lots of data, we had to make it more efficient. It was particularly wasteful (in both CPU and storage) to export the full contents of tables that change infrequently. What we wanted was a way to write only what had changed since the last backup.
Step 3: Make it Incremental
CockroachDB uses MVCC, which means each of the keys I mentioned above actually has a timestamp suffix. Mutations to a key don’t overwrite the current version; they write the same key with a higher timestamp. The old versions of each key are then garbage collected after a configurable period (25 hours by default).
To make an incremental version of our distributed backup, all we needed to do was leverage these MVCC versions. Each backup has an associated timestamp. An incremental backup simply saves any keys that have changed between its timestamp and the timestamp of the previous backup.
We plumbed these time ranges to our new Export command and voilà! Incremental backup.
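In toy form, an incremental pass is just a timestamp filter over MVCC versions (this list-of-tuples model stands in for the real MVCC iterator):

```python
def incremental_versions(versions, prev_ts, backup_ts):
    """Keep only versions written after the previous backup's
    timestamp and at or before this backup's timestamp."""
    return [(k, ts, v) for (k, ts, v) in versions if prev_ts < ts <= backup_ts]

versions = [
    ("/51/4", 10, "v1"),  # already captured by the full backup at ts=10
    ("/51/4", 15, "v2"),  # changed since then: included
    ("/51/7", 5,  "x"),   # unchanged: skipped
]
delta = incremental_versions(versions, prev_ts=10, backup_ts=20)
```

Unchanged keys cost nothing, which is exactly the property that makes incrementals cheap for slowly changing tables.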
One small wrinkle: if a given key (say /<customers>/<4>) is deleted, then 25 hours later, when the old MVCC versions are cleaned out of RocksDB, this deletion (called a tombstone) is also collected. This means incremental backup can’t tell the difference between a key that’s never existed and one that was deleted more than 25 hours ago. As a result, an incremental backup can only run if the most recent backup was fewer than 25 hours ago (though full backups can always be run). The 25 hour period is not right for every user, so it’s configurable using replication zones.
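The resulting safety rule is a simple comparison against the GC window; a sketch using the default 25-hour TTL (the helper name is hypothetical):

```python
GC_TTL_SECONDS = 25 * 60 * 60  # default replication-zone GC TTL

def incremental_allowed(seconds_since_last_backup: float,
                        gc_ttl: float = GC_TTL_SECONDS) -> bool:
    """An incremental backup is only safe while the previous backup
    is newer than the GC window; after that, deletion tombstones may
    already have been garbage collected."""
    return seconds_since_last_backup < gc_ttl

# 24 hours since the last backup: OK. 26 hours: a full backup is needed.
ok = incremental_allowed(24 * 3600)
stale = incremental_allowed(26 * 3600)
```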
Go Forth and Backup
Backup is run via a simple BACKUP SQL command, and with our work to make it consistent first, then distributed and incremental, it turned out blazing fast. We’re getting about 30 MB/s per node, and there’s still lots of low-hanging performance fruit. It’s our first enterprise feature, so head on over to our license page to grab an evaluation license and try it out.
While CockroachDB was built to survive failures and prevent data loss, we want to make sure every team, regardless of size, has the ability to survive any type of disaster. Backup and restore were built for large clusters that absolutely need to minimize downtime, but for smaller clusters, a simpler tool will work just fine. For this, we’ve built cockroach dump, which is available in CockroachDB Core.
We have plans for a number of future projects to build on this foundation: Change Feeds for point-in-time backup and restore, read-only SQL queries over backups, an admin ui page with progress and scheduling, pause/resume/cancel control of running backups, and more.
BACKUP is worth far more with RESTORE (which turned out to be much harder and more technically interesting), and there’s a lot more that didn’t fit in this blog post, so stay tuned.