CockroachDB backups operate as jobs, which are potentially long-running operations that could span multiple SQL sessions. Unlike regular SQL statements, which CockroachDB routes to the optimizer for processing, a
BACKUP statement will move into a job workflow. A backup job has four main phases:
At a high level, CockroachDB performs the following tasks when running a backup job:
- Validates the parameters of the
BACKUPstatement then writes them to a job in the jobs system describing the backup.
- Based on the parameters recorded in the job, and any previous backups found in the storage location, determines which key spans and time ranges of data in the storage layer need to be backed up.
- Instructs various nodes in the cluster to each read those keys and writes the row data to the backup storage location.
- Records any additional metadata about what was backed up to the backup storage location.
The following diagram illustrates the flow from
BACKUP statement through to a complete backup in cloud storage:
Job creation phase
A backup begins by validating the general sense of the proposed backup.
Let's take the following statement to start the backup process:
BACKUP DATABASE movr INTO 's3://bucket' WITH revision_history;
CockroachDB will verify the options passed in the
BACKUP statement and check that the user has the required privileges to take the backup. The tables are identified along with the set of key spans to back up. In this example, the backup will identify all the tables within the
movr database and note that the
revision_history option is required.
The ultimate aim of the job creation phase is to complete all of these checks and write the detail of what the backup job should complete to a job record.
detached backup was requested, the
BACKUP statement is complete as it has created an uncommitted, but otherwise ready-to-run backup job. You'll find the job ID returned as output. Without the
detached option, the job is committed and the statement waits to return the results until the backup job starts, runs (as described in the following sections), and terminates.
Once the job record is committed, the cluster will try to run the backup job even if a client disconnects or the node handling the
BACKUP statement terminates. From this point, the backup is a persisted job that any node in the cluster can take over executing to ensure it runs. The job record will move to the system jobs table, ready for a node to claim it.
Once one of the nodes has claimed the job from the system jobs table, it will take the job record’s information and outline a plan. This node becomes the coordinator. In our example, Node 2 becomes the coordinator and starts to complete the following to prepare and resolve the targets for distributed backup work:
- Test the connection to the storage bucket URL (
- Determine the specific subdirectory for this backup, including if it should be incremental from any discovered existing directories.
- Calculate the keys of the backup data, as well as the time ranges if the backup is incremental.
- Determine the leaseholder nodes for the keys to back up.
- Provide a plan to the nodes that will execute the data export (typically the leaseholder node).
To map out the storage location's directory where the nodes will write the data, the coordinator identifies the type of backup. This determines the name of the new (or edited) directory to store the backup files in. For example, if there is an existing full backup already exists in the target storage location, the next backup will be incremental and therefore will be appended to the full backup after any existing incremental layers discovered in it.
To restrict the execution of the job to nodes that match a specific locality filter, you can set the
EXECUTION LOCALITY option. For detail on how this option affects the process of a backup job and an example, refer to Take Locality-restricted backups.
For more information on how CockroachDB structures backups in storage, refer to Backup collections.
Key and time range resolution
In this part of the resolution phase, the coordinator will calculate all the necessary spans of keys and their time ranges that the cluster needs to export for this backup. It divides the key spans based on which node is the leaseholder of the range for that key span. Every node has a SQL processor on it to process the backup plan that the coordinator will pass to it. Typically, it is the backup SQL processor on the leaseholder node for the key span that will complete the export work.
Each of the node's backup SQL processors are responsible for:
- Asking the storage layer for the content of each key span.
- Receiving the content from the storage layer.
- Writing it to the backup storage location or locality-specific location (whichever locality best matches the node).
Since any node in a cluster can become the coordinator and all nodes could be responsible for exporting data during a backup, it is necessary that all nodes can connect to the backup storage location.
Once the coordinator has provided a plan to each of the backup SQL processors that specifies the backup data, the distributed export of the backup data begins.
In the following diagram, Node 2 and Node 3 contain the leaseholders for the R1 and R2 ranges. Therefore, in this example backup job, the backup data will be exported from these nodes to the specified storage location.
While processing, the nodes emit progress data that tracks their backup work to the coordinator. In the diagram, Node 3 will send progress data to Node 2. The coordinator node will then aggregate the progress data into checkpoint files in the storage bucket. The checkpoint files provide a marker for the backup to resume after a retryable state, such as when it has been paused.
Metadata writing phase
Once each of the nodes have fully completed their data export work, the coordinator will conclude the backup job by writing the backup metadata files. In the diagram, Node 2 is exporting the backup data for R1 as that range's leaseholder, but this node also exports the backup's metadata as the coordinator.
The backup metadata files describe everything a backup contains. That is, all the information a restore job will need to complete successfully. A backup without metadata files would indicate that the backup did not complete properly and would not be restorable.
With the full backup complete, the specified storage location will contain the backup data and its metadata ready for a potential restore. After subsequent backups of the
movr database to this storage location, CockroachDB will create a backup collection. Refer to Backup collections for information on how CockroachDB structures a collection of multiple backups.
Backup jobs with locality
CockroachDB supports two backup features that use a node's locality to determine how a backup job runs or where the backup data is stored. This section provides a technical overview of how the backup job process works for each of these backup features:
- Locality-aware backup: Partition and store backup data in a way that is optimized for locality. This means that nodes write backup data to the cloud storage bucket that is closest to the node's locality.
- Locality-restricted backup execution: Specify a set of locality filters for a backup job in order to restrict the nodes that can participate in the backup process to that locality. This ensures that the backup job is executed by nodes that meet certain requirements, such as being located in a specific region or having access to a certain storage bucket.
Job coordination and export of locality-aware backups
When you create a locality-aware backup job, any node in the cluster can claim the backup job. A successful locality-aware backup job requires that each node in the cluster has access to each storage location. This is because any node in the cluster can claim the job and become the coordinator node. Once each node informs the coordinator node that it has completed exporting the row data, the coordinator will start to write metadata, which involves writing to each locality bucket a partial manifest recording what row data was written to that storage bucket.
Every node backing up a range will back up to the storage bucket that most closely matches that node's locality. The backup job attempts to back up ranges through nodes matching the range's locality. As a result, there is no guarantee that all ranges will be backed up to their locality's storage bucket. For additional detail on locality-aware backups in the context of a CockroachDB Serverless cluster, refer to Job coordination on Serverless clusters.
The node exporting the row data, and the leaseholder of the range being backed up, are usually the same. However, these nodes can differ when lease transfers have occurred during the execution of the backup. In this case, the leaseholder node returns the files to the node exporting the backup data (usually a local transfer), which then writes the file to the external storage location with a locality that usually matches its own localities (with an overall preference for more specific values in the locality hierarchy). If there is no match, the
default locality is used.
For example, in the following diagram there is a three-node cluster split across three regions. The leaseholders write the ranges to be backed up to the external storage in the same region. As Nodes 1 and 3 complete their work, they send updates to the coordinator node (Node 2). The coordinator will then write the partial manifest files containing metadata about the backup work completed on each external storage location, which is stored with the backup SST files written to that storage location.
During a restore job, the job creation statement will need access to each of the storage locations to read the metadata files in order to complete a successful restore.
Job coordination on Serverless clusters
CockroachDB Serverless clusters operate with a different architecture compared to CockroachDB Self-Hosted and CockroachDB Dedicated clusters. These architectural differences have implications for how locality-aware backups can run. Serverless clusters will scale resources depending on whether they are actively in use, which means that it is less likely to have a SQL pod available in every locality. As a result, your Serverless cluster may not have a SQL pod in the locality where the data resides, which can lead to the cluster uploading that data to a storage bucket in a locality where you do have active SQL pods. You should consider this as you plan a backup strategy that must comply with data domiciling requirements.
Job coordination using the
EXECUTION LOCALITY option
When you start or resume a backup with
EXECUTION LOCALITY, the backup job must determine the coordinating node for the job. If a node that does not match the locality filter is the first node to claim the job, the node is responsible for finding a node that does match the filter and transferring the execution to it. This transfer can result in a short delay in starting or resuming a backup job that has execution locality requirements.
If you create a backup job on a gateway node with a locality filter that does not meet the filter requirement in
EXECUTION LOCALITY, and the job does not use the
DETACHED option, the job will return an error indicating that it moved execution to another node. This error is returned because when you create a job without the
DETACHED option, the job execution must run to completion by the gateway node while it is still attached to the SQL client to return the result.
Once the coordinating node is determined, it will assign chunks of row data to eligible nodes, and each node reads its assigned row data and backs it up. The coordinator will assign row data only to those nodes that match the backup job's locality filter in full. The following situations could occur:
- If the leaseholder for part of the row data matches the filter, the coordinator will assign it the matching row data to process.
- If the leaseholder does not match the locality filter, the coordinator will select a node from the eligible nodes with a preference for those with localities that are closest to the leaseholder.
When the coordinator assigns row data to a node matching the locality filter to back up, that node will read from the closest replica. If the node is the leaseholder, or is itself a replica, it can read from itself. In the scenario where no replicas are available in the region of the assigned node, it may then read from a replica in a different region. As a result, you may want to consider placing replicas, including potentially non-voting replicas that will have less impact on read latency, in the locality or region you plan on pinning for backup job execution.
- CockroachDB's general Architecture Overview
- Storage layer
- Take Full and Incremental Backups