Disaster Recovery

CockroachDB is built to be fault-tolerant and to recover automatically, but sometimes disasters happen. A disaster is any event that puts your cluster at risk, and usually means your cluster is experiencing hardware failure, data failure, or has compromised security keys. Having a disaster recovery plan enables you to recover quickly, while limiting the consequences.

Hardware failure

When planning to survive hardware failures, start by determining the minimum replication factor you need based on your fault tolerance goals:

Warning:

Increasing the replication factor can impact write performance in that more replicas must agree to reach quorum. For more details about the mechanics of writes and the Raft protocol, see Read and Writes Overview.

Note:

For the purposes of choosing a replication factor, disk failure is equivalent to node failure.

If you experience hardware failures in a cluster, the recovery actions you need to take will depend on the type of infrastructure and topology pattern used:

Single-region survivability planning

The table below shows the replication factor (RF) needed to achieve the listed fault tolerance goals for a single region, cloud-deployed cluster with nodes spread as evenly as possible across 3 availability zones (AZs):

Note:

See our basic production topology for configuration guidance.

Fault Tolerance Goals 3 nodes 5 nodes 9 nodes
1 Node RF = 3 RF = 3 RF = 3
1 AZ RF = 3 RF = 3 RF = 3
2 Nodes Not possible RF = 5 RF = 5
AZ + Node Not possible Not possible RF = 9
2 AZ Not possible Not possible Not possible

To be able to survive 2+ availability zones failing, scale to a multi-region deployment.

Single-region recovery

For hardware failures in a single-region cluster, the recovery actions vary and depend on the type of infrastructure used.

For example, consider a cloud-deployed CockroachDB cluster with the following setup:

  • Single-region
  • 3 nodes
  • A node in each availability zone (i.e., 3 AZs)
  • Replication factor of 3

The table below describes what actions to take to recover from various hardware failures in this example cluster:

Failure Availability Consequence Action to Take
1 Disk Fewer resources are available. Some data will be under-replicated until the failed nodes are marked dead.

Once marked dead, data is replicated to other nodes and the cluster remains healthy.
Restart the node with a new disk.
1 Node If the node or AZ becomes unavailable, check the Overview dashboard on the DB Console:
1 AZ
2 Nodes X Cluster is unavailable. Restart 1 of the 2 nodes that are down to regain quorum.

If you can’t recover at least 1 node, contact Cockroach Labs support for assistance.
1 AZ + 1 Node X Cluster is unavailable. Restart the node that is down to regain quorum. When the AZ comes back online, try restarting the node.

If you can’t recover at least 1 node, contact Cockroach Labs support for assistance.
2 AZ X Cluster is unavailable. When the AZ comes back online, try restarting at least 1 of the nodes.

You can also contact Cockroach Labs support for assistance.
3 Nodes X Cluster is unavailable. Restart 2 of the 3 nodes that are down to regain quorum.

If you can’t recover 2 of the 3 failed nodes, contact Cockroach Labs support for assistance.
1 Region X Cluster is unavailable.

Potential data loss between last backup and time of outage if the region and nodes did not come back online.
When the region comes back online, try restarting the nodes in the cluster.

If the region does not come back online and nodes are lost or destroyed, try restoring the latest cluster backup into a new cluster.

You can also contact Cockroach Labs support for assistance.
Note:

When using Kubernetes, recovery actions happen automatically in many cases and no action needs to be taken.

Multi-region survivability planning

Tip:

New in v21.1: By default, every multi-region database has a zone-level survival goal associated with it. The survival goal setting provides an abstraction that handles the low-level details of replica placement to ensure your desired fault tolerance. The information below is still useful for legacy deployments.

The table below shows the replication factor (RF) needed to achieve the listed fault tolerance (e.g., survive 1 failed node) for a multi-region, cloud-deployed cluster with 3 availability zones (AZ) per region and one node in each AZ:

Warning:

The chart below describes the CockroachDB default behavior when locality flags are correctly set. It does not use geo-partitioning or a specific topology pattern. For a multi-region cluster in production, we do not recommend using the default behavior, as the cluster's performance will be negatively affected.

Fault Tolerance Goals 3 Regions
(9 Nodes Total)
4 Regions
(12 Nodes Total)
5 Regions
(15 Nodes Total)
1 Node RF = 3 RF = 3 RF = 3
1 AZ RF = 3 RF = 3 RF = 3
1 Region RF = 3 RF = 3 RF = 3
2 Nodes RF = 5 RF = 5 RF = 5
1 Region + 1 Node RF = 9 RF = 7 RF = 5
2 Regions Not possible Not possible RF = 5
2 Regions + 1 Node Not possible Not possible RF = 15

Multi-region recovery

For hardware failures in a multi-region cluster, the actions taken to recover vary and depend on the type of infrastructure used.

For example, consider a cloud-deployed CockroachDB cluster with the following setup:

  • 3 regions
  • 3 AZs per region
  • 9 nodes (1 node per AZ)
  • Replication factor of 3

The table below describes what actions to take to recover from various hardware failures in this example cluster:

Failure Availability Consequence Action to Take
1 Disk Under-replicated data. Fewer resources for workload. Restart the node with a new disk.
1 Node If the node or AZ becomes unavailable check the Overview dashboard on the DB Console:
1 AZ
1 Region Check the Overview dashboard on the DB Console. If nodes are marked Dead, decommission the nodes and add 3 new nodes in a new region. Ensure that locality flags are set correctly upon node startup.
2 or More Regions X Cluster is unavailable.

Potential data loss between last backup and time of outage if the region and nodes did not come back online.
When the regions come back online, try restarting the nodes in the cluster.

If the regions do not come back online and nodes are lost or destroyed, try restoring the latest cluster backup into a new cluster.

You can also contact Cockroach Labs support for assistance.
Note:

When using Kubernetes, recovery actions happen automatically in many cases and no action needs to be taken.

Data failure

When dealing with data failure due to bad actors, rogue applications, or data corruption, domain expertise is required to identify the affected rows and determine how to remedy the situation (e.g., remove the incorrectly inserted rows, insert deleted rows, etc.). However, there are a few actions that you can take for short-term remediation:

Tip:

To give yourself more time to recover and clean up the corrupted data, put your application in “read only” mode and only run AS OF SYSTEM TIME queries from the application.

Run differentials

If you are within the garbage collection window (default is 25 hours), run AS OF SYSTEM TIME queries and use CREATE TABLE AS … SELECT * FROM to create comparison data and run differentials to find the offending rows to fix.

If you are outside of the garbage collection window, you will need to use a backup to run comparisons.

Restore to a point in time

Create a new backup

If your cluster is running, you do not have a backup that encapsulates the time you want to restore to, and the data you want to recover is still in the garbage collection window, there are two actions you can take:

Recover from corrupted data in a database or table

If you have corrupted data in a database or table, restore the object from a prior backup. If revision history is in the backup, you can restore from a point in time.

Instead of dropping the corrupted table or database, we recommend renaming the table or renaming the database so you have historical data to compare to later. If you drop a database, the database cannot be referenced with AS OF SYSTEM TIME queries (see #51380 for more information), and you will need to take a backup that is backdated to the system time when the database still existed.

Note:

If the table you are restoring has foreign keys, careful consideration should be applied to make sure data integrity is maintained during the restore process.

Compromised security keys

CockroachDB maintains a secure environment for your data. However, there are bad actors who may find ways to gain access or expose important security information. In the event that this happens, there are a few things you can do to get ahead of a security issue:

Changefeeds to cloud storage

  1. Cancel the changefeed job immediately and record the high water timestamp for where the changefeed was stopped.
  2. Remove the access keys from the identity management system of your cloud provider and replace with a new set of access keys.
  3. Create a new changefeed with the new access credentials using the last high water timestamp.

Encryption at rest

If you believe the user-defined store keys have been compromised, quickly attempt to rotate your store keys that are being used for your encryption at rest setup. If this key has already been compromised and the store keys were rotated by a bad actor, the cluster should be wiped if possible and restored from a prior backup.

If the compromised keys were not rotated by a bad actor, quickly attempt to rotate the store key by restarting each of the nodes with the old key and the new key. For an example on how to do this, see Encryption.

Once all of the nodes are restarted with the new key, put in a request to revoke the old key from the Certificate Authority.

Note:

CockroachDB does not allow prior store keys to be used again.

Wire Encryption / TLS

As a best practice, keys should be rotated. In the event that keys have been compromised, quickly attempt to rotate your keys. This can include rotating node certificates, client certificates, and the CA certificate.

See also

YesYes NoNo