Prevention > Recovery: Physical cluster replication

When you are an enterprise application owner, whether architect or operator, regional survivability is at the top of your list of must-have database features.

Meanwhile, CockroachDB is purpose-built to handle regional failures with zero RPO and near-zero RTO by replicating data across nodes, availability zones, regions, or clouds while consistently serving traffic from each location. Sounds like the perfect match, right?

Maybe not quite perfect. If, like many large orgs, your application was built around a legacy failover strategy, your environment may have only two regions instead of three. You could still move to CockroachDB, but with only two regions your application couldn’t benefit from the full resilience features of our database. Until now, that is, with the arrival of CockroachDB v23.2 and physical cluster replication.

What is physical cluster replication and how does it work?

So what is physical cluster replication? Also known as PCR, it’s an asynchronous, byte-level, and consistent way to keep a standby CockroachDB cluster up to date with a primary cluster.

Physical cluster replication works by creating an exact, constantly updating replica of a primary cluster, while providing granular observability and control over data replication and cutover times. Data integrity checks ensure your recovery will support your business continuity goals. Because this replication is asynchronous, you can significantly improve your resilience without incurring a cross-region write latency hit (which can push certain applications beyond their latency budgets).
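
To make the asynchronous part concrete, here is a minimal toy model in Python. It is not CockroachDB code: the `Primary` and `Standby` classes, the `pull` method, and the integer timestamps are all hypothetical names invented for illustration. The sketch only shows the general idea that writes are acknowledged without waiting on the standby, and that the gap between the two clusters is the data at risk if the primary is lost.

```python
from dataclasses import dataclass, field


@dataclass
class Primary:
    """Toy primary cluster: an append-only log of (timestamp, key, value) writes."""
    log: list = field(default_factory=list)

    def write(self, ts, key, value):
        # The write is acknowledged once it is in the primary's log.
        # It never waits on the standby, which is what "asynchronous" means here.
        self.log.append((ts, key, value))


@dataclass
class Standby:
    """Toy standby cluster that applies the primary's log in commit order."""
    data: dict = field(default_factory=dict)
    applied: int = 0        # number of log entries applied so far
    replicated_ts: int = 0  # timestamp up to which the standby is consistent

    def pull(self, primary, max_entries):
        # Apply at most max_entries new entries, preserving their order.
        batch = primary.log[self.applied:self.applied + max_entries]
        for ts, key, value in batch:
            self.data[key] = value
            self.replicated_ts = ts
            self.applied += 1


def replication_lag(primary, standby):
    """Lag in timestamp units: the window of writes the standby has not seen yet."""
    latest = primary.log[-1][0] if primary.log else 0
    return latest - standby.replicated_ts


primary, standby = Primary(), Standby()
for ts in range(1, 6):
    primary.write(ts, f"k{ts}", ts * 10)

standby.pull(primary, max_entries=3)
print(replication_lag(primary, standby))  # 2: two writes not yet on the standby

standby.pull(primary, max_entries=3)
print(replication_lag(primary, standby))  # 0: standby has caught up
```

In this toy model, a cutover at any moment would land on the standby's `replicated_ts`, so a smaller lag means less data to recover afterward.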

Physical cluster replication is extremely flexible: it works across different hardware configurations, and it can extend to multiple standby clusters and even multiple clouds.

When would you use physical cluster replication?

There are several scenarios that benefit from physical cluster replication.

First, let’s say your application is built around having two data centers, with no access to the cloud or to a third data center. Previously, your application could certainly run on CockroachDB and still benefit from built-in horizontal scalability, ACID transactions, zero-downtime schema changes, and online database updates. Not to mention the simplification that comes with having a single logical database across your two DCs. Now, in version 23.2, physical cluster replication unlocks CockroachDB’s full multi-region functionality for you. (In an ideal world, however, you would have access to a third data center, because then you can survive different failure modes without having to do any kind of disaster recovery.)

Even customers already running CockroachDB across three or more regions may still want to use physical cluster replication as part of a defense-in-depth approach, guarding against the human errors that can accidentally take down an entire cluster (believe me, it happens).

Physical cluster replication vs traditional backup and restore

In a traditional backup and restore model, you back up everything and then restore it into a clean environment. This takes a while: hours, or even days, depending on how large your dataset is.

With physical cluster replication, on the other hand, all that data has already been steadily replicating to the standby cluster. At the highest level, the physical cluster replication process involves creating two clusters, starting replication, handling failover and cutover, and potentially backfilling missing data. (Have we mentioned that this all happens in a highly automated fashion?)

As a result, physical cluster replication offers significantly lower RPO and RTO compared to traditional backup and restore methods.
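
A rough back-of-the-envelope comparison shows why. The sketch below is plain arithmetic in Python, and every input is a hypothetical number chosen only to make the calculation concrete; none of these figures are measurements of CockroachDB. It assumes restore time scales with dataset size, that worst-case RPO for periodic backups is the backup interval, and that worst-case RPO for replication is the replication lag.

```python
def restore_rto_hours(dataset_gb: float, restore_gb_per_hour: float) -> float:
    """Time to load an entire dataset into a clean cluster, in hours."""
    return dataset_gb / restore_gb_per_hour


# Hypothetical inputs (assumptions, not benchmarks).
DATASET_GB = 10_000          # size of the dataset to restore
RESTORE_GB_PER_HOUR = 500    # sustained restore throughput
BACKUP_INTERVAL_S = 3600     # one backup per hour
REPLICATION_LAG_S = 30       # how far the standby trails the primary
CUTOVER_MINUTES = 5          # time to promote the standby

backup_rto_h = restore_rto_hours(DATASET_GB, RESTORE_GB_PER_HOUR)

# Backup and restore: RTO grows with dataset size, RPO with backup interval.
print(f"backup/restore: RTO ~{backup_rto_h:.0f} h, worst-case RPO {BACKUP_INTERVAL_S} s")

# Replication: the data is already on the standby, so RTO is the cutover time
# and RPO is bounded by the replication lag.
print(f"replication:    RTO ~{CUTOVER_MINUTES} min, worst-case RPO {REPLICATION_LAG_S} s")
```

The point of the exercise is the shape of the result rather than the specific numbers: restore time grows with the dataset, while cutover time does not depend on moving the data again.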

Why is physical cluster replication important?

It comes down to taking a disaster recovery approach vs. a disaster prevention approach.

Physical cluster replication is important because, until the advent of distributed SQL for cloud native applications, the common architecture pattern for resilience was to run on two cloud regions (or two physical data centers, or one cloud region and one physical data center). If one of those regions becomes unavailable, parts of your application, or even the entire thing, are unavailable to users until backup and restore is complete. This traditional way to architect for resilience is a disaster recovery approach.

Now that physical cluster replication with fast cutback is available in CockroachDB 24.1, architects and operators who design and manage applications in enterprise environments with limited regional distribution can still ensure high resilience with CockroachDB. After all, CRDB is built to survive regional outages in an automated and self-healing manner: this is the very essence of disaster prevention.

When your architecture is designed around a disaster prevention approach, disaster recovery becomes kind of beside the point.

Watch physical cluster replication in action

You can see physical cluster replication in action for yourself in this video. Principal Engineer and CockroachDB Technical Evangelist Rob Reid puts PCR through its paces in a demo of CockroachDB clusters self-healing through each step of the process: creating two clusters, starting replication, handling failover and cutover, and backfilling any data that has gone missing.