We built survivability into the DNA of CockroachDB. And while we had a lot of fun doing so, and are confident that we have built a solution on a firm foundation, we felt a nagging concern: Does CockroachDB really survive? When data is written to the database, will a failure really not end up in data loss? So to assuage those concerns, we adopted a Russian maxim: “Dovorey, no provorey – Trust, but Verify.”
To understand CockroachDB’s survivability promise, you must first understand our key-value store and replication model, as they form the foundation for survivability.
CockroachDB is a SQL database built on top of a distributed consistent key-value store. The entire key-value space is split into multiple contiguous key-value ranges spread across many nodes. CockroachDB uses Raft for consensus-based replicated writes, guaranteeing that a write to any key is durable, or in other words, that it survives. On each write to a key-value range, the range is synchronously replicated from a Raft leader to multiple replicas residing on different nodes. Database writes are replayed on all replicas in the same order as on the leader, guaranteeing consistent replication. This model forms the basis of CockroachDB survivability: if a node dies, a few others have an exact copy of the data that is lost.
While we’d like to trust our survivability model, we went one step further: we added verification. We built a subsystem that periodically verifies that all replicas of a range have identical data, so that when one of them has to step up to replace a lost node, the data served is what is expected.
Every node in the cluster runs an independent consistency checker. The checker scans through all the local replicas in a continuous loop over a 24 hour cycle, and runs a consistency check on each range for which it is the Raft leader:
In the rare circumstance that it finds a replication problem, the checker logs an error or panics. Unfortunately, it is not possible for the system to identify whether the Raft leader or its replica is responsible for a faulty replication, and a replication problem cannot be repaired.
Logging or panicking on detecting an inconsistency problem is great, but how does one debug such problems? To aid the user in debugging, on seeing a checksum mismatch, a second consistency check with an advanced diff setting is triggered on the range. The second consistency check publishes the entire range snapshot from the leader to all replicas, so that they can diff the entire snapshot against their own version and log a diff of the keys that are inconsistent. We were able to use this second consistency check mechanism to debug and fix many problems in replication.
While we had hoped we had built the perfect system, the verifier uncovered a few bugs!
We fixed a number of hairy bugs:
We’ve built a new database that you can trust to survive. With trust comes verification, and we built that into CockroachDB. CockroachDB offers other features like indexes, unique columns, and foreign keys, that you can trust to work properly. We plan on building automatic online verification mechanisms for them, too. We look forward to discussing them in the future.