Trust, but verify: How CockroachDB checks replication

We built survivability into the DNA of CockroachDB. And while we had a lot of fun doing so, and are confident that we have built a solution on a firm foundation, we felt a nagging concern: Does CockroachDB really survive? When data is written to the database, will a failure really not end up in data loss? So to assuage those concerns, we adopted a Russian maxim: “Dovorey, no provorey – Trust, but Verify.”

To understand CockroachDB’s survivability promise, you must first understand our key-value store and replication model, as they form the foundation for survivability.

CockroachDB is a SQL database built on top of a distributed consistent key-value store. The entire key-value space is split into multiple contiguous key-value ranges spread across many nodes. CockroachDB uses Raft for consensus-based replicated writes, guaranteeing that a write to any key is durable, or in other words, that it survives. On each write to a key-value range, the range is synchronously replicated from a Raft leader to multiple replicas residing on different nodes. Database writes are replayed on all replicas in the same order as on the leader, guaranteeing consistent replication. This model forms the basis of CockroachDB survivability: if a node dies, a few others have an exact copy of the data that is lost.

While we’d like to trust our survivability model, we went one step further: we added verification. We built a subsystem that periodically verifies that all replicas of a range have identical data, so that when one of them has to step up to replace a lost node, the data served is what is expected.

How CockroachDB Checks Replication

Every node in the cluster runs an independent consistency checker. The checker scans through all the local replicas in a continuous loop over a 24 hour cycle, and runs a consistency check on each range for which it is the Raft leader:

The leader and replicas agree on the snapshot of the data to verify and assign a unique ID to it.
A SHA-512 based checksum on the selected snapshot is computed in parallel on the leader and all replicas.
The leader shares its checksum and the unique snapshot ID with all replicas. A replica compares the supplied checksum with the one it has computed on its own. On seeing a different checksum, it declares a failure in replication.

In the rare circumstance that it finds a replication problem, the checker logs an error or panics. Unfortunately, it is not possible for the system to identify whether the Raft leader or its replica is responsible for a faulty replication, and a replication problem cannot be repaired.

Logging or panicking on detecting an inconsistency problem is great, but how does one debug such problems? To aid the user in debugging, on seeing a checksum mismatch, a second consistency check with an advanced diff setting is triggered on the range. The second consistency check publishes the entire range snapshot from the leader to all replicas, so that they can diff the entire snapshot against their own version and log a diff of the keys that are inconsistent. We were able to use this second consistency check mechanism to debug and fix many problems in replication.

While we had hoped we had built the perfect system, the verifier uncovered a few bugs!

Bugs and Fixes

We fixed a number of hairy bugs:

The same data can look different:
The order of entries in protocol buffer maps is undefined in the standard. For some cockroach internal data, we were using protocol buffer maps and replicating it via Raft. All the replicas would separately convert the protocol buffers into a wire format while writing them to disk. Since the wire format is non-deterministic in the standard and implementation, we saw the same data with a different wire encoding on different replicas (fixed via gogo/protobuf#156).
Backdoor writes on the replicas (#5090):
We collect statistics on every range in the system and replicate them. Writes to the statistics are allowed only on the leader, with replicas simply replaying the operations. We were occasionally updating the statistics on the replicas outside of Raft consensus (fixed via#5133).
Internal time series data was being merged incorrectly:
The CockroachDB UI keeps track of monitoring time series data which is replicated. The replicas merge the time series data when read, but an occasional legitimate read would creep in on the leader, causing a bug in the merge functionality (fixed via#5515).
Floating point addition might be non-deterministic:
While fixing the above time series issue, we got super paranoid and as a defensive measure decided to not depend on the replica replay of floating point addition. We were aware of floating point addition being non-associative, and although we knew our floating point additions were being replayed in a definite order and didn’t depend on the associativity property, we adhered to the mantra, “only the paranoid survive,” and got rid of them (fixed via#5905).

Conclusion

We’ve built a new database that you can trust to survive. With trust comes verification, and we built that into CockroachDB. CockroachDB offers other features like indexes, unique columns, and foreign keys, that you can trust to work properly. We plan on building automatic online verification mechanisms for them, too. We look forward to discussing them in the future.