It’s a fact that the design of CockroachDB is based on Google’s Spanner data storage system. One of the most surprising and inspired facets of Spanner is its use of atomic clocks and GPS clocks to give participating nodes really accurate wall time synchronization. The designers of Spanner call this ‘TrueTime’, and it provides a tight bound on clock offset between any two nodes in the system. TrueTime enables high levels of external consistency. As an open source database based on Spanner, our challenge was in providing similar guarantees of external consistency without atomic clocks.
If someone knows even a little about Spanner, one of the first questions they have is: “You can’t be using atomic clocks if you’re building an open source database; so how the heck does CockroachDB work?”
It’s a very good question.
CockroachDB was designed to work without atomic clocks or GPS clocks. It’s an open source database intended to be run on arbitrary collections of nodes: from physical servers in a corp development cluster to public cloud infrastructure using the flavor-of-the-month virtualization layer. It’d be a showstopper to require an external dependency on specialized hardware for clock synchronization.
So what does CockroachDB do instead? Well, before answering that question, it’ll be helpful to dig a little deeper into why TrueTime was conceived for Spanner.
The importance of time for distributed systems
Synchronized time is a holy grail of sorts for distributed systems research. In essence, it provides a means to absolutely order events, regardless of which distributed node an event originated at. This can be especially useful when performance is at stake, allowing subsets of nodes to make forward progress without regard to the rest of the cluster, while still maintaining global ordering guarantees.
By contrast, systems without tightly synchronized clocks (e.g. atomic clocks, GPS clocks, etc.) that wish to establish a global ordering must communicate with a single source of time on every operation. This is the purpose of the “timestamp oracle” employed by Percolator. A system which provides an absolute ordering of causally related events, regardless of observer, provides for a particularly strong guarantee of external consistency known as “linearizability.”
Let’s follow one more tangent and introduce the concept of “serializability.” Most database developers are familiar with serializability as the highest isolation level provided by the ANSI SQL standard. It guarantees that the constituent reads and writes within a transaction occur as though that transaction were given exclusive access to the database for the length of its execution, guaranteeing that no transactions interfere with each other. In other words, no concurrent transaction B is able to read any partially-written state of transaction A or perform writes causing transaction A to read different values for the same key over the course of its execution.
In a non-distributed database, serializability implies linearizability for causally-related transactions because a single node has a monotonically increasing clock (or should, anyway!). If transaction A is committed before starting transaction B, then transaction B can commit only at a later time.
In a distributed database, things can get dicey. It’s easy to see how the ordering of causally-related transactions can be violated if nodes in the system have unsynchronized clocks. Assume there are two nodes, node 1 and node 2, and two transactions, A and B. Node 1 has a clock which is fast by 100ms; node 2 has an accurate clock. Transaction A writes only to node 1 and commits there with t=100ms. An observer sees A commit and starts B 50ms later. Transaction B writes only to node 2 and commits there with t=50ms (remember node 2 has an accurate clock). Now, any observer reading from the database will see the opposite ordering. B’s writes occur before A’s. ¡No bueno! Note that this can only happen when the two transactions access completely disjoint sets of keys (implies disjoint sets of nodes).
While Spanner provides linearizability, CockroachDB’s external consistency guarantee is by default only serializability, though with some features that can help bridge the gap in practice.
How does TrueTime provide linearizability?
OK, back to Spanner and TrueTime. It’s important to keep in mind that TrueTime does not guarantee perfectly synchronized clocks. Rather, TrueTime gives an upper bound for clock offsets between nodes in a cluster. Synchronization hardware helps minimize the upper bound. In Spanner’s case, Google mentions an upper bound of 7ms. That’s pretty tight; by contrast, using NTP for clock synchronization is likely to give somewhere between 100ms and 250ms.
So how does Spanner use TrueTime to provide linearizability given that there are still inaccuracies between clocks? It’s actually surprisingly simple. It waits. Before a node is allowed to report that a transaction has committed, it must wait 7ms. Because all clocks in the system are within 7ms of each other, waiting 7ms means that no subsequent transaction may commit at an earlier timestamp, even if the earlier transaction was committed on a node with a clock which was fast by the maximum 7ms. Pretty clever.
How important is linearizability in maintaining external consistency?
Stronger guarantees are a good thing, but some are more useful than others. The possibility of reordering commit timestamps for causally related transactions is likely a marginal problem in practice. What could happen is that examining the database at a historical timestamp might yield paradoxical situations where transaction A is not yet visible while transaction B is, even though transaction A is known to have preceded B, as they’re causally related. However, this can only happen if there’s no overlap between the keys read or written during the transactions.
For situations where reordering could be problematic, CockroachDB returns a “causality token,” which is just the maximum timestamp encountered during a transaction. If passed from one actor to the next in a causal chain, the token serves as a minimum timestamp for successive transactions and will guarantee that each has a properly ordered commit timestamp. Of course, this mechanism doesn’t properly order independent causal chains, though imagining a use case where that’s a problem requires creativity.
But there’s a more critical use for TrueTime than ordering transactions. When starting a transaction reading data from multiple nodes, a timestamp must be chosen which is guaranteed to be at least as large as the highest commit time across all nodes. If that’s not true, then the new transaction might fail to read already-committed data – an unacceptable breach of consistency. With TrueTime at your disposal, the solution is easy; simply choose the current TrueTime. Since every already-committed transaction must have committed more than 7ms earlier, the current node’s wall clock must have a time greater than or equal to the most recently committed transaction. Wow, that’s easy and efficient. So what does CockroachDB do?
How CockroachDB chooses transaction timestamps
The short answer is: something harder and less efficient. The longer answer is that CockroachDB must discover an appropriate timestamp for the transaction as it proceeds, in some cases restarting the transaction at a later timestamp.
As mentioned in the prior paragraph, the timestamp we choose for the transaction must be greater or equal to the max commit timestamp across all nodes we intend to read from. If we knew the nodes which would be read from in advance, we could send a parallel request for the max timestamp from each and use the latest. However, this is clumsy because we usually don’t know the nodes in advance. It’s also inefficient because we would have to wait for the slowest node to respond before starting.
What CockroachDB does instead is actually surprisingly similar to what Spanner does, though with much looser clock synchronization requirements.
While Spanner always waits after writes, CockroachDB sometimes waits before reads.
When CockroachDB starts a transaction, it chooses a timestamp based on the current node’s wall time. It also establishes an upper bound on the current wall time by adding the maximum clock offset for the cluster. As the transaction reads data from various nodes, it proceeds without difficulty so long as it doesn’t encounter a key written after the transaction timestamp but before the upper bound. This interval of timestamps represents a window of uncertainty. Given clock offsets, we can’t say for certain whether a newer value for the key was committed before our transaction started. In such cases, the transaction must be restarted, but this time with the transaction timestamp set to the encountered value’s newer timestamp. Crucially, the upper bound doesn’t change on restart, so the window of uncertainty shrinks. Transactions reading constantly updated data from many nodes might be forced to restart multiple times, though never for longer than the upper bound, nor more than one time per node.
A simple statement of the contrast between Spanner and CockroachDB would be: Spanner always waits on writes for a short interval, whereas CockroachDB sometimes waits on reads for a longer interval. How long is that interval? Well it depends on how clocks on CockroachDB nodes are being synchronized. Using NTP, it’s likely to be up to 250ms. Not great, but the kind of transaction that would restart for the full interval would have to read constantly updated values across many nodes. In practice, these kinds of use cases exist but are the exception.
Because CockroachDB relies on clock synchronization, nodes run a version of Marzullo’s algorithm amongst themselves to measure maximum clock offset within the cluster. If the configured maximum offset is exceeded by any node, it will commit suicide.
If you’ve made it this far, thanks for hanging in there. If you’re new to it, this is tricky stuff to grok.
CockroachDB actually has an experimental mode which makes it behave like Spanner: wait for the max clock offset before returning a successful commit. This isn’t practical if you’re using NTP synchronization, but we believe there’s a trend towards more accurate time signals underway.
Chip-scale atomic clocks are a reality; putting one on server motherboards would beat the pants off a quartz crystal oscillator.