Living Without Atomic Clocks

Written by Spencer Kimball

The design of CockroachDB is based on Google’s Spanner data storage system. One of the most surprising and inspired facets of Spanner is its use of atomic clocks and GPS clocks to give participating nodes highly accurate wall-time synchronization. The designers of Spanner call this ‘TrueTime’, and it provides a tight bound on clock offset between any two nodes in the system. TrueTime enables high levels of external consistency. As an open source database based on Spanner, our challenge was to provide similar guarantees of external consistency without atomic clocks.

1955: The first atomic clock is developed at NPL.

If someone knows even a little about Spanner, one of the first questions they have is: “You can’t be using atomic clocks if you’re building an open source database; so how the heck does CockroachDB work?”

It’s a very good question.

CockroachDB was designed to work without atomic clocks or GPS clocks. It’s an open source database intended to be run on arbitrary collections of nodes: from physical servers in a corp development cluster to public cloud infrastructure using the flavor-of-the-month virtualization layer. It’d be a showstopper to require an external dependency on specialized hardware for clock synchronization.

So what does CockroachDB do instead? Well, before answering that question, it’ll be helpful to dig a little deeper into why TrueTime was conceived for Spanner.

The importance of time for distributed systems


Synchronized time is a holy grail of sorts for distributed systems research. In essence, it provides a means to absolutely order events, regardless of which distributed node an event originated at. This can be especially useful when performance is at stake, allowing subsets of nodes to make forward progress without regard to the rest of the cluster, while still maintaining global ordering guarantees.

By contrast, systems without tightly synchronized clocks (e.g. atomic clocks, GPS clocks, etc.) that wish to establish a global ordering must communicate with a single source of time on every operation. This is the purpose of the “timestamp oracle” employed by Percolator. A system which provides an absolute ordering of causally related events, regardless of observer, offers a particularly strong guarantee of external consistency known as “linearizability.”


Let’s follow one more tangent and introduce the concept of “serializability.” Most database developers are familiar with serializability as the highest isolation level provided by the ANSI SQL standard. It guarantees that the constituent reads and writes within a transaction occur as though that transaction were given exclusive access to the database for the length of its execution, guaranteeing that no transactions interfere with each other. In other words, no concurrent transaction B is able to read any partially-written state of transaction A or perform writes causing transaction A to read different values for the same key over the course of its execution.

In a non-distributed database, serializability implies linearizability for causally-related transactions because a single node has a monotonically increasing clock (or should, anyway!). If transaction A is committed before starting transaction B, then transaction B can commit only at a later time.

In a distributed database, things can get dicey. It’s easy to see how the ordering of causally-related transactions can be violated if nodes in the system have unsynchronized clocks. Assume there are two nodes, node 1 and node 2, and two transactions, A and B. Node 1 has a clock which is fast by 100ms; node 2 has an accurate clock. Transaction A writes only to node 1 and commits there with t=100ms. An observer sees A commit and starts B 50ms later. Transaction B writes only to node 2 and commits there with t=50ms (remember, node 2 has an accurate clock). Now, any observer reading from the database will see the opposite ordering: B’s writes occur before A’s. ¡No bueno! Note that this can only happen when the two transactions access completely disjoint sets of keys (which implies disjoint sets of nodes).
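The arithmetic of that example can be sketched in a few lines of Go. This is a toy model for illustration only; `node`, `skew`, and `commitTimestamp` are invented names, not CockroachDB code.

```go
package main

import "fmt"

// node models a database node that stamps commits with its own wall clock.
type node struct {
	name string
	skew int64 // clock error in ms relative to true time
}

// commitTimestamp returns the timestamp the node assigns at true time t.
func (n node) commitTimestamp(trueTimeMs int64) int64 {
	return trueTimeMs + n.skew
}

func main() {
	node1 := node{name: "node 1", skew: 100} // fast by 100ms
	node2 := node{name: "node 2", skew: 0}   // accurate

	// Transaction A commits on node 1 at true time 0ms.
	tsA := node1.commitTimestamp(0) // 100ms
	// An observer sees A commit, then starts B 50ms later on node 2.
	tsB := node2.commitTimestamp(50) // 50ms

	fmt.Printf("A committed at t=%dms, B committed at t=%dms\n", tsA, tsB)
	// Although B causally follows A, its timestamp is earlier:
	fmt.Println("ordering inverted:", tsB < tsA) // true
}
```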

While Spanner provides linearizability, CockroachDB’s external consistency guarantee is by default only serializability, though with some features that can help bridge the gap in practice.

How does TrueTime provide linearizability?

OK, back to Spanner and TrueTime. It’s important to keep in mind that TrueTime does not guarantee perfectly synchronized clocks. Rather, TrueTime gives an upper bound for clock offsets between nodes in a cluster. Synchronization hardware helps minimize the upper bound. In Spanner’s case, Google mentions an upper bound of 7ms. That’s pretty tight; by contrast, using NTP for clock synchronization is likely to give an offset of somewhere between 100ms and 250ms.

So how does Spanner use TrueTime to provide linearizability given that there are still inaccuracies between clocks? It’s actually surprisingly simple. It waits. Before a node is allowed to report that a transaction has committed, it must wait 7ms. Because all clocks in the system are within 7ms of each other, waiting 7ms means that no subsequent transaction may commit at an earlier timestamp, even if the earlier transaction was committed on a node with a clock which was fast by the maximum 7ms. Pretty clever.

How important is linearizability in maintaining external consistency?

Stronger guarantees are a good thing, but some are more useful than others. The possibility of reordering commit timestamps for causally related transactions is likely a marginal problem in practice. What could happen is that examining the database at a historical timestamp might yield paradoxical situations where transaction A is not yet visible while transaction B is, even though transaction A is known to have preceded B, as they’re causally related. However, this can only happen if there’s no overlap between the keys read or written during the transactions.

For situations where reordering could be problematic, CockroachDB returns a “causality token,” which is just the maximum timestamp encountered during a transaction. If passed from one actor to the next in a causal chain, the token serves as a minimum timestamp for successive transactions and will guarantee that each has a properly ordered commit timestamp. Of course, this mechanism doesn’t properly order independent causal chains, though imagining a use case where that’s a problem requires creativity.
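A sketch of the token mechanism in Go might look like this. `causalityToken` and `beginTxn` are invented names for illustration, not CockroachDB’s actual API.

```go
package main

import "fmt"

// causalityToken carries the maximum timestamp observed by a committed
// transaction, handed to the next actor in a causal chain.
type causalityToken struct {
	maxTimestamp int64
}

// beginTxn picks a transaction timestamp from the local clock, bumped to
// at least the token's timestamp so causal order is preserved.
func beginTxn(localClock int64, token *causalityToken) int64 {
	ts := localClock
	if token != nil && token.maxTimestamp > ts {
		ts = token.maxTimestamp
	}
	return ts
}

func main() {
	// Transaction A commits on a fast node at t=100 and passes a token on.
	tokenA := &causalityToken{maxTimestamp: 100}

	// Transaction B starts on a node whose clock reads only t=50, but the
	// token forces its timestamp to be ordered after A's.
	tsB := beginTxn(50, tokenA)
	fmt.Println("B's timestamp:", tsB) // 100
}
```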

But there’s a more critical use for TrueTime than ordering transactions. When starting a transaction reading data from multiple nodes, a timestamp must be chosen which is guaranteed to be at least as large as the highest commit time across all nodes. If that’s not true, then the new transaction might fail to read already-committed data – an unacceptable breach of consistency. With TrueTime at your disposal, the solution is easy; simply choose the current TrueTime. Since every already-committed transaction must have committed more than 7ms earlier, the current node’s wall clock must have a time greater than or equal to the most recently committed transaction. Wow, that’s easy and efficient. So what does CockroachDB do?

How CockroachDB chooses transaction timestamps

The short answer is: something harder and less efficient. The longer answer is that CockroachDB must discover an appropriate timestamp for the transaction as it proceeds, in some cases restarting the transaction at a later timestamp.

As mentioned in the prior paragraph, the timestamp we choose for the transaction must be greater than or equal to the max commit timestamp across all nodes we intend to read from. If we knew the nodes which would be read from in advance, we could send a parallel request for the max timestamp from each and use the latest. However, this is clumsy because we usually don’t know the nodes in advance. It’s also inefficient because we would have to wait for the slowest node to respond before starting.

What CockroachDB does instead is actually surprisingly similar to what Spanner does, though with much looser clock synchronization requirements.

While Spanner always waits after writes, CockroachDB sometimes waits before reads.

When CockroachDB starts a transaction, it chooses a timestamp based on the current node’s wall time. It also establishes an upper bound on the current wall time by adding the maximum clock offset for the cluster. As the transaction reads data from various nodes, it proceeds without difficulty so long as it doesn’t encounter a key written after the transaction timestamp but before the upper bound. This interval of timestamps represents a window of uncertainty. Given clock offsets, we can’t say for certain whether a newer value for the key was committed before our transaction started. In such cases, the transaction must be restarted, but this time with the transaction timestamp set to the encountered value’s newer timestamp. Crucially, the upper bound doesn’t change on restart, so the window of uncertainty shrinks. Transactions reading constantly updated data from many nodes might be forced to restart multiple times, though never for longer than the upper bound, nor more than one time per node.

A simple statement of the contrast between Spanner and CockroachDB would be: Spanner always waits on writes for a short interval, whereas CockroachDB sometimes waits on reads for a longer interval. How long is that interval? Well, it depends on how clocks on CockroachDB nodes are being synchronized. Using NTP, it’s likely to be up to 250ms. Not great, but the kind of transaction that would restart for the full interval would have to read constantly updated values across many nodes. In practice, these kinds of use cases exist but are the exception.

Because CockroachDB relies on clock synchronization, nodes run a version of Marzullo’s algorithm amongst themselves to measure maximum clock offset within the cluster. If the configured maximum offset is exceeded by any node, it will commit suicide.
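The self-check could be sketched like this. `checkOffsets` and the 250ms figure are illustrative; the real monitoring (running Marzullo’s algorithm against peers) is more involved.

```go
package main

import "fmt"

// configuredMaxOffsetMs is the assumed cluster-wide bound on clock offset.
const configuredMaxOffsetMs = 250.0

// checkOffsets returns an error if any measured offset against a peer
// exceeds the configured maximum; a node receiving such an error would
// shut down rather than serve reads and writes with a clock outside the
// assumed bound.
func checkOffsets(measuredOffsetsMs []float64) error {
	for _, off := range measuredOffsetsMs {
		if off > configuredMaxOffsetMs || off < -configuredMaxOffsetMs {
			return fmt.Errorf("clock offset %.1fms exceeds maximum %.1fms",
				off, configuredMaxOffsetMs)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkOffsets([]float64{12.5, -30.0}))  // within bound: <nil>
	fmt.Println(checkOffsets([]float64{310.0}) != nil) // true: this node would shut down
}
```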

Concluding thoughts

If you’ve made it this far, thanks for hanging in there. If you’re new to it, this is tricky stuff to grok.

CockroachDB actually has a command line flag (--linearizable) which makes it behave like Spanner: wait for the max clock offset before returning a successful commit. This isn’t practical if you’re using NTP synchronization, but we believe there’s a trend towards more accurate time signals underway. 

A chip-scale atomic clock developed at NIST.

Already, Google could provide TrueTime to its virtual Google Compute Engine (GCE) instances if it chose to. Accurate time signals are currently being distributed for the benefit of deployed Spanner nodes. If Google did do that, it would certainly put Amazon at a disadvantage as more systems like CockroachDB come online. Further, atomic clocks are miniaturizing (refer to the photo of the first-ever atomic clock at the start of this blog post). These days, chip-scale atomic clocks are a reality; putting one on server motherboards would beat the pants off a quartz crystal oscillator.

  • This is a great description. I was awed by the time sync in the Spanner paper and I remember thinking, “huh, well *that’s* one that open-source can’t replicate.” This is a great 80% solution.

  • tobe433

    Great job. I was surprised by Spanner’s idea to use atomic clocks and surprised by Cockroach now.

  • Colin Fraizer

Good description, but I don’t understand something: why use actual time for this instead of something like Lamport’s logical clocks?

    • Spencer Kimball

      There’s a lot of benefit to tying your database timestamps to something which actually does approximate “real world” time. With CockroachDB, you can repeatably do queries at historical timestamps (think for example of running some analytics reports on the state of your customer/order database at the close of the fiscal year), so having real timestamps attached to MVCC values is crucial.

      This is where hybrid logical time is used. Take a look at these two papers for reference:
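      As a rough sketch of the HLC update rules described in those papers (illustrative Go, not CockroachDB’s actual hlc package): timestamps carry a physical wall time plus a logical counter, so they stay close to real time while still capturing causality.

```go
package main

import "fmt"

// hlcTimestamp is a hybrid logical clock reading: a physical component
// (e.g. ms of wall time) plus a logical counter to break ties.
type hlcTimestamp struct {
	wall    int64
	logical int32
}

type hlcClock struct {
	ts hlcTimestamp
}

// now advances the clock for a local or send event, given the physical clock.
func (c *hlcClock) now(physical int64) hlcTimestamp {
	if physical > c.ts.wall {
		c.ts = hlcTimestamp{wall: physical}
	} else {
		c.ts.logical++ // physical clock hasn't moved past us: bump logical
	}
	return c.ts
}

// update incorporates a timestamp received from another node.
func (c *hlcClock) update(physical int64, remote hlcTimestamp) hlcTimestamp {
	switch {
	case physical > c.ts.wall && physical > remote.wall:
		c.ts = hlcTimestamp{wall: physical}
	case remote.wall > c.ts.wall:
		c.ts = hlcTimestamp{wall: remote.wall, logical: remote.logical + 1}
	case c.ts.wall > remote.wall:
		c.ts.logical++
	default: // equal wall times: take the larger logical counter, then bump
		if remote.logical > c.ts.logical {
			c.ts.logical = remote.logical
		}
		c.ts.logical++
	}
	return c.ts
}

func main() {
	c := &hlcClock{}
	fmt.Println(c.now(100))                            // {100 0}
	fmt.Println(c.update(99, hlcTimestamp{wall: 150})) // {150 1}
	fmt.Println(c.now(120))                            // {150 2}
}
```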

  • ImJasonH

GCE is Google Compute Engine, not Google Cloud Enterprise.

    • Jessica Edwards

      Fixed! Thanks for the catch.

  • amitlan

    In “no subsequent transaction may commit at an earlier timestamp”, the “no subsequent transactions” mean those originating from anyone who learns of the committed transaction, right?

  • Philip Haynes

    I like the idea of atomic clocks in the datacenter. Any suggestions on ones you would actually buy?

    • ntpguy

These are two decent starting points that I also used to get set up with a stratum-1 NTP server on a Raspberry Pi Zero and an Adafruit GPS breakout board, using the PPS signal for sync. Total cost <100 USD.

This is what an average day looks like when the GPS has a lock: 90%+ of the time the offset is below 10µs. My laptop, synced over wifi to the rpi, reports between 0.5~5ms offset. Hope this helps; it sounds like you're looking for more of a plug-and-play solution, but I've been quite happy with the performance of this cheap DIY way.

      • Philip Haynes

This looks awesome. Our Raft append entries are about 30µs with ~300µs round-trip times on a LAN. With low-cost GPS NTP options.
