We wrote the book on distributed scale. Literally.Free O'Reilly Book
A few days ago, prompted by a Hacker News post, my friend Ivo texted me saying “Does your head ever explode when you’re thinking about databases and consistency semantics and whatever models? It just sounds like pointless taxonomy stuff. We are <N, K>-serializable whereas QuinoaDB is only ü-serializable”. The answer is yes — my head does explode. I don’t think it’s pointless, though, although I agree that the discussions are generally unproductive.
Separately, the other day a colleague told a user that “CockroachDB implements serializability, not linearizability”. While we say this statement often, and it is the best kind of correct, I don’t like it much because I think it doesn’t do us justice and it’s also not particularly helpful for the users — it doesn’t teach them very much about CockroachDB.
In this post, I’m attempting to present the guarantees that CockroachDB gives and the ones it doesn’t, and offer my preferred marketing slogan summarizing it all.
The first section provides background and some terminology for consistency models to support the following, CockroachDB-specific section. It’s not formal, rigorous or exhaustive (I link to better sources, though) so readers who are familiar with these things might want to skip it and head straight to the section on CockroachDB’s consistency model.
First of all, a brief introduction to what we’re talking about. Databases let many “clients” access data concurrently, and so they need to define the semantics of these concurrent accesses: for example, what happens if two clients read and write the same data “at the same time”. Moreover, distributed and replicated databases generally store multiple copies of the data, usually over a network of machines, and so they need to define what complications can arise from the fact that different machines are involved in serving reads and writes to the same data: e.g. if I tell machine A to write a key, and then immediately after I ask machine B to read it, will machine B return the data that had been just written? Informally speaking, what we’d ideally want from our database is to hide the data distribution and replication from us and to behave as if all transactions were being run one at a time by a single machine. A database that provides this kind of execution is said to implement the “strict serializability” consistency model - that’s the holy grail.
But, of course, we also want our database to be resilient to machine failure, and we want the transactions to execute fast, and we want many transactions to execute at the same time, and we want data for European customers to be served from European servers and not cross an ocean network link. All these requirements generally come in conflict with strict serializability. So then databases start relaxing the strict serializability guarantees, basically compromising on that front to get execution speed and other benefits. These compromises need precise language for explaining them. For example, consider a replicated database and a write operation executed by one of the replicas followed quickly by a read operation served by another one. What are admissible results for this read? Under strict serializability, the answer is clear — only the value of the preceding write is acceptable. Under more relaxed models, more values are allowed in addition to this one. But which values exactly? Is a random value coming out of thin air acceptable? Generally, no. Is the value of some other relatively recent write acceptable? Perhaps. To define things precisely, we need specialized vocabulary that’s used by well studied sets of rules (called “consistency models”).
Historically, both the distributed systems community and the databases community have evolved their own terminology and models for consistency. In more recent years, the communities have joined, driven by the advent of “distributed databases”, and the vocabularies have combined. Things are tricky though, plus different databases try to market themselves the best way they can, and so I think it’s fair to say that there’s a lot of confusion on the topic. I’ve been thinking about these things for a couple of years now in the context of CockroachDB, and I still always struggle to make unequivocal and clear statements on the subject. Additionally, I’ll argue that none of the standard lexicon describes CockroachDB very well. For a more systematic treaty on the different meanings of consistency, see The many faces of consistency and Jepsen’s treatment of the topic.
The databases community has been describing behavior in terms of _transactions_, which are composite operations (made up of SQL queries). Transactions are subject to the ACID properties (Atomicity, Consistency, Isolation, Durability). This community was primarily interested in the behavior of concurrent transactions on a single server, not so much in the interactions with data replication — it was thus initially not concerned by the historical issues around distributed consistency. For our discussion, the Isolation property is the relevant one: we have multiple transactions accessing the same data concurrently and we need them to be isolated from each other. Each one needs to behave, to the greatest extent possible, as if no other transaction was interfering with it. Ironically, the Consistency in ACID refers to a concept that’s only tangentially related to what we’re talking about here — the fact that the database will keep indexes up to date automatically and will enforce foreign key constraints and such.
To describe the possible degrees of transaction isolation, the literature and the ANSI standard enumerates a list of possible “anomalies” (examples of imperfect isolation), and, based on those, defines a couple of standard “isolation levels”: Read Uncommitted, Read Committed, Repeatable Read, Serializable. To give a flavor of what these are about, for example the Repeatable Read isolation level says that once a transaction has read some data, reading it again within the same transaction yields the same results. So, concurrent transactions modifying that data have to somehow not affect the execution of our reading transaction. However, this isolation level allows the Phantom Read anomaly. Basically, if a transaction performs a query asking for rows matching a condition twice, the second execution might return more rows than the first. For example, something like
select * from orders where value > 1000 might return orders (a, b, c) the first time and (a, b, c, d) the second time (which is ironic given Repeatable Read’s name since one might call what just happened a non-repeatable read).
Frankly, the definitions of the ANSI isolation levels are terrible (also see A Critique of ANSI SQL Isolation Levels), arguably with the exception of the Serializable one. They have been defined narrow-mindedly with a couple of database implementations in mind and have not stood the test of time.
The Serializable isolation level, which, as far as the SQL standard is concerned, is the gold standard, doesn’t allow any of the defined anomalies. In plain terms, it states that the database needs to ensure that transactions need to behave as if the transactions executed sequentially, one by one. The definition allows that database to choose the order of transactions in an equivalent sequential execution. This is less than ideal because it allows for the following scenario:
We consider three transactions. The first one is
insert into hacker_news_comments (id, parent_id, text) values (1, NULL, 'a root comment'). The second one is
insert into hacker_news_comments (id, parent_id, text) values (2, 1, 'OP is wrong'). The third one is
select id, text from comments.
(2, 'OP is wrong').
So, Nathan is seeing the response, but not the original post. That’s not good. And yet, it is allowed by the Serializable isolation level and, in fact, likely to occur in many distributed databases (spoiler alert: not in CRDB), assuming the actors were quick to yell at each other and run their transactions. The serial order in which the transactions appear to have executed is 2, 3, 1.
What has happened here is that the actors synchronized with each other outside of the database and expected the database’s ordering of transactions to respect “real time”, but the isolation levels don’t talk about “real time” at all. This seems to not have been a concern for the SQL standardization committee at the time, probably since this kind of thing simply wouldn’t happen if the database software runs entirely on one machine (however many database researchers were thinking about the issues of distributed databases as early as the 70s–for example, see Papadimitriou paper on serializability).
While database people were concerned with transaction isolation, researchers in distributed and parallel systems were concerned with the effects of having multiple copies of data on the system’s operations. In particular, they were concerned with the semantics of “read” and “write” operations on this replicated data. So, the literature evolved a set of operation “consistency levels”, with names like “read your own writes”, “monotonic reads”, “bounded staleness”, “causal consistency”, and “linearizable” which all give guidance about what values a read operation can return under different circumstances. The original two problems in need of solutions were how to resolve concurrent writes to the same logical address from two writers at separate physical locations using local replicas (CPUs on their local cache, NFS clients on their local copy), and when/how a stale copy should be updated (cache invalidation). The spectrum of possible solutions has been explored in different ways by the original communities: designers of memory caches were constrained by much tighter demands of programmers on consistency, whereas networked filesystems were constrained by unreliable networks to err on the side of more availability.
Generally speaking, this evolutionary branch of consistency models doesn’t talk about transactions. Instead, systems are modeled as collections of objects, with each object defining a set of operations it supports. For example, assuming we have a key-value store that provides the operations read(k) and write(k,v), the system obeys the “monotonic reads” model if, once a process reads the value of a key k, any successive read operation on k by that process will always return that same value or a more recent value. In other words, reads by any one process don’t “go backwards”.
There’s two things to note about this model’s definition: first of all, it talks about a “process”, so the system has a notion of different threads of control. Understanding this is a burden; the serializable isolation level we discussed in the databases context did not need such a concept1 — the user of a system did not need to think about what process is performing what operations. Second, this model is quite relaxed in comparison to others. If one process performs a write(“a”, 1) and later another process performs read(“a”) (and there’s no intervening writes to “a”), then the read might not return 1. The monotonic reads model describes various distributed systems where data is replicated asynchronously and multiple replicas can all serve reads.
The gold standard among these models is linearizability. It was formalized by Herlihy and Wing in a delightful paper.
This model aims to describe systems with properties pretty similar to the ones guaranteed for database transactions by the Serializable isolation level. Informally, it says that operations will behave as if they were executed one at a time, and an operation that finished before another one began (according to “real time”) has to execute before the second one. This model, assuming systems can actually implement it efficiently, sounds really good. Let’s definite it more formally.
Usually, linearizability is defined at the level of a single, relatively simple “object” and then expanded to the level of a system comprised of many such objects. So, we have an object that affords a couple of operations, and we want to devise a set of rules for how these operations behave. An operation is modeled as an “invocation” (from a client to the object) followed by a “response” (from the object to the client). We’re talking in a concurrent setting, where many clients are interacting with a single object concurrently. We define a “history” to be a set of invocations and responses.
For example, say our object is a FIFO queue (providing the enqueue/dequeue operations). Then a history might be something like:
client 1: enqueue “foo”
client 1: ok
client 1: dequeue
client 1: ok (“foo”)
client 1: enqueue “bar”
client 2: enqueue “baz”
client 1: ok
client 2: ok
client 1: dequeue
client 1: ok (“baz”)
The first event in this history is an invocation by client 1, the second one is the corresponding response from the queue object. Responses for dequeue operations are annotated with the element they return.
We say that a given history is “sequential” if every invocation is immediately followed by a response. H1 is not sequential since it contains, for example, this interleaving of operations:
client 1: enqueue “bar”
client 2: enqueue “baz”
Sequential histories are easy to reason about and check for validity (e.g. whether or not our FIFO queue is indeed FIFO). Since H1 is not sequential, it’s a bit hard to say whether the last response client 1 got is copacetic. Here’s where we use linearizability: we say that a history H is linearizable if it is equivalent to some valid sequential history H', where H' contains the same events, possibly reorderdered under the constraint that, if a response op1 appears before an invocation op2 in H, then this order is preserved in H'. In other words, a history is linearizable if all the responses are valid according to a sequential reordering that preserves the order of non-overlapping responses.
For example, H1 is in fact linearizable because it’s equivalent to the following sequential history:
client 1: enqueue “foo”
client 1: ok
client 1: dequeue
client 1: ok (“foo”)
client 2: enqueue “baz”
client 2: ok
client 1: enqueue “bar”
client 1: ok
client 1: dequeue
client 1: ok (“baz”)
Now, an object is said to be linearizable if all the histories it produces are linearizable. In other words, no matter how the clients bombard our queue with requests concurrently, the results need to look as if the requests came one by one. If the queue is to claim linearizability, the implementation should use internal locking, or whatever it needs to do, to make this guarantee. Note that this model does not explicitly talk about replication, but the cases where it is of value are primarily systems with replicated state. If our queue is replicated across many machines, and clients talk to all of them for performing operations, “using internal locking” is not trivial but has to somehow be done if we want linearizability.
To raise the level of abstraction, a whole system is said to be linearizable if it can be modeled as a set of linearizable objects. Linearizability has this nice “local” property: it can be composed like that. So, for example, a key-value store that offers point reads and point writes can be modeled as a collection of registers, with each register offering a read and write operation. If the registers individually provide linearizability, then the store as a whole also does.
Two things are of note about the linearizable consistency model:
First, there is a notion of “real time” used implicitly. Everybody is able to look at one clock on the wall so that it can be judged which operation finishes before another operation begins. The order of operations in our linearizable histories has a relation with the time indicated by this mythical clock.
Second, concurrent operations are allowed to execute in any order. For example, in our history H1, the last event might have been
client 1: ok (“bar”) because a serial history where enqueuing baz finishes before enqueuing baz begins would also have been acceptable.
It’s worth reminding ourselves that linearizability does not talk about transactions, so this model by itself is not well suited to be used by SQL databases. I guess one could shoehorn it by saying that the whole database is one object which provides one transaction operation, but then a definition needs to be provided for the functional specifications of this operation. We’re getting back to the ACID properties and the transaction isolation levels, and I’m not sure how the formalism would work exactly.
What the literature does for advancing a database model to incorporate this relationship that linearizability has with time is to incorporate its ideas into the serializable transaction isolation level.
The mentioning of “real time” and the use of a global clock governing a distributed system are fighting words for some of my colleagues. It’s understandable since, on the one hand, Einstein realized that time itself is relative (different observers can perceive events to take place in different orders relative to each other) and, on the other hand, even if we are to ignore relativistic effects for practical purposes, this one true, shared clock doesn’t quite exist in the context of a distributed system. I’m not qualified to discuss relativistic effects beyond acknowledging that there is such a thing as _relativistic linearizability_. I believe the casual database user can ignore them, but I’ll start blabbering if you ask me exactly why.
The fact that there is no shared clock according to which we can decide ordering is a problem all too real for implementers of distributed systems like CockroachDB. The closest we’ve come is a system called TrueTime built by Google, which provides tightly synchronized clocks and bounded errors brought front and center.
As far as the linearizability model is concerned (which assumes that a shared clock exists), the way I think about it is that the model tells us what to expect if such a clock were to exist. Given that it doesn’t quite exist, then clients of the system can’t actually use it to record their histories perfectly: one can’t simply ask all the clients, or all the CockroachDB replicas, to log their operation invocations and responses and timestamp them using the local clocks, and then centralize all the logs and construct a history out of that. This means that verifying a system that claims to be linearizable isn’t trivial. In other words, Herlihy talks about histories but doesn’t describe how one might actually produce these histories in practice. But that doesn’t mean the model is not useful.
What a verifier can do is record certain facts like “I know that this invocation happened after this other invocation, because there was a causal relationship between them”. For certain operations for which there was not a causal relationship, the client might not have accurate enough timestamps to put in the history and so such pairs of events can’t be used to verify whether a history is linearlizable or not. Alternatively, another thing a verifier might do is relay all its operations through a singular “timestamp oracle”, whose recording would then be used to produce and validate a history. Whether such a construct is practical is debatable, though, since the mere act of sequencing all operations would probably introduce enough latency in them as to hide imperfections of the system under test.
As I was saying, the ANSI SQL standard defines the serializable transaction isolation as the highest level, but its definition doesn’t consider phenomena present in distributed databases. It admits transaction behavior that is surprising and undesirable because it doesn’t say anything about how some transactions need to be ordered with respect to the time at which the client executed them.
To cover these gaps, the term “strict serializability” has been introduced for describing (distributed) databases that don’t suffer from these undesirable behaviors.
Strict serializability says that transaction behavior is equivalent to some serial execution, and the serial order of transactions corresponds to real time (i.e. a transaction started after another one finished will be ordered after it). Note that strict serializability (like linearizability) still doesn’t say anything about the relative ordering of concurrent transactions (but, of course, those transaction still need to appear to be “isolated” from each other). We’ll come back to this point in the next sections.
Under strict serializability, the system behavior outlined in the Hacker News posts example from the Serializability section is not permitted. Databases described by the strict serializability model must ensure that the final read, Nathan’s, must return both the root comment and the response. Additionally, the system must ensure that a query like
select * from hacker_news_comments never returns the child comment without the parent, regardless of the the time when the query is executed (i.e. depending on the time when it’s executed, it can return an empty set, the root, or both the root and the child). We’ll come back to this point when discussing CRDB’s guarantees.
Google’s Spanner uses the term “external consistency” instead of “strict serializability”. I like that term because it emphasizes the difference between a system that provides “consistency” for transactions known to the database to be causally related and systems that don’t try to infer causality and offer stronger guarantees (or, at least, that’s how me and my buddies interpret the term). For example, remembering the Hacker News example, there are systems that allow Tobi to explicitly tell the database that his transaction has been “caused” by my transaction, and then the system guarantees that the ordering of the two transaction will respect this. Usually this is done through some sort of “causality tokens” that the actors pass around between them. In contrast, Spanner doesn’t require such cooperation from the client in order to prevent the bad outcome previously described: even if the clients coordinated “externally” to the database (e.g, by yelling across the room), they’ll still get the consistency level they expect.
Peter Bailis has more words on Linearizability, Serializability and Strict Serializability.
Now that we’ve discussed some general concepts, let’s talk about how they apply to CockroachDB. CockroachDB is an open-source, transactional, SQL database and it’s also a distributed system. In my opinion, it comes pretty close to being the Holy Grail of databases: it offers a high degree of “consistency”, it’s very resilient to machine and network failures, it scales well and it performs well. This combination of features already makes it unique enough; the system goes beyond that and brings new concepts that are quite game-changing — good, principled control over data placement and read and write latency versus availability tradeoffs in geographically-distributed clusters. All without ever sacrificing things we informally refer to as “consistency” and “correctness” in common parlance. Also it’s improving every day at a remarkable pace. I’m telling you — you need to try this thing!
But back to the subject at hand — the consistency story. CockroachDB is a complex piece of software; understanding how it all works in detail is not tractable for most users, and indeed it will not even be a good proposition for all the engineers working on it. We therefore need to model it and present a simplified version of reality. The model needs to be as simple as possible and as useful as possible to users, without being misleading (e.g. suggesting that outcomes that one might think are undesirable are not possible when in fact they are). Luckily, because CockroachDB was always developed under a “correctness first” mantra, coming up with such a model is not too hard, as I’ll argue.
There’s a standard disclosure that comes with our software: the system assumes that the clocks on the Cockroach nodes are somewhat synchronized with each other. The clocks are allowed to drift away from each other up to a configured “maximum clock offset” (by default 500ms). Operators need to run NTP or other clock synchronization mechanism on their machines. The system detects when the drift approaches the maximum allowed limit and shuts down some nodes, alerting an operator[^2]. Theoretically, I think more arbitrary failures modes are possible if clocks get unsynchronized quickly. More on the topic in Spencer’s post “Living Without Atomic Clocks."
Back to the consistency. For one, CockroachDB implements the serializable isolation level for transactions, as specified by the SQL standard. In contrast to most other databases which don’t offer this level of isolation as the default (or at all, for crying out loud!), this is the only isolation level we offer; users can’t opt for a lesser one. We, the CockroachDB authors, collectively think that any lower level is just asking for pain. It’s fair to say that it’s generally extremely hard to reason about the other levels and the consequences of using them in an application (see the ACIDRain paper for what can go wrong when using lower isolation levels). I’m not trying to be condescending; up until the 2.1 version we used to offer another relatively high level of isolation as an option (Snapshot Isolation), but it turned out that it (or, at least, our implementation of it) had complex, subtle consequences that even we hadn’t fully realized for the longest time. Thus, we ripped it out and instead improved the performance of the our implementation ensuring serializability as much as possible. Below serializability be dragons.
But simply saying that we’re serializable is selling our system short. We offer more than that. We do not allow the bad outcome in the Hacker News commenting scenario.
CockroachDB doesn’t quite offer strict serializability, but we’re fairly close to it. I’ll spend the rest of the section explaining how exactly we fail strict serializability, what our guarantees actually are, and some gotchas.
If there’s one canned response I wish we’d give users that pop into our chat channels asking about the consistency model, I think it should be “CockroachDB doesn’t allow stale reads”. This should be the start of all further conversations, and in fact I think it will probably preempt many conversations. Stating this addresses a large swath of anomalies that people wonder about (in relation to distributed systems). “No stale reads” means that, once a write transaction committed, every read transaction starting afterwards[^3] will see it.
Internalizing this is important and useful. It does not come by chance; the system works hard for it and so have we, the builders. In the Hacker News comments example, once I have committed my root comment, a new transaction by Nathan is guaranteed to see it. Yes, our system is distributed and data is replicated. Yes, Nathan might be talking to a different node than I was, maybe a node with a clock that’s trailing behind. In fact, the node I was talking to might have even crashed in the meantime. Doesn’t matter. If Nathan is able to read the respective table, he will be able to read my write.
Beyond serializability, saying “no stale reads” smells like linearizability (and, thus, strict serializability) since “staleness” is related to the passing of time. In fact, when people come around asking for linearizability, I conjecture that most will be satisfied by this answer. I think this is what I’d be asking for if I hadn’t educated myself specifically on the topic. Relatedly, this is also what the C(onsistency) in the famous CAP theorem is asking for. And we have it.
So why exactly don’t we claim strict serializability?
Even though CRDB guarantees (say it with me) “no stale reads”, it still can produce transaction histories that are not linearizable.
Consider the history HN2 (assume every statement is its own transaction, for simplicity):
select * from hacker_news_comments. Doesn’t get a response yet.
insert into hacker_news_comments (id, parent_id, text) values (1, NULL, 'a root comment')and commit.
insert into hacker_news_comments (id, parent_id, text) values (2, 1, 'OP is wrong')and commits.
This is the “anomaly” described in Section 2.5 of Jepsen’s analysis of CRDB from back in the day.
So what happened? From Nathan’s perspective, Tobi’s transaction appears to have executed before mine. That contradicts strict serializability since, according to “real time”, Tobi ran his transaction after me. This is how CRDB fails strict serializability; we call this anomaly “causal reverse”.
Before freaking out, let’s analyze the circumstances of the anomaly a bit. Then I’ll explain more technically, for the curious, how such a thing can happen in CRDB.
First of all, let’s restate our motto: if Nathan had have started his transaction after Tobi committed (in particular, if Nathan would have started his transaction because Tobi committed his), he would have seen both rows and things would have been good. An element that’s at play, and in fact is key here, is that Nathan’s transaction was concurrent with both mine and Tobi’s. According to the definition of strict serializability, Nathan’s transaction can be ordered in a bunch of ways with respect to the other two: it can be ordered before both of them, after both of them, or after mine but before Tobi’s. The only thing that’s required is that my transaction is ordered before Tobi’s. The violation of strict serializability that we detected here is not that Nathan’s transaction was mis-ordered, but that mine and Tobi’s (which are not concurrent) appear to have been reordered. Non-strict serializability allows this just fine.
My opinion is that this anomaly is not particularly bad because Nathan was not particularly expecting to see either of the two writes. But if this was my only argument, I’d probably stay silent.
There’s another important thing to explain: both my and Tobi’s transactions are, apart from their timing, unrelated: the sets of data they read and write do not overlap. If they were overlapping (e.g. if Tobi read my comment from the DB before inserting his), then serializability would not allow them to be reordered at all (and so CockroachDB wouldn’t do it and the anomaly goes away). In this particular example, if the schema of the hackernewscomments table would contain a self-referencing foreign key constraint (asking the database to ensure that child comments reference an existing parent), then the “reading” part would have been ensured by the system.
So, for this anomaly to occur, you need three transactions to play. Two of them need to appear to be independent of each other (but not really be, or otherwise we probably wouldn’t have noticed the anomaly) and the third needs to overlap both of them. I’ll let everybody judge for themselves how big of a deal this is. For what it’s worth, I don’t remember hearing a CRDB user complaining about it.
Beyond the theory, there are technical considerations that make producing this anomaly even more unlikely: given CockroachDB’s implementation, the anomaly is avoided not only if the read/write sets of my and Tobi’s transactions overlap, but also if the leadership of any of the ranges of data containing hackernewscomments rows 1 and 2 happens to be on the same node when these transactions occur, or if Nathan’s database client is talking to the same CockroachDB node as Tobi’s, and also in various other situations. Also, the more synchronized the clocks on the three nodes are, the less likely it is. Overall, this anomaly is pretty hard to produce even if you try explicitly.
As you might have guessed, I personally am not particularly concerned about this anomaly. Besides everything I’ve said, I’ll add a whataboutist argument and take the discussion back to friendly territory: consider this anomaly in contrast to the “stale reads” family of anomalies present in many other competing products. All these things are commonly bucketed under strict serializability / linearizability violations, but don’t be fooled into thinking that they’re all just as bad. Our anomaly needs three transactions doing a specific dance resulting in an outcome that, frankly, is not even that bad. A stale read anomaly can be seen much easier in a product that allows it. Examples are many; a colleague gave a compelling one recently: if your bank was using a database that allows stale reads, someone might deposit a check for you, at which point your bank would text you about it, and you’d go online to see your balance. You might see the non-updated balance and freak out. Banks should be using CockroachDB.
I’ve discussed the CockroachDB guarantees and violations of strict serializability. Our discussion used, laxly, the SQL language to illustrate things but the discussion used language and concepts from more theoretical literature. We bridged the gap by implying that SQL statements are really reads and writes used by some models. This section discusses some uses of CockroachDB/SQL that fall a bit outside the models we’ve used, but are surprising nevertheless. I think these examples will not fall nicely into the models used for the strict serializability definition, at least not without some effort into expanding the model.
Consider the following two transactions:
1. insert into foo (id, time) values (1, now()) 2. insert into foo (id, time) values (2, now())
Assuming these two transactions execute in this order, it is possible (and surprising) to read the rows back and see that the time value for row 2 is lower that the one for row one.
Perhaps it’s realistic to think that this happens in other systems too, even single-node systems, if the system clock jumps backwards (as it sometimes does), so perhaps there’s nothing new here.
as of system timequeries and backups
CockroachDB supports the (newer) standard SQL system-versioned tables; CockroachDB lets one “time travel” and query the old state of the database with a query like
select * from foo as of system time now()-10s. This is a fantastic, really powerful feature. But it also provides another way to observe a “causal reverse” anomaly. Say one ran these two distinct transactions, in this order:
1. insert into hacker_news_comments (id, parent_id, text) values (1, NULL, 'a root comment') 2. insert into hacker_news_comments (id, parent_id, text) values (2, 1, 'OP is wrong')
It’s possible for an
as of system time query to be executed later and, if it’s unlucky in its choice of a “system time”, to see the second row and not the first.
Again, if the second transaction were to read the data written by the first (e.g. implicitly through a foreign key check), the anomaly would not be possible.
Relatedly, a backup, taken through the
backup database command, is using
as of system time queries under the hood, and so a particular backup might contain row 2 but not row 1.
The architecture of CockroachDB is based on a separation between multiple layers (a SQL layer on top down to a storage layer at the bottom). For the subject at hand, the interesting layer is the Transaction Layer, which is in charge of making sure that a transaction doesn’t miss writes that it’s supposed to be seeing. Each transaction has a timestamp, assigned by the “gateway node” — the node that a client happens to be talking to — when the transaction starts (through a SQL BEGIN statement). As the transaction talks to different other nodes that might be responsible for _ranges_ of data it wants to read, this timestamp is used to decide what values are visible (because they’ve been written by transactions “in the past”) and which values aren’t visible because they’ve been written “in the future”.
CockroachDB uses multi-version concurrency control (MVCC), which means that the history of each row is available for transactions to look through. The difficulties, with respect to consistency guarantees, stem from the fact that the timestamp recording into MVCC are taken from the clock of the gateway node that wrote it, which generally is not the same one as the gateways assigning transaction timestamp for a reader, and we assume that the clock can be desynchronized up to a limit (we call the phenomenon “clock skew”). So, given transaction timestamp t and value timestamp t', how does one decide whether the value in question should be visible or not?
The rules are that, if t' <= t, then the transaction will see the respective value (and so we’ll essentially order our transaction after that writer). The reasoning is that either our transaction really started after the other one committed, or, if not, the two were concurrent and so we can order things either way.
If t’ > t, then it gets tricky. Did the writer really start and commit before the reader began its transaction, or did it commit earlier than that but t’ was assigned by a clock that’s ahead of ours? What CRDB does is define an “uncertainty interval”: if the values are close enough so that t' could be explained by a trailing clock, we say that we’re unsure about whether the value needs to be visible or not, and our transaction needs to change its timestamp (which, unless we can avoid it, means the transaction might have to restart. Which, unless we can further avoid it, means the client might get a retriable error). This is what allows CockroachDB to guarantee no stale reads. In the Hacker News example, if Nathan starts his transaction after me and Tobi committed ours, the worst that could happen is that he gets a timestamp that’s slightly in the past and has to consider some of our other writes uncertain, at which point he’ll restart at a higher timestamp.
We work quite hard to minimize the effects of this uncertainty interval. For one, transactions keep track of what timestamps they’ve observed at each node and uncertainty is tracked between nodes pair-wise. This, coupled with the fact that a node’s clock is bumped up when someone tries to write on it with a higher timestamp, allows a transaction to not have to restart more than once because of an uncertain value seen on a particular node. Also, overall, once the maximum admissible clock skew elapses since a transaction started, a transaction no longer has any uncertainty.
Separately, when a transaction’s timestamp does need to be bumped, we try to be smart about it. If either the transaction hasn’t read anything before encountering the uncertain value, or if we can verify that there’s been no writes on the data its already read before encountering the uncertainty, then the transaction can be bumped with no fuss. If we can’t verify that, then the transaction needs to restart so it can perform its writes again. If it does have to restart, we don’t necessarily tell the client about it. If we haven’t yet returned any results for the transaction to the client (which is common if the client can send parts of a transaction’s statements as a batch), then we can re-execute all the transaction’s statements on the server-side and the client is none the wiser.
CockroachDB provides a high level of “consistency”, second only to Spanner among distributed databases as far as I know (but then CockroachDB is a more flexible and easy to migrate to database — think ORM support — so I’ll take it over Spanner any day). We offer a relatively easy to understand programming model, although the literature doesn’t give us a good name for it. It stronger than serializability, but somewhat weaker than strict serializability (and than linearizability, although using that term in the context of a transactional system is an abuse of the language). It’s probably easiest to qualify it by understanding the anomaly that it allows — “causal reverse” — and the limited set of circumstances under which it can occur. In the majority of cases where one might be wondering about semantics of reads and writes in CockroachDB, the slogan “no stale reads” should settle most discussions.
If data consistency is your thing, we’re hiring. Check out our careers page.
Although I think the definition of the Serializable isolation level would have benefitted from introducing some notion of different clients. As phrased by the SQL standard, I believe it technically allows empty results to be produced for any read-only transaction with the justification that those transactions are simply ordered before any other transaction. Implementing that would be egregious, though.[^2]: We’re thinking of ways to make CRDB resilient to more arbitrarily unsynchronized clocks.[^3]: As discussed in the “A note on clocks” section, figuring out what “afterwards” means is not always trivial when the clients involved are not on the same machine. But still, sometimes (in the cases that matter most), a transaction is known to happen after another one, usually through a causal relationship between the two. ↩︎
We recently announced general availability (GA) for Serverless, with support for change data capture (CDC), backup and …Read more
Almost a year ago, we wrote about our use of Jepsen in testing CockroachDB. As we prepare for CockroachDB 1.0, we …Read more
I recently gave a talk at KubeCon North America -- “Experience Report: Running a Distributed System …Read more