How CockroachDB does distributed, atomic transactions

Editor's Note - April 23, 2021: This article was written in 2015 when CockroachDB was pre-beta. The product has evolved significantly since then. We will be updating this post to reflect the current status of CockroachDB. In the meantime, the transaction section of the Architecture Document provides a more current description of CockroachDB's transaction model.

---------------

One of the headline features of CockroachDB is its full support for ACID transactions across arbitrary keys in a distributed database. CockroachDB transactions apply a set of operations to the database while maintaining some key properties: Atomicity, Consistency, Isolation, and Durability (ACID). In this post, we’ll be focusing on how CockroachDB enables atomic transactions without using locks.

Atomicity can be defined as:

For a group of database operations, either all of the operations are applied or none of them are applied.

Without atomicity, a transaction that is interrupted may only write a portion of the changes it intended to make; this may leave your database in an inconsistent state.

Strategy

The strategy CockroachDB uses to provide atomic transactions follows these basic steps:

Switch: Before modifying the value of any key, the transaction creates a switch, which is a writeable value distinct from any of the real values being changed in the batch. The switch cannot be concurrently accessed – reads and writes of the switch are strictly ordered. The switch is initially “off,” and it can be switched to “on.”
Stage: The writer prepares several changes to the database, but does not overwrite any existing values; the new values are instead staged in proximity to the original values.
Filter: For any key with a staged value, reads for that key must check the state of the transaction’s switch before returning a value. If the switch is “off,” the reader returns the original value of the key. If the switch is “on,” the reader returns the staged value. Thus, all reads of a key with a staged value are filtered through the switch’s state.
Flip: When the writer has prepared all changes in the transaction, the writer flips the switch to the “on” position. In combination with the filtering, all values staged as part of the transaction are immediately returned by any future reads.
Unstage: Once a transaction is completed (either aborted or committed), the staged values are cleaned up as soon as possible. If the transaction succeeded, then the original values are replaced by the staged values; on failure, the staged values are discarded. Note that unstaging is done asynchronously and does not need to have finished before the transaction is considered committed.

The Detailed Transaction Process

Switch: CockroachDB Transaction Record

To begin a transaction, a writer first needs to create a transaction record. The transaction record is used by CockroachDB to provide the switch in our overall strategy.

Each transaction record has the following fields:

A Unique ID (UUID) which identifies the transaction.
A current state of PENDING, ABORTED, or COMMITTED.
A cockroach K/V key. This determines where the “switch” is located in the distributed data store.

The writer generates a transaction record with a new UUID in the PENDING state. The writer then uses a special CockroachDB command BeginTransaction() to store the transaction record. The record is co-located (i.e. on the same nodes in the distributed system) with the key in the transaction record.

Because the record is stored at a single cockroach key, operations on it are strictly ordered (by a combination of raft and our underlying storage engine). The state of the transaction is the “on/off” state of switch, with states of PENDING or ABORTED representing “off,” and COMMITTED representing “on.” The transaction record thus meets the requirements for our switch.

Note that the transaction state can move from PENDING to either ABORTED or COMMITTED, but cannot change in any other way (i.e. ABORTED and COMMITTED are permanent states).

Stage: Write Intents

To stage the changes in a transaction, CockroachDB uses a structure called a write intent. Any time a value is written to a key as part of a transaction, it is written as a write intent.

This write intent structure contains the value that will be written if the transaction succeeds.

The write intent also contains the key where the transaction record is stored. This is crucial: If a reader encounters a write intent, it uses this key value to locate the transaction record (the switch).

As a final rule, there can only be a single write intent on any key. If there were multiple concurrent transactions, it would be possible for one transaction to try to write to a key which has an active intent from another transaction on it. However, transaction concurrency is a complicated topic which we will cover in a later blog post (on transaction isolation); for now, we will assume that there is only one transaction at a time, and that an existing write intent must be from an abandoned transaction.

When writing to a key which already has a write intent:

Move the transaction record for the existing intent to the ABORTED state if it is still in the PENDING state. If the earlier transaction was COMMITTED or ABORTED, do nothing.
Clean up the existing intent from the earlier transaction, which will remove the intent.
Add a new intent for the concurrent transaction.

Filter: Reading an Intent

When reading a key, we must follow principle 3 of our overall strategy and consult the value of any switch before returning a value.

If the key contains a plain value (i.e. not a write intent), the reader is assured that there is no transaction in progress that involves this key, and that it contains the most recent committed value. The value is thus returned verbatim.

However, if the reader encounters a write intent, it means that a previous transaction was abandoned at some point before removing the intent (remember: we are assuming that there is only one transaction at a time). The reader needs to check the state of the transaction’s switch (the transaction record) before proceeding.

Move the transaction record for the existing intent to the ABORTED state if it is still in the PENDING state.
Clean up the existing intent from the earlier transaction, which will remove the intent.
Return the plain value for the key. If the earlier transaction was COMMITTED, the cleanup operation will have upgraded the staged value to the plain value; otherwise, this will return the original value of the key before the transaction.

Flip: Commit the Transaction

To commit the transaction, the transaction record is updated to a state of COMMITTED.

All write intents written by the transaction are immediately valid: any future reads which encounters a write intent for this transaction will filter through the transaction record, see that it is committed, and return the value that was staged in the intent.

Aborting a Transaction

A transaction can be aborted by updating the state of the transaction record to ABORTED. At this point, the transaction is permanently aborted and future reads will ignore write intents created by this transaction.

Unstage: Cleaning up Intents

The system above already provides the property of atomic commits; however, the filtering step is expensive, because it requires writes across the distributed system to filter through a central location (the transaction record). This is undesirable behavior for a distributed system.

Therefore, after a transaction is completed, we remove the write intents it created as soon as possible: if a key has a plain value without a write intent, read operations do not need to be filtered and thus complete in a properly distributed fashion.

Cleanup Operation

The cleanup operation can be called on a write intent when the associated transaction is no longer pending. It follows these simple steps:

If the transaction is ABORTED, the write intent is removed.
If the transaction is COMMITTED, the write intent’s staged value is converted into the plain value of the key, and then the write intent is removed.
The cleanup operation is idempotent; that is, if two processes try to clean up an intent for the same key and transaction, the second operation will be a no-op.

Cleanup is performed in the following cases:

After a writer commits or aborts a transaction, it attempts to clean up every intent it wrote immediately.
When a write encounters another write intent from an earlier transaction.
When a read encounters a write intent from an earlier transaction.

By aggressively cleaning up expired write intents through multiple avenues, the necessary performance impact of filtering is minimized.

Wrap Up

With that, we have covered CockroachDB’s basic strategy for ensuring the atomicity of its distributed, lockless transactions.

But there’s more to the story than just what I’ve covered here. CockroachDB supports concurrent transactions which may write to overlapping sets of keys. Allowing overlapping, concurrent transactions is the “I” in ACID, which guarantees isolation. We’ll cover the details of how we accomplish isolation in a future post. Stay tuned!

Are distributed transactions your jam? Our engineering team is hiring! Check out our open positions here.