One of the headline features of CockroachDB is its full support for ACID transactions across arbitrary keys in a distributed database. CockroachDB transactions apply a set of operations to the database while maintaining some key properties: Atomicity, Consistency, Isolation, and Durability (ACID). In this post, we’ll be focusing on how CockroachDB enables atomic transactions without using locks.
Atomicity can be defined as:
For a group of database operations, either all of the operations are applied or none of them are applied.
Without atomicity, a transaction that is interrupted may only write a portion of the changes it intended to make; this may leave your database in an inconsistent state.
The strategy CockroachDB uses to provide atomic transactions follows these basic steps:
To begin a transaction, a writer first needs to create a transaction record. The transaction record is used by CockroachDB to provide the switch in our overall strategy.
Each transaction record has the following fields:
The writer generates a transaction record with a new UUID in the
PENDING state. The writer then uses a special CockroachDB command
BeginTransaction() to store the transaction record. The record is co-located (i.e. on the same nodes in the distributed system) with the key in the transaction record.
Because the record is stored at a single cockroach key, operations on it are strictly ordered (by a combination of raft and our underlying storage engine). The state of the transaction is the “on/off” state of switch, with states of
ABORTED representing “off,” and
COMMITTED representing “on.” The transaction record thus meets the requirements for our switch.
Note that the transaction state can move from
PENDING to either
COMMITTED, but cannot change in any other way (i.e.
COMMITTED are permanent states).
To stage the changes in a transaction, CockroachDB uses a structure called a write intent. Any time a value is written to a key as part of a transaction, it is written as a write intent.
This write intent structure contains the value that will be written if the transaction succeeds.
The write intent also contains the key where the transaction record is stored. This is crucial: If a reader encounters a write intent, it uses this key value to locate the transaction record (the switch).
As a final rule, there can only be a single write intent on any key. If there were multiple concurrent transactions, it would be possible for one transaction to try to write to a key which has an active intent from another transaction on it. However, transaction concurrency is a complicated topic which we will cover in a later blog post (on transaction isolation); for now, we will assume that there is only one transaction at a time, and that an existing write intent must be from an abandoned transaction.
When writing to a key which already has a write intent:
ABORTEDstate if it is still in the
PENDINGstate. If the earlier transaction was
ABORTED, do nothing.
When reading a key, we must follow principle 3 of our overall strategy and consult the value of any switch before returning a value.
If the key contains a plain value (i.e. not a write intent), the reader is assured that there is no transaction in progress that involves this key, and that it contains the most recent committed value. The value is thus returned verbatim.
However, if the reader encounters a write intent, it means that a previous transaction was abandoned at some point before removing the intent (remember: we are assuming that there is only one transaction at a time). The reader needs to check the state of the transaction’s switch (the transaction record) before proceeding.
ABORTEDstate if it is still in the
COMMITTED, the cleanup operation will have upgraded the staged value to the plain value; otherwise, this will return the original value of the key before the transaction.
To commit the transaction, the transaction record is updated to a state of
All write intents written by the transaction are immediately valid: any future reads which encounters a write intent for this transaction will filter through the transaction record, see that it is committed, and return the value that was staged in the intent.
A transaction can be aborted by updating the state of the transaction record to
ABORTED. At this point, the transaction is permanently aborted and future reads will ignore write intents created by this transaction.
The system above already provides the property of atomic commits; however, the filtering step is expensive, because it requires writes across the distributed system to filter through a central location (the transaction record). This is undesirable behavior for a distributed system.
Therefore, after a transaction is completed, we remove the write intents it created as soon as possible: if a key has a plain value without a write intent, read operations do not need to be filtered and thus complete in a properly distributed fashion.
The cleanup operation can be called on a write intent when the associated transaction is no longer pending. It follows these simple steps:
ABORTED, the write intent is removed.
COMMITTED, the write intent’s staged value is converted into the plain value of the key, and then the write intent is removed.
Cleanup is performed in the following cases:
By aggressively cleaning up expired write intents through multiple avenues, the necessary performance impact of filtering is minimized.
With that, we have covered CockroachDB’s basic strategy for ensuring the atomicity of its distributed, lockless transactions.
But there’s more to the story than just what I’ve covered here. CockroachDB supports concurrent transactions which may write to overlapping sets of keys. Allowing overlapping, concurrent transactions is the “I” in ACID, which guarantees isolation. We’ll cover the details of how we accomplish isolation in a future post. Stay tuned!
Are distributed transactions your jam? Our engineering team is hiring! Check out our open positions here.