CockroachDB was designed to create the source-available database we would want to use: one that is both scalable and consistent. Developers often have questions about how we've achieved this, and this guide sets out to detail the inner workings of the
cockroach process as a means of explanation.
However, you definitely do not need to understand the underlying architecture to use CockroachDB. These pages give serious users and database enthusiasts a high-level framework to explain what's happening under the hood.
If these docs interest you, consider taking the free Intro to Distributed SQL course on Cockroach University.
Using this guide
This guide is broken out into pages detailing each layer of CockroachDB. We recommended reading through the layers sequentially, starting with this overview and then proceeding to the SQL layer.
If you're looking for a high-level understanding of CockroachDB, you can read the Overview section of each layer. For more technical detail — for example, if you're interested in contributing to the project — you should read the Components sections as well.
This guide details how CockroachDB is built, but does not explain how to build an application using CockroachDB. For more information about how to develop applications that use CockroachDB, check out our Developer Guide.
Goals of CockroachDB
CockroachDB was designed to meet the following goals:
- Make life easier for humans. This means being low-touch and highly automated for operators and simple to reason about for developers.
- Offer industry-leading consistency, even on massively scaled deployments. This means enabling distributed transactions, as well as removing the pain of eventual consistency issues and stale reads.
- Create an always-on database that accepts reads and writes on all nodes without generating conflicts.
- Allow flexible deployment in any environment, without tying you to any platform or vendor.
- Support familiar tools for working with relational data (i.e., SQL).
CockroachDB starts running on machines with two commands:
cockroach startwith a
--joinflag for all of the initial nodes in the cluster, so the process knows all of the other machines it can communicate with.
cockroach initto perform a one-time initialization of the cluster.
Once the CockroachDB cluster is initialized, developers interact with CockroachDB through a PostgreSQL-compatible SQL API. Thanks to the symmetrical behavior of all nodes in a cluster, you can send SQL requests to any node; this makes CockroachDB easy to integrate with load balancers.
After receiving SQL remote procedure calls (RPCs), nodes convert them into key-value (KV) operations that work with our distributed, transactional key-value store.
As these RPCs start filling your cluster with data, CockroachDB starts algorithmically distributing your data among the nodes of the cluster, breaking the data up into chunks that we call ranges. Each range is replicated to at least 3 nodes by default to ensure survivability. This ensures that if any nodes go down, you still have copies of the data which can be used for:
- Continuing to serve reads and writes.
- Consistently replicating the data to other nodes.
If a node receives a read or write request it cannot directly serve, it finds the node that can handle the request, and communicates with that node. This means you do not need to know where in the cluster a specific portion of your data is stored; CockroachDB tracks it for you, and enables symmetric read/write behavior from each node.
Any changes made to the data in a range rely on a consensus algorithm to ensure that the majority of the range's replicas agree to commit the change. This is how CockroachDB achieves the industry-leading isolation guarantees that allow it to provide your application with consistent reads and writes, regardless of which node you communicate with.
Ultimately, data is written to and read from disk using an efficient storage engine, which is able to keep track of the data's timestamp. This has the benefit of letting us support the SQL standard
AS OF SYSTEM TIME clause, letting you find historical data for a period of time.
While the high-level overview above gives you a notion of what CockroachDB does, looking at how CockroachDB operates at each of these layers will give you much greater understanding of our architecture.
At the highest level, CockroachDB converts clients' SQL statements into key-value (KV) data, which is distributed among nodes and written to disk. CockroachDB's architecture is manifested as a number of layers, each of which interacts with the layers directly above and below it as relatively opaque services.
The following pages describe the function each layer performs, while mostly ignoring the details of other layers. This description is true to the experience of the layers themselves, which generally treat the other layers as black-box APIs. There are some interactions that occur between layers that require an understanding of each layer's function to understand the entire process.
|SQL||1||Translate client SQL queries to KV operations.|
|Transactional||2||Allow atomic changes to multiple KV entries.|
|Distribution||3||Present replicated KV ranges as a single entity.|
|Replication||4||Consistently and synchronously replicate KV ranges across many nodes. This layer also enables consistent reads using a consensus algorithm.|
|Storage||5||Read and write KV data on disk.|
The requirement that a transaction must change affected data only in allowed ways. CockroachDB uses "consistency" in both the sense of ACID semantics and the CAP theorem, albeit less formally than either definition.
The degree to which a transaction may be affected by other transactions running at the same time. CockroachDB provides the
SERIALIZABLE isolation level, which is the highest possible and guarantees that every committed transaction has the same result as if each transaction were run one at a time.
The process of reaching agreement on whether a transaction is committed or aborted. CockroachDB uses the Raft consensus protocol. In CockroachDB, when a range receives a write, a quorum of nodes containing replicas of the range acknowledge the write. This means your data is safely stored and a majority of nodes agree on the database's current state, even if some of the nodes are offline.
When a write does not achieve consensus, forward progress halts to maintain consistency within the cluster.
The process of creating and distributing copies of data, as well as ensuring that those copies remain consistent. CockroachDB requires all writes to propagate to a quorum of copies of the data before being considered committed. This ensures the consistency of your data.
A set of operations performed on a database that satisfy the requirements of ACID semantics. This is a crucial feature for a consistent system to ensure developers can trust the data in their database. For more information about how transactions work in CockroachDB, see Transaction Layer.
A state of conflict that occurs when:
- A transaction is unable to complete due to another concurrent or recent transaction attempting to write to the same data. This is also called lock contention.
- A transaction is automatically retried because it could not be placed into a serializable ordering among all of the currently executing transactions. This is also called a serializability conflict. If the automatic retry is not possible or fails, a transaction retry error is emitted to the client, requiring the client application to retry the transaction.
Steps should be taken to reduce transaction contention in the first place.
A consensus-based notion of high availability that lets each node in the cluster handle reads and writes for a subset of the stored data (on a per-range basis). This is in contrast to active-passive replication, in which the active node receives 100% of request traffic, and active-active replication, in which all nodes accept requests but typically cannot guarantee that reads are both up-to-date and fast.
A SQL user is an identity capable of executing SQL statements and performing other cluster actions against CockroachDB clusters. SQL users must authenticate with an option permitted on the cluster (username/password, single sign-on (SSO), or certificate). Note that a SQL/cluster user is distinct from a CockroachDB Cloud organization user.
CockroachDB architecture terms
A group of interconnected CockroachDB nodes that function as a single distributed SQL database server. Nodes collaboratively organize transactions, and rebalance workload and data storage to optimize performance and fault-tolerance.
Each cluster has its own authorization hierarchy, meaning that users and roles must be defined on that specific cluster.
A CockroachDB cluster can be run in CockroachDB Cloud, within a customer Organization, or can be self-hosted.
An individual instance of CockroachDB. One or more nodes form a cluster.
CockroachDB stores all user data (tables, indexes, etc.) and almost all system data in a sorted map of key-value pairs. This keyspace is divided into contiguous chunks called ranges, such that every key is found in one range.
From a SQL perspective, a table and its secondary indexes initially map to a single range, where each key-value pair in the range represents a single row in the table (also called the primary index because the table is sorted by the primary key) or a single row in a secondary index. As soon as the size of a range reaches the default range size, it is split into two ranges. This process continues for these new ranges as the table and its indexes continue growing.
A copy of a range stored on a node. By default, there are three replicas of each range on different nodes.
For most types of tables and queries, the leaseholder is the only replica that can serve consistent reads (reads that return "the latest" data).
The consensus protocol employed in CockroachDB that ensures that your data is safely stored on multiple nodes and that those nodes agree on the current state even if some of them are temporarily disconnected.
For each range, the replica that is the "leader" for write requests. The leader uses the Raft protocol to ensure that a majority of replicas (the leader and enough followers) agree, based on their Raft logs, before committing the write. The Raft leader is almost always the same replica as the leaseholder.
A time-ordered log of writes to a range that its replicas have agreed on. This log exists on-disk with each replica and is the range's source of truth for consistent replication.
Start by learning about what CockroachDB does with your SQL statements at the SQL layer.