A version of this blog post was originally published on May 1, 2017 and has been modified to provide the newest information available.
With the recent 1.0 release, CockroachDB is now a production-ready database. 1.0 showcases the core capabilities of CockroachDB, while also offering users improved performance and stability with a cloud-native architecture that flexibly supports all manner of cloud deployments. It encompasses the core features that allow our users to run CockroachDB successfully in production. Now that the dust has settled on our 1.0 release, I wanted to share how we defined our target use case and dive into the actual product features that support running that use case in production.
Ask any one of our engineers, and they’ll say that CockroachDB is the database they wish they had before joining Cockroach Labs. Many of them are keenly familiar with the horror stories of sharding SQL databases or cleaning up after NoSQL databases that take haphazard approaches to consistency and integrity.
The underlying issue is that existing databases are outdated. They can’t keep up with users who access data more frequently from different locations and across multiple platforms. They also require complex configuration and setup, making it difficult to take full advantage of the flexibility and power of the cloud. Instead of supporting business growth and innovation, databases often become a source of developer friction and technical debt. We knew we wanted to build a better database that addressed these issues, but we also wanted to make sure that we were addressing the needs of the larger community. That meant spending time getting to know our early users.
We started out by conducting user interviews across a variety of seniority levels and industries to understand how users make database decisions. Given that databases support mission-critical applications, developers and operators prefer to work with familiar products even if they require manual workarounds and incur high operational costs. Further, the cost of switching databases is high - there is the immediate cost of reworking existing applications and the hidden cost of developing institutional knowledge to build on and maintain a new database.
Being a better alternative to existing market solutions would not be enough. Instead, we needed to enable a new use case where the value proposition of CockroachDB was so compelling that developers and operators would build their most mission critical applications on top of it.
We homed in on distributed OLTP as our core use case. OLTP (Online Transaction Processing) covers workloads that are transactional in nature, including financial transaction processing, inventory management, e-commerce, and a wide variety of web applications. OLTP databases usually support mission critical applications that drive direct business success. Companies need their OLTP databases to grow as they grow while keeping their data safe and available at all times. To achieve this, companies end up “distributing” their databases to span more than one server, often sacrificing transactions in the process.Deployments that span multiple datacenters get even more complicated, with complicated asynchronous replication setups that require significant overhead to deploy and maintain.
CockroachDB enables distributed OLTP by offering the strong consistency of SQL in combination with the scalability of NoSQL. This allows businesses to scale their OLTP databases without having to make application-level changes even as their database grows across multiple datacenters.
However, the above discussion fails to cover a larger benefit that CockroachDB offers: the ability to build a cloud-native architecture that properly leverages the cloud. CockroachDB come with no single points of failure, the ability to survive disasters, automatable operations, horizontal scalability, and no vendor lock-in. With CockroachDB, users can add machines to their clusters and watch as CockroachDB automatically rebalances data across nodes. Replication also comes built-in, so that even if nodes fail, other nodes can pick up the slack, continuing to return correct and consistent data. CockroachDB also works across multiple datacenters, so customers can expect their data to stay safe even in the face of a datacenter-wide failure.
Having decided on our target use case, we needed to define what it meant to be production ready. For us, this meant building out the core differentiating features that would support a cloud-native distributed OLTP deployment, coupled with stability, performance, and usability.
The two core features we settled on that were vital to distributed OLTP use cases were distributed SQL and multi-active availability. The SQL interface offers a powerful and familiar API that developers can use to define their transactional workloads and comes with a robust community and existing ecosystem of tools. CockroachDB supports a large fraction of the standard SQL footprint, including secondary indexes, foreign keys, JOINs, and aggregations. CockroachDB also supports distributed ACID transactions and query execution. To application developers, a CockroachDB cluster appears as a single logical database. To operators, a CockroachDB cluster can grow with data storage and throughput needs by simply adding symmetric CockroachDB nodes.
In addition to supporting the SQL API, we also wanted to support a distributed deployment of CockroachDB in a manner that maintains strong consistency, fault tolerance, and scalability. That leads us into our second core feature, multi-active availability.
Multi-active availability is an evolution from primary-secondary and active-active setups to a deployment which uses consensus-based replication to consistently replicate amongst three or more replicas, all actively serving R/W client traffic. Multi-active availability allows client traffic to be dynamically load balanced to utilize available replicas, instead of relying on fragile failover mechanisms. Because it employs strongly-consistent replication, multi-active availability avoids stale reads and the need for conflict resolution. Replication is strongly consistent at the range level with each range taking part in a Raft consensus group. A typical CockroachDB deployment across three datacenters can automatically scale and rebalance as capacity is added and survive datacenter level disasters.
With our 1.0 release, we wanted our users to not only be able to test and utilize our core features, but also feel comfortable running us in production supporting operational workloads with minimal downtime. We run multiple CockroachDB clusters across numerous cloud deployments under various chaos situations to ensure stability and availability. We also run rigorous testing suites before any code is committed and engaged Kyle Kingsbury to run his Jepsen tests against CockroachDB to guarantee data integrity.
While running us in production, our users can also expect reduced downtime with support for no downtime rolling upgrades and certificate rotations. Our CockroachDB Enterprise backup and restore offering allows users to run periodic and incremental backups efficiently, while restoring in a distributed fashion.
As with any database utilizing consensus-based replication, CockroachDB queries incur network latency when achieving quorum on reads and writes. We have done significant work to improve query performance with read leases, parallelized SQL execution, and optimizations under high contention scenarios. For users looking to deploy us across multiple datacenters, load-based lease rebalancing allows ranges that receive the most traffic to serve faster reads.
In order to enable distributed SQL, CockroachDB uses a sorted mechanism with data broken up into ranges instead of a hashing mechanism to distribute data. This allows us to do efficient scans and support typical SQL functionalities. We also built a distributed SQL layer that processes read queries in a distributed fashion to achieve a significant increase in query speeds.
In addition to stability and performance, we wanted to make it easier for developers and operators to adopt CockroachDB. We leverage the PostgreSQL wire protocol to tap into the existing and robust PostgreSQL ecosystem. Developers can use existing client drivers and ORMs to interface with CockroachDB, to the extent that they are compatible with CockroachDB. CockroachDB is also very flexible and configurable, making it easy for operators to meet modern availability requirements. Operators can specify zone configurations at a cluster, database, and table level. They can also deploy CockroachDB across clouds or with popular container orchestration tools.
CockroachDB is still a relatively young database and there are several caveats to keep in mind for 1.0. First of all, although CockroachDB uses the PostgreSQL wire protocol, it doesn’t support all of PostgreSQL’s functionality - particularly PostgreSQL-specific extensions. This means that not all tools that work with PostgreSQL will work with CockroachDB. Second, a distributed system comes with network latency. Certain data models and workloads come with a higher cost. Thirdly, there are still many optimizations to be made at the SQL layer for faster queries, so CockroachDB in its current form is a better fit for transactional workloads versus analytical workloads. As with any 1.0 release, it is important to first test specific workloads against CockroachDb before jumping into a production environment.
We are big fans of the open source community and there are several channels available to users for help. Our Gitter channel, forum, and GitHub are actively monitored by our engineering team. Please engage with us with any issues and suggestions you have.
CockroachDB 1.0 is just the first step towards building a cloud-native SQL database. We will continue improving, optimizing, and testing our database internally, and hope to hear from our community running us in production as well.
Moving forward, we have several exciting new features that will only continue to make running databases at a global scale easier for everyone. This includes improved query performance, more integrations with deployment tooling, and more powerful deployment configurations. We expect our coverage of use cases to only expand with time. We will be sharing our upcoming 1.1 roadmap soon, and look forward to receiving your feedback.