How we stress test and benchmark CockroachDB for global scale

Scale where others fail. That’s the CockroachDB tagline, but it’s more than just words to us. We built CockroachDB to be a reliable, performant, highly available database for mission-critical workloads at global scale. But is it? How do we know what the database is actually capable of?

Of course, we get valuable feedback and information from global-scale CockroachDB customers like DoorDash, Netflix, and a number of Fortune 50 banks. But we also keep close tabs on things ourselves by regularly stress-testing and benchmarking CockroachDB.

We are often asked what testing we do. The short answer is: a lot. We run many different synthetic benchmarks, e.g. TPC-C, TPC-E, KV, and we also test higher complexity workloads that are a closer representation of our clients’ workloads.

This blog post presents an overview of what we do regularly. We’ll cover the benchmarks we run daily, weekly, and for every major release, as well as give some examples of ad hoc tests we’ve run as well as tests we’ve run in partnership with customers. We’ll also share some of the results (with the caveat that these numbers are true as of this writing – performance may change and improve over time).

Common benchmarks we run

As mentioned above, we do perform high-complexity testing that’s designed to closely mimic customer workloads. We’ll discuss those in more depth later in this post, but first we should define several of the benchmarks we run regularly:

TPC-C is an industry standard, highly scalable transactional database throughput benchmark. TPC-C is modeled after an order entry use case and involves a mix of five concurrent transactions of different types and complexity either executed on-line or queued for deferred execution. TPC-C is measured in transactions per minute (tpmC).

TPC-E is similar to TPC-C but is modeled after a brokerage firm use case with customers generating transactions related to trades, account inquiries, and market research.

KV is our simple key-value workload, akin to YCSB, reading and writing keys chosen at random, allowing for several supported (key) distributions, batch sizes, percent of reads vs. writes, concurrency and more.

With those definitions out of the way, here’s a summary of our regular benchmark cadence:

Daily benchmarks

We run hundreds of benchmarks daily, and an exhaustive list is outside the scope of this post. However, here we’ll focus on TPC-C/TPC-E, as they are very representative of the workloads we see from our large customers in verticals like finance and ecommerce:

TPC-C: 3 Nodes, 16 VCPU, 3500 Warehouses, 1TB. This test verifies that the number of transactions per second the cluster can sustain, which is around 3,000 tnx/s (GCE) and 3,500 txn/s (AWS).
TPC-E: 10 Nodes, 8 VCPU, 8TB. This test verifies the average throughput per node for the restore. On AWS it’s able to achieve 80 MB/Second/Node. Total Restore Time = 8TB/80MB/10 ~2.8 Hours.

Weekly benchmarks

We also run larger weekly TPC-C and TPC-E benchmarks:

TPC-C: 12 Nodes, 16 VCPU, 11500 Warehouses, 3.6TB. This test verifies that the number of transactions per second a larger cluster can sustain, which is around 10,000 tnx/s (GCE).
TPC-E: 15 Nodes, 16 VCPU, 32TB Replicated Restore Test. This test verifies the average throughput per node for the restore on this larger cluster. On AWS it’s able to achieve 60 MB/Second/Node. Total Restore Time = 32TB/60MB/15 ~9.8 Hours. On GCE it’s 50 MB/Second/Node. Total Restore Time = 32TB/50MB/15 ~11.9 Hours.

Per-release benchmarks

Before every major release of CockroachDB, we run a number of large-scale tests to verify the performance and limits of CockroachDB, as well as test the new capabilities of the system at scale. There is a lot of variance in these tests, as they focus on different aspects of the system depending on what the new release contains.

Below, we’ll detail several of the tests we’ve run for the 23.1 and 23.2 releases as an example of what release testing looks like:

23.1 Long-running cluster testing

We performed roughly one month of ad-hoc testing against a GCE n2-standard-8 (8 VCPU), 9 node single-region, multiple AZs (us-east1-b, us-east1-c, us-east1-d) cluster that was continuously running TPC-E while being concurrently subjected to typical cluster operations (backup/restore, changefeeds, upgrades, downgrades, mixed version cluster state, decommissioning, schema changes – create index, add/remove column).

The focus of the testing was to ensure functional correctness and validate expected behavior.

23.2 Long-running cluster testing

We performed about six months of ad-hoc testing against 15 node, 1TB pd-ssd clusters (both multi-region in CockroachDB Dedicated and single-region self-hosted) that were continuously running KV and TPC-C while being concurrently subjected to typical cluster operations (backup/restore, changefeeds, physical replication, upgrades, mixed version cluster state, rolling restarts, schema changes).

The focus of the testing was to ensure functional correctness, that the system behaved as expected, and to identify and drill down on latency issues.

23.1 Large-scale cluster testing

We tested a customer-inspired workload for three months on an AWS Dedicated, 21 node, 3 region cluster (7 nodes in each of eu-central-1, eu-west-1, eu-west-2) with 16vCPU, 64GiB RAM, and 5000 GiB disk per node. Data was 2.5 TB unreplicated x5 = 12.5 TB replicated (Actual: 3.8TiB, 1.297B rows) consisting of product ID + JSON data up to 5k bytes/row and uniformly distributed by PID across the cluster. The workload was mostly point reads up to 1 Million/sec with a small number of concurrent insertions (2k/day).

The focus of the testing was to verify resiliency, scalability, and performance throughput/stability when regional network failures, schema changes, changefeeds, backup/restore, and other conditions/operations were applied.

23.2 Large-scale cluster testing

We tested various KV benchmark workloads up to 100k/qps for 3+ months on a 50 node cluster in AWS (us-west-2) with 16vCPU, 64GiB RAM and 5000 GiB disk per node. The focus of the testing was to verify resiliency, scalability, and performance throughput/stability when various features were used/applied at scale. These included expiration-based leases, changefeeds, UI/observability, 10 TB node densities, multi-store, schema (index) changes, row-level TTL, upgrades (from 23.1), rolling restarts, backup/restore, column-level encryption, and at-rest encryption.

23.1 TPC-C 81 node benchmark testing

We performed our standard large scale TPC-C test with 81 c5d.9xlarge nodes (36 VCPU), 140,000 warehouses, 1.7 million transactions per minute (tpmC) and 11.2 TB of data. The focus of the testing was on verifying raw performance, price/performance, and performance predictability scales as expected as the cluster grows.

Ad hoc testing

In addition to all of the above tests, which are performed on a regular cadence, we also undertake a significant amount of ad hoc testing to better understand CockroachDB’s performance at scale as we iterate and improve on it. Examples of recent ad hoc tests include:

Scalability testing with 20 to 200 nodes (single and multi-region), 32 VCPU, KV workload, with various read-write ratios, 4 billion rows, ~13.5 TiB of replicated data. Tested max throughput within a given latency budget. For example, for a write-only workload, demonstrated 160k QPS with 20 nodes and 1.2M QPS with 200 nodes (with headroom). For a 99% read workload, these tests achieved 400k QPS with 20 nodes.
Testing with 20 to 30 node single and multi-region clusters, 32 VCPU, workload simulating operations on a social graph using sysbench, with a varying percent of writes. For example, for a 91% read workload, the tests achieved 187k QPS. Tested max throughput within a given latency budget and examined the impact of common background operations (Backup, CDC, node addition/decommission and rolling restarts) on the performance of foreground application traffic.
Tested throughput and latency performance with multiple secondary indices, with CockroachDB’s indexes workload, on a 5 node single-region cluster, 32 VCPU. The data included 100 million rows (368 bytes each) with a varied number of secondary indices.
TPC-E testing with 15 nodes, 48 VCPU, 15TB pd-ssd per node. The dataset represented 2,000,000 customers, with 600,000 active customers (45TB replicated cold data and 15TB replicated active set). Transaction rate: 12K transactions per second. Tests included import, offline and online index creation, various schema changes, draining and decommissioning, CDC, and Backup.
TPC-E testing with 96 nodes, 30 VCPU; the same 45TB replicated dataset described above was imported, and the whole dataset was used as the active set. In these experiments we reached 40K transactions per second. Note that each transaction includes multiple (~10) complex queries (e.g., this or this), and even the simpler ones were 4-5 way joins.
Every change to the database expected to impact performance is tested separately, and we continuously run (micro)benchmarks to detect subtle performance changes to individual system components. For example, in 23.2 we tested the performance overhead of expiration-based leases, a reliability enhancement. Since this feature induces a per-range overhead, we evaluated scalability with the number of ranges. As a result, in 23.2, we will not recommend its use for workloads with more than 10,000 ranges per node (at 512 MiB per range). This testing informed our 24.1 plans to develop further optimizations in this area.

Customer-partnered benchmarking

In addition to everything described above, we also routinely partner directly with customers to test their use cases and workloads at their current or expected scale and stress level. Recent anonymized examples of customers that have or are in the process of partnering with us in this regard include:

An IaaS company with 20 CockroachDB clusters at ~100+ nodes each with the largest being 160 nodes. Storage density across clusters ranges from 2.5 - 10 TB/node paired with a minimum of 22 VCPUs/node. Total throughput is ~200 ops/sec/VCPU of which 70% are reads and 30% writes/deletes. The workload is mostly point inserts/reads but also includes more complex queries such as a distributed scan. Their largest table has 80 billion records. The customer is currently on v22.2 with plans to move to 23.1.
An e-commerce company with 335 clusters. Their current largest is a single region cluster with 96 m6i.12xlarge (48 VCPU) nodes (they have run a cluster with as high as 225 nodes in production) performing 3m point reads/sec or 3m writes/sec. Their current largest cluster by data volume is 280TB (650TB at peak) with 1,323,000 ranges and a 122TB table. The customer is also our largest user of changefeeds with ~900 total deployed across multiple clusters.
Another e-commerce company with a 36 m6i.4xlarge (16 VCPU) node multi-region cluster in eu-central-1, eu-west-1, eu-west-2 on AWS Dedicated. The cluster currently serves 100k–300k QPS.
A financial service company with an 18 node, 3 region cluster in the USA. 2 nodes per AZ, M6in.8Xlarge with 500G per node/9TB total storage, 576 vCPU running version 22.2.5. Their workload is bursty, maxing out at market open at 50k operation executions/second (2 - 7 read or write operations per transaction) with 400 ms latency.

As we’ve mentioned, these are just illustrative examples, and we perform quite a bit of testing that isn’t described here.

Curious about the kind of performance CockroachDB could offer for your workload? Let’s talk about it.