Why CockroachDB runs managed services on Kubernetes

*Note: this post originally ran in 2020, at the very beginning of our managed service/multi-tenant engineering journey. If you want an update on what kind of deployment models CockroachDB offers and what our current capabilities are, check out our latest release blog.*

There’s this really fun game I like to play. You get a bunch of SREs in a room and you see how quickly you can rile them up. Here are some things to say next time you’re in a room of SREs:

“SRE team’s only job is to keep the service in SLO.”
“This SLO doesn’t mean anything anyway so I don’t care about it.”

If you say one of these things, you’ll have all the other SREs in the room screaming and sweating and breaking their laptops over their knees in no time. For extra fun, pair with a friend and each take different sides!

Another topic that works really well is Kubernetes:

“You aren’t on Kubernetes? Wow, that’s a mistake.”
“You are on Kubernetes? Wow, that’s a mistake.”

See this thread on Hacker News for more fun.

In late 2018 Cockroach Labs began building a managed service offering called CockroachDB Dedicated. For many months the SRE team evaluated different automation and orchestration technologies on top of which to build CockroachDB Dedicated. Eventually, we turned our attention to Kubernetes. We didn’t break any laptops over our knees but it was quite the journey nonetheless. Each SRE started with one perspective, loaded up their cocoon with Kubernetes docs and tech talks and bug reports, spent a month in the darkness of the cocoon thinking deep thoughts, and finally emerged as a beautiful Kubernetes butterfly with that strange Kubernetes boat logo tattooed on their wings.

In the end, we chose Kubernetes as our automation technology. This blog post describes the journey to Kubernetes in detail. Or you can watch this video to see me explain our decision in person:

Container orchestration at Google and at Cockroach Labs

I started my career as an SRE at Google. At Google, we used Borg, a container orchestration system that is internal to Google and was the predecessor to Kubernetes. Users of Borg declare that they want to run such and such binary with such and such compute requirements (10 instances, 2 vCPU and 4 GB of RAM per instance), and Borg decides what physical machines the applications should run on and takes care of all low-level details such as creating containers, keeping the binary running, etc.

I was used to this model when I joined Cockroach Labs. Since Kubernetes was widely used in industry, why wouldn’t we use it at Cockroach Labs? An experienced teammate pointed out that even if Borg and Kubernetes are similar, the fact that Borg is a good fit for Google doesn’t imply that Kubernetes is a good fit for Cockroach Labs. The key point my teammate was making is that Google and Cockroach Labs are not the same!

I began to think more about the differences between Google and Cockroach Labs. Here are some things that are true of Google:

Google runs its own data centers.
Google runs as wide a variety of services as any company on Earth. There are big services like websearch and YouTube, which are made of many microservices and operated by hundreds of engineers, and there are small services like source code repositories, which are operated by teams of fifteen or so engineers. There are also user-facing parts of services, such as the system that serves websearch results to users, and batch parts of services, such as the system that crawls and indexes the web so as to keep results fresh and relevant.
Google has many software engineers working for them. Internet searches ballpark the number of engineers between 5K and 20K.

Here are some things that are true of Cockroach Labs:

CockroachDB Dedicated leverages public clouds such as GCP and AWS.
CockroachDB Dedicated provides a single service: CockroachDB as a service. Basically, the whole engineering organization at Cockroach Labs works on CockroachDB. CockroachDB is a single binary; in order to provide survivability and scalability, multiple “nodes” are run, which join together to form a “cluster”.
Cockroach Labs has around forty engineers right now.

Not the same at all!

In the Google case, a major benefit of container orchestration is that it decouples machine planning from the concerns of the software engineering teams. Google can plan compute in aggregate; then software engineering teams can use Borg to schedule their heterogeneous workloads onto machines. Borg will binpack the workloads so as to use compute efficiently; in general, multiple workloads will run on one machine. If software engineers had to work at the level of individual machines, Google would not use compute nearly as efficiently. This is a very big problem when you are at Google’s scale.

On the other hand, does binpacking do Cockroach Labs any good at all? The public clouds allow for elastic compute at different shapes; why not just ask the public clouds for VMs, perhaps via a tool like Terraform? Note that at the time we were researching Kubernetes, Cockroach Labs ran a separate instance of CockroachDB per customer, in dedicated GCP projects or AWS accounts, since CockroachDB didn’t yet support multi-tenancy (now it does!). The database required at least 2 vCPUs of compute per node to run well, and cloud providers provide machines of this shape on demand. In that world, binpacking does not reduce costs in any meaningful way.
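To make the binpacking argument concrete, here is a minimal sketch using the classic first-fit decreasing heuristic. It is not how Borg or Kubernetes actually schedules work; the function name, the workload sizes, and the machine capacities are all made up for illustration.

```python
from typing import List


def first_fit_decreasing(demands: List[float], capacity: float) -> List[List[float]]:
    """Pack vCPU demands onto fixed-size machines (first-fit decreasing).

    Illustrative only: real schedulers weigh many more dimensions
    (RAM, disk, affinity, priorities) than a single vCPU number.
    """
    machines: List[List[float]] = []
    for demand in sorted(demands, reverse=True):
        if demand > capacity:
            raise ValueError(f"demand {demand} exceeds machine capacity {capacity}")
        for machine in machines:
            # Place the workload on the first machine with enough room left.
            if sum(machine) + demand <= capacity:
                machine.append(demand)
                break
        else:
            # No existing machine fits, so open a new one.
            machines.append([demand])
    return machines


# Hypothetical heterogeneous fleet: eight services with different vCPU needs.
mixed = [0.5, 1.5, 1.0, 3.0, 0.5, 2.0, 1.0, 2.5]
packed_mixed = first_fit_decreasing(mixed, capacity=4.0)

# Hypothetical single-tenant cluster: three database nodes at 2 vCPUs each,
# on VMs that the cloud provider sells in exactly that shape.
dedicated = [2.0, 2.0, 2.0]
packed_dedicated = first_fit_decreasing(dedicated, capacity=2.0)

print(len(mixed), "services fit on", len(packed_mixed), "machines")
print(len(dedicated), "nodes fit on", len(packed_dedicated), "machines")
```

With these made-up numbers, the mixed fleet shrinks from eight machines (one per service) to three, while the dedicated cluster needs three VMs with or without packing, which is the point of the paragraph above.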

It’s not just about binpacking

Google has written an excellent whitepaper on the history of Borg, Omega, and Kubernetes. This part is very relevant to this discussion:

“Over time it became clear that the benefits of containerization go beyond merely enabling higher levels of utilization. Containerization transforms the data center from being machine-oriented to being application-oriented.”

Having used (and loved) Borg, I felt this to be true deep in my bones. Operators describe an application’s requirements declaratively via a domain-specific language called GCL (incidentally, the language is a fever dream of surprising semantics), and Borg handles all the low-level details of requesting compute, loading binaries onto machines, and actually running the workload. If you want to update to a new version of a binary, an operator simply writes a new GCL file to Borg with the new requirements. In summary, Borg provides a wide variety of powerful automation primitives, which make software engineering and SRE teams efficient; they are able to focus on improving their services rather than doing repetitive and uninteresting service operation tasks. Wouldn’t Cockroach Labs benefit from access to solid out-of-the-box automation primitives also?
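The declarative model described above can be sketched in a few lines. This is not GCL and not Borg's API; `JobSpec`, `RunningInstance`, and `plan` are hypothetical names, and the sketch only shows the general idea of comparing a desired state against an observed one and deriving the actions needed to close the gap.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class JobSpec:
    """Desired state: what the operator declares (hypothetical shape)."""
    binary: str
    version: str
    instances: int
    vcpu: int
    ram_gb: int


@dataclass(frozen=True)
class RunningInstance:
    """Observed state: what is actually running right now."""
    index: int
    version: str


def plan(spec: JobSpec, running: List[RunningInstance]) -> List[str]:
    """Return the actions needed to move the observed state to the spec."""
    actions: List[str] = []
    by_index = {inst.index: inst for inst in running}
    for i in range(spec.instances):
        inst = by_index.get(i)
        if inst is None:
            actions.append(f"start {spec.binary}:{spec.version} as instance {i}")
        elif inst.version != spec.version:
            actions.append(f"restart instance {i} with {spec.binary}:{spec.version}")
    # Anything running beyond the declared instance count is surplus.
    for inst in running:
        if inst.index >= spec.instances:
            actions.append(f"stop instance {inst.index}")
    return actions


# The operator only edits the spec; the orchestrator works out the steps.
spec = JobSpec(binary="crdb", version="19.1.1", instances=3, vcpu=2, ram_gb=4)
observed = [
    RunningInstance(index=0, version="19.1.0"),
    RunningInstance(index=1, version="19.1.1"),
    RunningInstance(index=3, version="19.1.0"),
]
for action in plan(spec, observed):
    print(action)
```

The operator never says "restart instance 0"; they only change the version in the spec, and the restart falls out of the diff.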

Kubernetes pros: easy automation and orchestration

The goal of CockroachDB Dedicated is to provide CockroachDB to customers without them having to run the database themselves. Distributed databases are complex; not all companies want to develop the expertise in house needed to run them reliably.

CockroachDB Dedicated allows customers to request the following operations to be run on their cluster via a convenient web UI:

1. Create a cluster running in this and that region with this much compute per node
2. Add or remove a node to a running cluster
3. Add or remove a region to a running cluster
4. Update the database version to the latest one

Let’s consider #4 in detail. Without Kubernetes, you could either build the automation yourself or shop around for some other automation tool.

Why is building it yourself a bad idea? Let’s consider the requirements. There is no more efficient way to break a piece of software than updating to a new version. This argues that the update must proceed gradually, updating one CockroachDB node at a time. If the database starts to fail (e.g. if health checks fail), the update must halt without operator involvement. It should also be possible to roll back quickly if there are unforeseen problems. At the same time, if the update often halts unnecessarily due to the flakiness of the automation tech, then oncall load increases. The above requirements are tricky to get right, and we have not even considered all the low-level details involved in building this yourself, such as SSHing into machines to stage binaries and sending HTTP requests to do health checks.
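To give a feel for what "build it yourself" means, here is a toy sketch of just the control loop implied by those requirements: update one node at a time, retry the health check a few times so a single flaky probe does not halt the rollout, and roll everything back if a node stays unhealthy. The function name, the `attempts` parameter, and the in-memory `versions` dict are assumptions for illustration; a real implementation would also have to stage binaries, restart processes, and probe over the network.

```python
from typing import Callable, Dict, List


def rolling_update(
    versions: Dict[str, str],
    target: str,
    is_healthy: Callable[[str], bool],
    attempts: int = 3,
) -> bool:
    """Update nodes one at a time; halt and roll back if a node stays unhealthy.

    `versions` maps node name -> running version and is modified in place.
    Returns True if every node ends up on `target`, False after a rollback.
    """
    previous = dict(versions)
    updated: List[str] = []
    for node in sorted(versions):
        versions[node] = target
        updated.append(node)
        # Retry the check so one transient failure does not page the oncall.
        healthy = any(is_healthy(node) for _ in range(attempts))
        if not healthy:
            # Halt without operator involvement and undo in reverse order.
            for done in reversed(updated):
                versions[done] = previous[done]
            return False
    return True


# Hypothetical three-node cluster where node n2 never passes its health check.
cluster = {"n1": "19.1.0", "n2": "19.1.0", "n3": "19.1.0"}
ok = rolling_update(cluster, "19.1.1", is_healthy=lambda node: node != "n2")
print(ok, cluster)
```

Even this toy version has to make policy decisions (how many retries, what order to roll back in), and it has not touched a single machine yet.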

Here’s how to do #4 with Kubernetes:

$ kubectl rolling-update crdb --image=cockroachdb:19.1.1

That’s it! Note that the speed of the rolling update is configurable, the automation is robust to transient failures, and it supports configurable health checks.

Similar arguments apply to #1, #2, and #3. In fact, the case for Kubernetes is even stronger in these cases, since safely changing the footprint of a stateful application that is serving customer traffic is not something automation technologies other than Kubernetes do particularly well.

Previously I asked the following question:

“Wouldn’t Cockroach Labs benefit from access to solid out-of-the-box automation primitives also?”

As we built prototypes, we felt the answer was definitely yes. On the other hand, the fact that access to automation primitives is a benefit of using Kubernetes doesn’t imply there aren’t also downsides. Remember the screaming masses on Hacker News please!

Kubernetes con: complexity costs

One of the main complaints about Kubernetes is that it is complicated. Let’s add some color to the claim that Kubernetes is complicated. In what senses is it complicated?

1. It’s a stateful service. It leverages etcd, a stateful distributed system, to store its state.
2. There’s a hierarchy of independent controllers that work together to automate tasks. The different controllers have pretty clear and separable functions, but it is still initially hard to get your head around the dance that they are doing together. Julia Evans has written an excellent blog post describing the dance in detail.
3. In order to allow multiple pods to serve on the same port on one machine, Kubernetes must assign a unique IP to each pod. So the IPs of the VMs themselves are not used directly. For GKE specifically, routing rules in GCP, Unix networking configuration on the k8s nodes themselves, and the underlying VPC network work together to create the abstraction. Note that on other cloud providers, the model is the same, but the abstraction might be implemented differently.
4. If a user wants to run CockroachDB across multiple geographic regions, then multiple Kubernetes clusters must be joined together in some way. Former Cockroach Labs engineer Alex Robinson has written a blog post exploring the options; none of them are simple.
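The per-pod IP requirement above comes down to the fact that a listening address is an (IP, port) pair, and each pair can only be claimed once. The sketch below models that with a plain dictionary rather than real sockets; the class and function names and the IP addresses are hypothetical, and 26257 is CockroachDB's default SQL port.

```python
from typing import Dict, List, Tuple


class PortInUse(Exception):
    """Raised when an (IP, port) pair already has a listener."""


class Network:
    """Toy model of a network: at most one listener per (IP, port) pair."""

    def __init__(self) -> None:
        self._listeners: Dict[Tuple[str, int], str] = {}

    def listen(self, ip: str, port: int, owner: str) -> None:
        key = (ip, port)
        if key in self._listeners:
            raise PortInUse(f"{ip}:{port} already used by {self._listeners[key]}")
        self._listeners[key] = owner


def can_colocate(pod_ips: List[str], port: int) -> bool:
    """Can all pods listen on the same port, given the IP each one binds to?"""
    net = Network()
    try:
        for i, ip in enumerate(pod_ips):
            net.listen(ip, port, owner=f"pod-{i}")
    except PortInUse:
        return False
    return True


# Two pods sharing the VM's own IP collide on the same port...
print(can_colocate(["10.0.0.5", "10.0.0.5"], 26257))
# ...but two pods with their own IPs can both use it unchanged.
print(can_colocate(["10.8.0.2", "10.8.0.3"], 26257))
```

Giving each pod its own IP is what lets applications keep their usual port without any changes, at the price of the extra routing layer described above.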

1 and 2 are control plane only. That is, if etcd or the controllers break, we cannot add a node to a CockroachDB cluster running on Kubernetes, but the cluster will keep serving SQL queries all the same. Also, there are managed Kubernetes offerings such as GKE that reduce how much Cockroach Labs SRE needs to know about etcd, for example.

3 and 4 are data plane. These are the scary ones. If the networking stack breaks, the cluster will NOT keep serving SQL queries. SRE feels we must understand the networking stack deeply if we are to depend on Kubernetes in production. This is no easy task, especially when already carrying lots of cognitive load about CockroachDB itself.

Why CockroachDB chose Kubernetes

Let’s summarize the situation. Google has used container orchestration for a long time and is very successful with the technology. There are two main benefits:

1. Efficient use of compute, when you are running a suite of heterogeneous services, via binpacking.
2. Solid automation primitives, which allow software engineers and SREs to think in terms of applications, not machines.

#1 doesn’t benefit Cockroach Labs at all. #2 benefits Cockroach Labs a lot. On the other hand, Kubernetes is complicated, increasing cognitive load. Specifically, the networking stack is the main piece of complexity the SRE team worries about. What to do?

We felt, at the time, that the benefits of using Kubernetes outweighed the complexity costs. By using Kubernetes, we hoped to create a highly orchestrated managed service, which would free us up to do interesting work on the reliability and efficiency of CockroachDB itself. There are also some benefits to Kubernetes that we haven’t covered in detail here, such as the fact that it provides a cross-cloud abstraction (CockroachDB Dedicated is already available for GCP and AWS), and the fact that many of our existing enterprise customers (not Managed CockroachDB customers) use it to deploy CockroachDB to their private datacenters.

A parting thought

Why is Kubernetes networking so complicated? One reason is the requirement that each pod has its own IP address, which is needed for efficient binpacking (without requiring application level changes from Kubernetes users). Funnily enough, this requirement didn’t do CockroachDB Dedicated any good at the time we adopted it, because we always ran a single CockroachDB instance on each VM. This makes me wonder how many users choose Kubernetes for binpacking vs. for automation primitives vs. for both. In this brave new world of cheap public cloud elastic compute and managed Kubernetes offerings, do many users choose Kubernetes only for the automation primitives? If so, could the networking stack be made even simpler for those users?