The first version of CockroachCloud, our database-as-a-service product, had our users fill out a Google doc with their cloud deployment preferences. We’ve come a long way since that initial proof of concept (including, yes, building a UI). More importantly, we’ve put implementation patterns in place that make something complicated (like configuring a cloud deployment) scalable. As we’ve written about before, choosing to run CockroachCloud on Kubernetes is a huge part of that implementation pattern.
In a recent episode of The Cockroach Hour, CockroachCloud SREs Juan Leon and Josh Imhoff sat down to talk about some of the tools and processes we use to make that happen. While some of the information is specific to CockroachCloud, many of the implementation patterns covered in the webinar are applicable to anyone considering running their SaaS product on Kubernetes. In this blog post, we’ll recap some of those lessons, from the unexpected benefits of running on Kubernetes to the provisioning and certification platforms we use alongside it.
About a year ago, Josh wrote a blog post about why the SRE team started running CockroachCloud on Kubernetes. Kubernetes has a lot of different benefits. Early on, we cared primarily about its automation and orchestration abilities (more on that below). But as time goes on, we’ve found more and more reasons to love K8s.
CockroachCloud is a managed database as a service, and we needed our customers to be able to scale their CockroachDB clusters up and down at the drop of a hat. Therefore, the powerful automation primitives inherent in Kubernetes were really top of mind for us.
When we were just starting CockroachCloud, we handled this workflow with Terraform. Terraform spun up the VMs for us, and then we ran CockroachDB on those VMs with supervisor scripts and a touch of homegrown automation. The benefit of that approach was that it was very simple. You just have the VM and the Cockroach process running on it. That's it, which the SRE team liked: there weren't tons of moving pieces to understand. But the cost was high: it was rather hard to do orchestration tasks.
But we knew we needed to automate these tasks while maintaining a high level of reliability, and we didn't want to build that ourselves. Kubernetes was the best way to do that, and it does an excellent job.
Another main benefit of deploying our software as a service on Kubernetes is its bin packing capabilities. Kubernetes lets you pack a bunch of containers onto a VM in a way that improves resource utilization. This was one of the motivations behind Google's Borg (a tool both Josh and Juan had used previously). Google realized how many resources they could save by bin packing containers onto machines.
When we were initially building CockroachCloud, we didn't care about bin packing at all. We just wanted to run one Cockroach node per VM, because we wanted the whole VM for CockroachDB. We didn't even need that benefit of Kubernetes. A lot of Kubernetes' complexity comes from supporting both orchestration and bin packing, so we were taking on all of that complexity in order to get only some of the benefits.
Kubernetes Provides a Common Interface
Another big benefit we didn't totally expect when first using Kubernetes is the simplicity a common interface offers. Right now, CockroachCloud runs on GCP and AWS, and we have plans to expand. Kubernetes offers a consistent way of running production across clouds. And that's powerful.
As far as monitoring goes, we run Prometheus, Prometheus Alertmanager, and Grafana in each customer cluster on Kubernetes. So in a given Kubernetes cluster dedicated to customer A, we're running Prometheus, Alertmanager, and Grafana. If it's a multi-region cluster, each region runs Prometheus and each instance scrapes all the regions, so monitoring is replicated. Then we have a meta-monitoring Prometheus instance that sits outside of the cluster. It pages us when there are problems affecting the customer.
We also needed to automate certs when scaling CockroachCloud. We use HashiCorp’s Vault (and Kubernetes) for that. Vault is a really powerful secret store. If you have a secret that you want to store in a specific place, you write a key to Vault, and Vault will keep it and handle encryption for you. It also serves as our certificate authority, and manages and distributes project owner tokens.
We run Vault in a control plane that's used to manage all the clusters. That way, Vault's not in the serving path. Then, we use secrets within each instance to manage the certificate. Alternatively, we could have had CockroachDB talk directly to Vault, but that would introduce security issues, because the customer cluster would then have access to a system that holds keys for all the customers. Vault writes things to Kubernetes, but the customer’s cluster, since it’s accessible by the customer, cannot talk to Vault.
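The separation described here (the signing key stays in the control plane, and only an issued certificate and its key are pushed into the customer's cluster) can be sketched with Go's standard `crypto/x509` package. This is not Vault's API or our production code: the `authority` type stands in for the control-plane signer, and the host name, lifetimes, and function names are assumptions made for illustration.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// authority stands in for the signer that lives only in the control plane.
type authority struct {
	cert *x509.Certificate
	key  *ecdsa.PrivateKey // never leaves the control plane
}

// newAuthority creates a self-signed CA certificate.
func newAuthority(name string) (*authority, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: name},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(24 * time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		return nil, err
	}
	return &authority{cert: cert, key: key}, nil
}

// issue signs a node certificate. Only the returned certificate and node
// key would be written into the customer cluster (e.g. as a secret).
func (a *authority) issue(dnsName string, serial int64) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(serial),
		Subject:      pkix.Name{CommonName: "node"},
		DNSNames:     []string{dnsName},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, a.cert, &key.PublicKey, a.key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		return nil, nil, err
	}
	return cert, key, nil
}

// verify checks that leaf chains to caCert and is valid for dnsName.
func verify(caCert, leaf *x509.Certificate, dnsName string) error {
	roots := x509.NewCertPool()
	roots.AddCert(caCert)
	_, err := leaf.Verify(x509.VerifyOptions{Roots: roots, DNSName: dnsName})
	return err
}

func main() {
	ca, err := newAuthority("control-plane-ca")
	if err != nil {
		panic(err)
	}
	host := "node-0.customer-a.example.internal" // hypothetical node address
	leaf, _, err := ca.issue(host, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println("node certificate verifies:", verify(ca.cert, leaf, host) == nil)
}
```

The point of the design is visible in what crosses the boundary: the customer's cluster receives a certificate that peers can verify against the CA's public certificate, but nothing in the cluster can mint certificates for another customer.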
We were initially depending on Terraform to provision hardware. But again, we had this future in our heads of a service that creates clusters, adds regions to existing clusters, and all of that. And we didn't really want our code to be generating Terraform configs that then get executed by Terraform. We just thought that was a little messy. So we looked at Pulumi, which is very similar to Terraform, except that you can configure it with Go code. We’re a Go shop, which made it a great fit for us.
Your implementation patterns will no doubt vary from CockroachDB’s. But the lessons Josh, Juan, and Jim cover in this webinar are relevant to anyone running, or considering running, their software-as-a-service on Kubernetes. For more details on gotchas and lessons learned (plus a quick conversation about what Kubernetes can do in a serverless world), watch the full webinar here.