About 75 percent of container orchestration is done in Kubernetes. But the popularity of the k8s platform doesn’t mean it’s easy to use in all scenarios. Kubernetes can have a steep learning curve, especially when it comes to managing state and storage. In a recent episode of “The Cockroach Hour”, Director of Product Marketing Jim Walker sat down with Keith McClellan, one of Cockroach’s Solutions Engineers, to chat about the storage and data challenges that you’ll encounter deploying Kubernetes, and how running it with CockroachDB can simplify those challenges.
Keith used to work at D2iQ–back when it was still called Mesosphere–and kicked off the conversation with a discussion of how Kubernetes came to be, and why it beat out most other container orchestration platforms. D2iQ’s container orchestration scheduling platform is based on Google’s Borg structure, which predates Kubernetes. But in 2015, D2iQ started contributing to the Kubernetes project before it was released as an open source platform, which is where Keith’s experience with the project kicked off.
In that time, Kubernetes quickly emerged as the top container orchestration platform. Other tools on the market were mostly operator-focused, but Kubernetes won out by making it easy for developers to deploy their applications to a cluster of machines. The other platforms are suited for operators taking their containerized apps and deploying them to a cluster of machines. That’s a fine distinction, but it matters a lot when you’re talking about how long it’s going to take you to get an application up and out and in front of people.
Moving to Kubernetes from one of the other platforms required a paradigm shift into distributed systems. For Keith, who started as an ETL developer and had run into scheduling issues with the ETL cluster before Kubernetes even existed, it was great to see tools to assign workloads to cluster machines and make them production-supportable. But the challenges working with large distributed database environments and data pipelining environments led Keith to the infrastructure orchestration space.
When Keith was at D2iQ, one of its products used CockroachDB as the metadata layer. He could see that the next thing that needed to be fixed in the distributed computing landscape was the ability to do system of record workloads in this type of environment.
Storage in a pod a la Kubernetes is ephemeral storage – basically, a temporary file system. When you provision storage to a disk that’s local to that machine, to that pod, it lives and dies with the pod. It sticks around long enough that if something really bad happened and you need to get that state back, you have the opportunity to do that, but there’s no way to actively reattach that state to a new pod.
One of the biggest challenges with Kubernetes has been maintaining state, and then maintaining the consistency of that state. Arguably this applies mostly to versions before Kubernetes 1.13, although persistent volumes and StatefulSets were available as far back as 1.8 in alpha or beta form. Before that, if your pod only had ephemeral storage and you didn’t attach external storage to it, you didn’t really have state that persisted across a pod restart. As a result, a lot of things needed to come to the Kubernetes ecosystem. One of the best parts of Kubernetes is also one of its worst: because it was designed to be super easy for developers, it’s a little harder to operate than some of the alternatives, and one of the things that was harder to operate was this stateful storage layer.
In the 1.8 to 1.13 timeframe, Kubernetes gained the concept of persistent volumes: durable storage on the node the pod is running on, made available to the container or pod without you having to manually mount file systems into the pod. That allows you to have more stateful applications. However, the big challenge with persistent volumes is that they have to be manually provisioned ahead of time. They have to be an available resource that an operator has set up inside of Kubernetes. Each node has to have a set of persistent volumes set up, all of them mounted on that machine and available to the Kubernetes daemons running there, and only then can they be reserved.
Storage classes were designed to make this a little bit easier. Instead of attaching directly to a persistent volume, a pod requests storage of a particular class with a particular access mode (e.g., ReadWriteOnce), and the system figures out how to mount it. But there are still challenges with running persistent stores on top of a Kubernetes-based ecosystem. The vast majority of persistent volumes are some type of direct-attached storage, which isn’t portable. This means that once a pod with a particular persistent volume is started on a node, you can’t really move that pod without operator intervention. There are remote storage options that give you more storage portability, but those aren’t necessarily things you’re guaranteed to have in every Kubernetes cluster.
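As a rough illustration of what such a request looks like, the sketch below builds a PersistentVolumeClaim manifest as a plain Python dict and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The claim name `data-claim`, the storage class `fast-ssd`, and the size are hypothetical values chosen for the example; the storage classes actually available depend on how your cluster’s operator has set things up.

```python
import json


def persistent_volume_claim(name, storage_class, size, access_mode="ReadWriteOnce"):
    """Build a PersistentVolumeClaim manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # The class tells the cluster which kind of storage to provision.
            "storageClassName": storage_class,
            # ReadWriteOnce: the volume is mounted read-write by a single node.
            "accessModes": [access_mode],
            "resources": {"requests": {"storage": size}},
        },
    }


# Hypothetical names and size, for illustration only.
claim = persistent_volume_claim("data-claim", "fast-ssd", "100Gi")
print(json.dumps(claim, indent=2))
```

The claim names a class and an access mode rather than a specific volume, which is what lets the cluster decide how to satisfy it.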
If a pod fails and it has a persistent volume, that volume won’t be garbage collected unless an administrator goes in and deletes the PV. By default, the timeout for ephemeral storage is either 12 or 24 hours, depending on the distribution of Kubernetes you’re running. But for a stateful workload, you wouldn’t rely on that. It’s meant for collecting logs from pods that are repeatedly failing, not for storing data you need to keep.
A Kubernetes StatefulSet allows you to bind a persistent volume claim to each clone of a pod. Rather than having to manually reclaim that persistent volume every time a pod restarts, the StatefulSet maintains the claim across restarts and manages the process of automatically remounting it into that pod. So if we’re doing a rolling restart for a software upgrade, we don’t lose our storage. We don’t lose the persistent volume when a pod restarts to change the binary out, to apply a security patch, or because a server had to be restarted for an OS patch.
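One way to see why the claim survives a restart is the naming scheme: a StatefulSet names its pods `<statefulset>-<ordinal>` and the claims created from a volume claim template `<template>-<statefulset>-<ordinal>`, so a restarted pod comes back under the same name and finds the same claim. The sketch below only models that naming; the StatefulSet name `cockroachdb` and template name `datadir` are illustrative values.

```python
def pod_name(statefulset, ordinal):
    """Pods in a StatefulSet get a stable, ordinal-based name."""
    return f"{statefulset}-{ordinal}"


def claim_name(template, statefulset, ordinal):
    """Claims are named after the claim template plus the pod name."""
    return f"{template}-{pod_name(statefulset, ordinal)}"


def claims_for(template, statefulset, replicas):
    """Map each pod name to the claim it will mount."""
    return {
        pod_name(statefulset, i): claim_name(template, statefulset, i)
        for i in range(replicas)
    }


# Illustrative names; the mapping depends only on the spec, not on pod lifetime.
before_restart = claims_for("datadir", "cockroachdb", 3)
after_restart = claims_for("datadir", "cockroachdb", 3)

assert before_restart == after_restart
print(before_restart["cockroachdb-0"])  # datadir-cockroachdb-0
```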
StatefulSets are used beyond storage. It’s largely the configuration that binds the state to the pod. Anything that needs that kind of criteria gets recorded in the metadata of the StatefulSet. As far as exactly which pieces of the system are currently leveraging that infrastructure, it changes every Kubernetes release. Every time there’s a new bit of state that needs to be attached, they just add it to the StatefulSet. Without StatefulSets, you would see more ephemeral patterns. For example, if you have a microservice that needs a certificate, you’d need to use something like Vault. Now, Kubernetes has its own built in certificate management, so it gets easier to manage. But figuring out where that metadata needs to be stored in Kubernetes becomes a little bit more complicated.
DaemonSets and StatefulSets are different. A StatefulSet will run an arbitrary number of pods across an arbitrary number of servers. A DaemonSet will run a clone of a particular pod on every server in a Kubernetes cluster. Unless you’ve set what the Kubernetes ecosystem calls a taint, that is, unless you’ve tagged a node so that it does not get a copy of that daemon, you’re going to run a copy of this pod on every node in your cluster. It’s largely used for things like security agents, and sometimes for east-west load balancing or for other admin tools; if you have a Logstash agent that needs to run on every single node, for example, that’s what you’d use. It is also an option for running stateful applications. The DaemonSet has the same underlying constructs as a StatefulSet as far as maintaining and attaching state. It works a little bit differently, but you can attach a persistent volume claim as part of the DaemonSet, which allows you to do a lot of the same things you can do with a StatefulSet.
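The “every node unless tainted” rule can be sketched with a deliberately simplified model. Here a taint is a `(key, value, effect)` tuple, a toleration matches only an identical tuple, and only the `NoSchedule` and `NoExecute` effects keep a pod off a node. The real scheduler also handles toleration operators, node selectors, and tolerations that Kubernetes adds automatically, none of which are modeled here. Node names and the `dedicated=batch` taint are hypothetical.

```python
# Taint effects that prevent a pod from being placed on a node.
BLOCKING_EFFECTS = {"NoSchedule", "NoExecute"}


def daemon_nodes(nodes, tolerations=frozenset()):
    """Return the nodes that would run a copy of the daemon pod.

    nodes: dict mapping node name -> set of (key, value, effect) taints.
    tolerations: set of taints the daemon pod tolerates (exact match only).
    """
    eligible = []
    for name, taints in nodes.items():
        blocking = {t for t in taints if t[2] in BLOCKING_EFFECTS}
        # A node is eligible when every blocking taint is tolerated.
        if blocking <= set(tolerations):
            eligible.append(name)
    return sorted(eligible)


# Hypothetical three-node cluster with one tainted node.
cluster = {
    "node-a": set(),
    "node-b": set(),
    "node-c": {("dedicated", "batch", "NoSchedule")},
}

print(daemon_nodes(cluster))
# ['node-a', 'node-b']
print(daemon_nodes(cluster, {("dedicated", "batch", "NoSchedule")}))
# ['node-a', 'node-b', 'node-c']
```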
You can use DaemonSets with CockroachDB. There are good reasons to do so; a lot of the time it depends on the type of infrastructure you’re running your Kubernetes cluster on. Because CockroachDB can scale both vertically and horizontally (we can either scale individual pods up to consume more resources, or scale out to more pods to increase our resource utilization), how distributed you choose to be depends on your workload. If you’re doing a lot of data pipelining, having a local gateway to your database on every single one of your nodes can be really valuable. But CockroachDB supports both patterns, depending on your use case.
Unless you’re running two separate CockroachDB clusters, you don’t need to run a StatefulSet and a DaemonSet at the same time. You might want to do that, if you’re sharing production and pre-production on the same Kubernetes cluster. In that case, you might want to run your production environment on a daemon set, and you might want to run a pre-production or a test environment on a smaller subset of the nodes. But in a general purpose sense, that would be a very advanced kind of edge case for this.
The only thing a developer really needs to be concerned about is whether or not they have state that needs to persist. If they have state that needs to persist, then they need to care about StatefulSets, or some other mechanism for preserving that state across pod restarts. It’s mostly there to make it easy to operate these types of workloads in production.
From a developer perspective, it’s 10 lines of YAML max in your pod configuration. On the backend, there’s configuring storage classes and/or persistent volumes, managing claims, and all that kind of stuff. But from a developer perspective, it’s a very easy pattern to use.
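To give a sense of how small the developer-facing part is, here is a sketch of the two fragments a StatefulSet needs in order to wire a claim into a container: a `volumeMounts` entry (which belongs under the container spec) and a `volumeClaimTemplates` entry (which belongs under the StatefulSet spec). It is built as a Python dict and printed as JSON; the volume name `datadir`, mount path `/data`, and size are hypothetical, and this is a fragment rather than a complete manifest.

```python
import json


def storage_fragment(volume_name, mount_path, size):
    """Return the StatefulSet fragments that attach persistent storage."""
    return {
        # Goes under spec.template.spec.containers[].volumeMounts
        "volumeMounts": [{"name": volume_name, "mountPath": mount_path}],
        # Goes under spec.volumeClaimTemplates
        "volumeClaimTemplates": [
            {
                "metadata": {"name": volume_name},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "resources": {"requests": {"storage": size}},
                },
            }
        ],
    }


# Hypothetical values, for illustration only.
fragment = storage_fragment("datadir", "/data", "100Gi")
print(json.dumps(fragment, indent=2))
```

The mount and the claim template are tied together only by sharing the same name, which is most of what the developer has to get right.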
It’s actually easier for the most part to [operate CockroachDB on Kubernetes](https://www.cockroachlabs.com/docs/stable/orchestrate-cockroachdb-with-kubernetes.html) than it is on bare metal or on statically provisioned VMs, largely because CockroachDB is a single executable that offers a common gateway to the database through every node. Every node is exactly the same. The only difference is which segments of the data that particular node happens to be managing. So if you have an intermittent failure or you’re doing a rolling upgrade, you really want that infrastructure to come back up as quickly as possible. If you’re on bare metal you’ve got to write a script or system service to make sure that the database restarts. There’s a lot of operations work that just goes away with Kubernetes.
Yet there are cases where a bare metal or VM-based install is recommended for a customer over a Kubernetes-based install, just as there are reasons why Kubernetes might be the absolute best platform for somebody. For example, it’s very common for customers to run a single database cluster across three or more data centers, spanning multiple different environments. Having the common operating layer of Kubernetes makes it much easier to manage the fact that you might not have the exact same infrastructure under the covers.
There’s also a tradeoff between the resiliency you get with Kubernetes and the performance you get with bare metal. Best-case performance is going to be better on bare metal than on Kubernetes, even though it takes more operational overhead to maintain the database on bare metal. Then it becomes a question of whether you can meet your transactional requirements in a Kubernetes-based environment. If you can, CockroachDB can drastically reduce operational overhead by [running in Kubernetes](https://www.cockroachlabs.com/guides/kubernetes-statefulsets/).
Q: How do you choose when to mount something, versus when do you use local storage?
A: Generally speaking, I worry about mounting persistent storage if there is state that I know needs to survive a pod restart of some sort. If I don’t care about state surviving a pod restart, then I’m going to use ephemeral storage, because that’s going to allow me the maximum pod portability. As soon as I start doing persistent volume claims and whatnot, that changes the mobility of those pods a little bit, and so you’ve got to pay more attention as an operator. If I can get away with not mounting storage, I’m going to not mount storage.
Q: Do you have any tips and tricks on editing and managing YAML files?
A: Make sure you use vi, and don’t try to use TextEdit on your Mac, because it’s going to mess up all of your spacing. There’s also a linter called yamllint that you can get on your Mac or on Linux.
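To show the kind of spacing problem at stake, here is a small standard-library sketch that flags tab characters in indentation (which YAML does not allow) and trailing whitespace. It is a toy check written for this recap, not a substitute for yamllint, and the function name and sample text are made up for the example.

```python
def indentation_problems(text):
    """Return (line_number, message) pairs for common YAML spacing mistakes."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Everything before the first non-space, non-tab character.
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            problems.append((number, "tab in indentation"))
        if line != line.rstrip():
            problems.append((number, "trailing whitespace"))
    return problems


# The third line is indented with a tab instead of spaces.
sample = "spec:\n  replicas: 3\n\ttemplate: {}\n"
print(indentation_problems(sample))  # [(3, 'tab in indentation')]
```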
Q: Why federate the operations of the cluster? Why not just federate the data? And so, could CockroachDB be used as that tool to federate workloads across multiple clusters?
A: There are some exceptions to this, but if your application is largely stateless and it uses a database to maintain state, then CockroachDB could do that. Then you don’t have to connect your control planes, Kubernetes being your control plane. You still do have to do network peering, which is not the easiest thing to do in Kubernetes. A service mesh is not necessarily the best fit because it’s not designed for point-to-point communications. It’s designed to load balance incoming requests across a number of clones of a particular pod. So you might use Istio for managing the stateless side of your application, but some other form of network peering may be needed for the database, because every node in the database needs to be able to talk to every other node in the database. Not because this will happen all that much, but in a failure scenario sometimes the database can’t just talk to the next nearest neighbor that happens to be hosting a copy of the data to get to a consensus.