Gotchas and solutions running a distributed system across Kubernetes clusters

```

I recently gave a talk at KubeCon North America -- “Experience Report: Running a Distributed System Across Kubernetes Clusters”. Below is a blog based on that talk for those who prefer to read rather than listen. For anyone interested in viewing the talk, it is available here.

```

If you have run Kubernetes clusters recently, you've probably found that it's a nice way to run your distributed applications. It makes it easy to run even pretty complicated applications like a distributed system.

And importantly, it's been drastically improving over the years. New features like dynamic volume provisioning, StatefulSets, and multi-zone clusters have made it much easier to run reliable stateful services. Community innovations like Helm charts have been great for people like me who want to make it easy for other people to run an application they develop on Kubernetes. And for end users, the increasing number of managed Kubernetes services these days make it so that you don't have to run your own cluster.

However, the situation hasn't really improved if you want your service to span across multiple regions or multiple Kubernetes clusters. There have been early efforts, such as the Ubernetes project, and the recent Federation v2 project is still ongoing, but nothing has yet solved the problem of running a distributed system that spans multiple clusters. It's still a very hard experience that isn't really documented.

Why run a multi-region cluster?

You might ask why you would want to run a system across multiple regions, and that maybe there's a reason that this work hasn't been done by the community. I'd argue there are a few big reasons:

User latency. The further away your service is from your users, the longer it takes for them to receive a response. This is especially critical if you're running a global service that has users all over the world. If you don't have any instances of your service near them, they're going to have to wait hundreds of milliseconds to get responses back from your service. This is a very strong reason for wanting to run across multiple regions.
Fault tolerance also becomes very important. Your application should continue running even if an entire data center or an entire region of data centers goes offline for some terrible reason. In this case, if the data center over in California goes down, the users on the West Coast might have to wait a little longer for their responses, but they'll still be able to use your service because of the other replicas that are stood up on the other parts of the country.
Bureaucratic reasons. There are laws in many countries now around where you're allowed to store users' data without their permission. You might have to store the data from users in China in China, and you might need to store the data from users in Russia in Russia, and having a system that spans multiple regions enables you to do that. If you only have a data center in the US, then you're going to have a tough time.

In this blog, I want to talk about the practical experience of running a distributed system across multiple Kubernetes clusters and what it's really like. I’m going to cover:

Why it's hard
What kinds of detailed information about Kubernetes or about your distributed system you should know before you get started
What solutions are available, and which solutions we've put into production and seen work well for us.

Why it’s hard

The question of why it's hard is something that many people don't even think to ask. I've had a number of conversations with folks where they'll assume that since CockroachDB runs really well across regions and because Kubernetes makes it really easy to run CockroachDB, that it's just naturally going to be easy to run CockroachDB across multiple Kubernetes clusters spanning regions. In fact it's not easy at all.

You might have noticed that so far I've been using the terms "multi-region" and "multi-cluster" essentially interchangeably. Kubernetes is not designed to support a single cluster that spans multiple regions on the wide area network. For quite a while, it wasn't even recommended to have a single cluster span multiple availability zones within a region. The community fought for that capability, and now it is a recommended configuration called a "multi-zone cluster".

But running a single Kubernetes cluster that spans regions is definitely done at your own risk. I don't know anyone who would recommend it. So I'm going to keep using these terms - “multi-cluster” and “multi-region” - mostly interchangeably. If you want to run something like CockroachDB across multiple regions, you are necessarily going to have multiple Kubernetes clusters, at least one in each region.

To better understand the problem of setting up a system like CockroachDB across multiple regions, it helps to understand: What does a system like CockroachDB, or any other stateful service, need to run reliably in the first place?

For this blog post, I will be using CockroachDB as the example because it’s the system that I'm most familiar with running in Kubernetes.

Cockroach requires roughly three things in order to run reliably.

Each Cockroach node must have persistent storage that will survive across restarts.
Each node can communicate directly with each other node.
Each node must have a network address that it can advertise to the other nodes so that they can talk to each other reliably. That address ideally should survive restarts so that you're not always changing addresses. (We can be flexible with this one. CockroachDB will work if your nodes are changing network addresses, but this is something that most stateful systems need, and so I will leave it for the sake of completeness.)

Many stateful systems need these three things, and Kubernetes provides these all very well within the context of a single cluster. The controller, known as a StatefulSet, was designed to provide exactly these three things because they're so commonly needed.

The Kubernetes StatefulSet gives each node its own disk or multiple disks as well as its own address. Kubernetes itself guarantees that all pods can communicate directly with each other. But across multiple Kubernetes clusters, we lose requirement #3 above: there's no cross-cluster addressing system. We often lose requirement #2, as well: It's often not set up such that a pod in one cluster can talk directly to a pod in another cluster. I call this “pod-to-pod connectivity” later in this post.

So the core problems of running a multi-region system all come down to the networking.

You might lose pod-to-pod communication across clusters. You might have to communicate across separate private networks in order to talk from one cluster to another.

For example, if you're on AWS, each Kubernetes cluster in a different region is going to be in a different VPC, so you will have to connect those with some sort of VPC peering or a VPN. This is going to be even more true if you're spanning multiple cloud providers.

Finally, we also need a persistent address that works both within and between the clusters. One option is to have two separate addresses -- one that works within the local cluster and one that works between them. This works as long as your system can handle having separate addresses. While this works for CockroachDB, it wouldn't work quite so well for other distributed systems that assume just one address per instance.

Why is networking the problem?

Kubernetes specifies a list of requirements from the network that it runs on. Each pod running in a cluster has to be given its own IP address. The IP address that the pod itself sees inside its container must be the same IP address that other pods in the network see it as. So a pod running ifconfig should see the same IP address as any other pod that it connects to. All this communication should happen directly, without any form of NAT.

This poses a real problem because most users want to run a lot of pods on each machine, but traditional networks only allocate one or two IP addresses to each host in a cluster. So extra work is needed to allocate new routable IP addresses for each of the pods on these machines.

As Kubernetes has grown in popularity, dozens of solutions have been built over the last few years to satisfy these requirements. Each has details that differ from each other, making it difficult to rely on any particular implementation details to be true for any given Kubernetes cluster.

In all of this, the multi-cluster scenario was left completely unspecified, making all those little implementation details of the different networking solutions really critical because most solutions aren't trying to provide any sort of guarantee about how multi-cluster works, so the way that multi-cluster works is completely left up to the unintentional implementation details.

This makes using multi-cluster a bit of a minefield because you won't know exactly how the network underneath you works unless you know exactly which network solution the cluster is using. There are a lot of networking solutions out there — more than 20 are listed in the Kubernetes networking docs alone. They all work differently, so there's only so much in common that you can depend on, and that means that the way you make multi-cluster networking work will have to depend on your networking provider. This is really frustrating from my perspective where I'm trying to make configurations that everyone can use without needing to be Kubernetes experts.

We can group the numerous networking solutions into three high-level categories:

Those that enable direct pod-to-pod communication across clusters out of the box. (Includes Google Kubernetes Engine and Amazon's Elastic Kubernetes Service)
Those that make it easy to enable. (Includes Azure Kubernetes Service \[“advanced networking mode”] and Cilium \["cluster mesh"])
Those that still make it quite difficult (I won’t name names).

Solutions: What can you do?

This is a tough problem. With so many networking providers being used under the hood, it's unlikely that there's one single solution, but I will explore a few of them, and the problems they may still present.

Note: I did most of my work on this in the first half of the year, so it's possible some development has happened here that I haven't heard about that gets rid of some of these cons. I've been keeping up and trying to play with things every once in a while, and I haven't seen a great solution that's fully mature yet. Work is being done on it, though, which we'll see later.

Some of the solutions:

Don't work in certain environments.
Break very easily if the operator of the clusters isn't very careful about what they're doing.
Use a slower data path, giving you a less performance system.
Don't work well with TLS certificates, so you might have to run your cluster in what we at Cockroach call "insecure mode", where you aren't relying on certificates to secure all communication between nodes.
Rely on an immature or very complex system that isn't quite ready to be used in production yet, even if it's on the right track.

Note: For most of these solutions, I'm assuming that in each Kubernetes cluster, you're just going to run a StatefulSet. I'm then focusing on how to connect the pods from each cluster together. Each StatefulSet can be managed independently, meaning that if you wanted to scale up the nodes that you have running in one Kubernetes cluster, you would just scale the StatefulSet that's running there.

Solution #1: Static pod placement + HostNetwork

One solution is to pretend that you're not really using Kubernetes at all.

Kubernetes allows you to statically place pods on specific nodes, or you can use a DaemonSet to put one pod on every node. You can then use the HostNetwork option in the Kubernetes configuration file to tell the pod to use the host machine's network directly. The pod will then be able to use the host machine's IP address or hostname for all of its communication.

This means that your pod will be using the ports directly on the host's network, so you can't have more than one Cockroach pod on that machine because you're going to run into port conflicts if you do. However, you can use that host's IP address to communicate between nodes in different clusters. As long as you've set up your network such that your VMs or your physical machines in the different clusters can directly talk to each other, then Cockroach pods on those machines will be able to directly talk to each other.

In order to avoid changing addresses all the time and potentially getting some sort of split brain situation, you're probably going to want to statically assign these pods to machines. Your options are to say directly in the config file which host you want each pod to run on, or to use the DaemonSet such that you're just always running one Cockroach instance on each machine, keeping their IPs the same.

This is great because it works even if you don't have that pod-to-pod IP address communication between clusters. You're relying on the machines in each cluster having direct IP routability, and this has no moving parts. No need for any cross-cluster naming service. You'll just have to set up your join flags when you start nodes to include some of the machine's IP addresses.

Another perk of this configuration is that it makes it very natural to use each machine's local disk for storage, so you don't have to deal with creating and managing remotely attached persistent volumes.

One big caveat with the above is that it depends on your host's IP addresses not changing.

This is something to be especially careful about if you’re running on a public cloud. If you're using a hosted solution like GKE or EKS, they do node software upgrades VM by just tearing down each VM and recreating it. When it gets recreated, it's very likely to get a different IP address. Because join flags rely on those IP addresses staying the same, this could lead to a new cluster or nodes being started up that are not able to connect back to the cluster after being restarted during these upgrades.

Note: You want to be really careful on GKE, which can sometimes do automatic node upgrades if you don't configure it not to.

A workaround is to orchestrate your own custom upgrade process, or to modify the IP addresses in the join flags while the upgrades are happening.

I have run this in test environments, and it works well because you get high performance because the nodes are just talking directly to each other. Once you set it up, you don't have to think much about it, other than during those upgrade scenarios.

It requires a bit of manual work to edit the configuration files to put in the join IP addresses, and, assuming you want to run in secure mode, to create the certificates for each IP address. But overall, you may get some extra reliability by doing it this way because any sort of dynamic movement that Kubernetes would otherwise do to move your pods around and have to detach and reattach disks becomes a non-issue because none of that is happening here. For those who would argue that they don't trust Kubernetes to run their stateful services, they must trust something like this because Kubernetes is prevented from getting in the way of things.

Solution #2: Use an external load balancer for each pod

This requires running a StatefulSet in each cluster, and then creating a public load balancer for each CockroachDB pod. It is really easy to do in all of the major cloud providers because they all have plug-ins for Kubernetes that make it easy to expose a service from within a cluster, giving a separate load balanced address to each of the Cockroach nodes.

For example: If we have three Cockroach nodes in our cluster, we'll get a separate load-balanced address for cockroachdb-0, cockroachdb-1, and cockroachdb-2. We take those addresses from the load balancers and have all the pods connect to each other using those addresses. We then create certificates for all the nodes using those addresses. Since those load balancer addresses are stable, the underlying CockroachDB pods can be deleted and recreated, moved from one node to another, and so on, and the load balanced addresses will always route to the correct Cockroach pod.

Solution #2: The Pros

We have had customers using this for many months. It works even without pod-to-pod communication between clusters, as in the previous solution, because all communication between pods that are in different clusters is going through that load balanced IP address.

It continues working even as Kubernetes hosts churn in and out of the cluster.
It avoids the problems of the previous solution because even if nodes are being deleted and recreated all the time in the cloud, the pods are still using the same load balanced IP addresses that never change.
It doesn't require you to add any extra moving parts to your system, other than the load balanced IPs. You don't have to run some service that allows the Cockroach pods in one cluster to be able to find the Cockroach pods in another cluster because you just have those load balancers that are presumably being run by your cloud provider.

Solution #2: The Cons

However, there are some problems with this option. Most notably, it is hard to configure and can become expensive.

It can be a little slower than the previous solution, because on all but the most sophisticated load balancers, when you send a packet to that load balanced IP address, the load balancer will forward it to any random host in the other Kubernetes cluster. Then that other host has to look up where the pod is actually running and forward the packet to that host.

Note: Some providers are smarter. I believe the load balancer integration of Google Cloud is smart enough now that the load balancer will send the packet directly to where the pod is running. However, this is not going to be true everywhere, so you might see slightly worse performance going this route.

It can be a pain to have to create a new load balancer every time you want to add a new node to your cluster. If you forget, which can be easy to do because you might just run the kubectl scale command, you'll see that your new pod isn't working because it doesn't have an IP address already configured for it. So you have to remember to do that.
It can be expensive to run so many load balancers. This will again depend on the cloud, but creating this many load balancers on the major cloud providers could be quite costly. You might also not be able to do this on-prem if you don't have good load balancer integration between your Kubernetes cluster and some sort of manually-run load balancer like HAProxy.
It requires a lot of manual config file editing and certificate creating in order to create the complex configuration needed. Once you've set it up, it's going to be very reliable. I haven't heard of any outages caused by this sort of configuration. Just some hiccups while adding a new node to the cluster, but that didn't affect the currently-running nodes, thankfully.

Solution #3: Use pod IPs directly

If you have direct communication between pods in different clusters, you can let the pods advertise their internal pod IP addresses to each other directly. As pods get deleted and recreated on a new node, they might come back up with a different IP address. Some systems, for instance CockroachDB, are able to actually handle this, as long as the disk is moving around with the pod.

You'll also need to set up some sort of load balancer that can be used for the join flag. You need the join flag to have some sort of address that will always be there, but luckily that join flag doesn't get used on the data path, only during initialization. This makes the solution quite fast, with the pods communicating directly with each other across the different clusters.

Solution #3: The Pros

This is very fast. There's no overhead on the data path at all, other than the normal Docker overhead of bringing packets into a container and back out of it again.
This is very resilient to changes. You can add or move hosts as much as you want.
It requires very little work to configure and maintain.

Solution #3: The Cons

The real caveat with this setup is that it can be difficult to do in secure mode. Because IP addresses can change across pod deletions / recreations, creating TLS certificates that can stand up to hostname verification is tricky. We have only run it as a proof of concept in insecure mode. You would need some sort of dynamic certificate creating and signing solution in order to create new certificates any time a pod is given a different IP address than it had before.

So while this has worked really well for insecure deployments, and it's really fast and easy to maintain, setting it up for secure deployments is tough. 5

Solution #4: DNS chaining

This is what we have documented in our docs, and it's what I would call "DNS chaining". Basically we're taking the previous solution where pods are talking directly to each other on their IP addresses, which is great for performance, and adds persistent addresses to that solution. Specifically, we want to add persistent network addresses that will work even as the underlying pods' IP addresses change, which will allow us to put those persistent addresses into the certificates for each pod, and then to run in secure mode.

To do this, we want to configure the built-in DNS server within each Kubernetes cluster to serve DNS entries from the other clusters. The question, of course, is how, because kube-dns is set up to only create DNS names for the services in its own cluster.

DNS chaining: Use CoreDNS instead of the default kube-dns service

One way to do this is to use CoreDNS instead of the default kube-dns. CoreDNS is a much more fully featured DNS server that allows for more customization. It has a full plug-in architecture. There's even an existing plug-in called kubernetai, which was written to allow one CoreDNS server to watch the API server of multiple Kubernetes clusters. With kubernetai, we could set up CoreDNS in each cluster to watch the services in all of our Kubernetes clusters, and then kubernetai would automatically create service names for the services in the other clusters.

However, in practice swapping CoreDNS in for kube-dns on any of the managed Kubernetes offerings is very difficult. It's also difficult to modify each cluster's DNS domain. You'll notice if you look at the full service addresses of Kubernetes services, they all end with .svc.cluster.local (unless you've already customized it yourself). Ideally you would want each cluster to have a separate DNS domain so that it would be easier to distinguish between a service running in one cluster and a service running in another cluster so you don't run into conflicts between them. Earlier this year, I tried for a couple days to replace kube-dns or to set up a different cluster DNS domain on GKE and had no luck.

With CoreDNS becoming the standard DNS server as of 1.13, this may get more feasible. Setting up a custom DNS domain might still be very difficult, so there would still be that roadblock. The kubernetai plug-in that we would want to use also doesn't come in the default CoreDNS Docker image that gets used, so there would still be some difficulties in distributing the plugin, but this is becoming more feasible as CoreDNS becomes the standard.

DNS chaining: Use CoreDNS alongside the default kube-dns service

Another alternative which I haven't fully explored is to run CoreDNS alongside the default kube-dns service. We could configure kube-dns to defer certain look-ups to CoreDNS, and then have our own CoreDNS that's configured exactly the way that we want it running in each cluster watching the other Kubernetes clusters. We could then use some of CoreDNS's plug-ins or the rewrite rules to make the cross-cluster look-ups work out. I haven't fully tried this, and I haven't heard of anyone else that's actually doing it. So if you're going to try this out, you'll be in uncharted territory, but it should be feasible.

DNS chaining: Chain DNS servers together using “stub domains”

What we do for systems like GKE and DKS is use what are called "stub domains". Stub domains are a kube-dns config feature that allow you to defer DNS look-ups for certain domains to the name server of your choice. It allows us to configure a suffix that enables the kube-dns in one cluster to redirect look-ups for anything that ends in that suffix to a different cluster's DNS service.

For example, if you have a cluster in us-west-1b and you do a lookup for a DNS address that ends in us-east-1b.svc.cluster.local, then kube-dns will know to send that along to the DNS server that's running in that us-east-1b cluster. I've put together some scripts for this, and it’s documented on our website for people to try.

“Stub domains”: The Pros

There's no overhead on the data path other than the normal Docker overhead, which is tough to avoid. It's very resilient to hosts being edited or removed.
There's no need to configure cross-cluster naming service. You're just using the built-in DNS, and there's actually no manual work needed to get this running.

We have a script in the CockroachDB GitHub repository that requires you to fill in a few parameters around where your clusters are running and what their kubeconfig names are. You can just run the script and in a couple minutes you'll have a functional multi-region Cockroach cluster in Kubernetes.

You need to have pod-to-pod communication across clusters for this to work, and you need to set up some way for the DNS server in one cluster to reliably reach the DNS cluster in another server. So you can do this using an internal load balancer on the various clouds, or you could use an external load balancer, or you could come up with your own solution for that. It's not great.

On GCE today you have to use an external load balancer because their internal load balancers don't work across regions. I believe they're working on fixing that, but I don't know when a solution will be around, and so that would be exposing your DNS server to the public internet, which may have security repercussions depending on your cluster setup.

You could put firewalls in place to keep that from being as much of an issue, but it's something to definitely be concerned about. You might end up needing to run your own load balancer to enable these DNS servers to find each other on the network, and at that point, maybe you should just use that load balancer to handle the pod-to-pod communication.

Solution #5: Istio

Istio is a project that has become popular over the last year or two. They've been working on a multi-cluster mode to handle the problem of addressing across clusters for the last half year or so.

Note: This description reflects the state of multi-cluster support in Istio version 1.0, which was released in July 2018. The 1.1 release that should be coming out in early 2019 takes some massive steps forward toward solving these problems.

Istio explicitly punts on the problem of enabling pod-to-pod communication. All it does right now is naming. You install the Istio control plane in one primary Kubernetes cluster, and then you have to install Istio’s remote component in all of your other clusters. It has a very small overhead because packets go through the envoy proxy — an efficient, low-level proxy written in C++ that's in use at a number of companies. Like past solutions, it's very resilient to hosts being edited or removed.

The cons, however, are still pretty heavy. The entire control planning runs in a single Kubernetes cluster, so if that region goes down, you've lost your connectivity between clusters. This leaves a single point of failure, and that hasn't been resolved yet. You also have to solve the problem of the Istio components' IP addresses potentially changing as they're restarted, which is the same problem we're trying to solve for Cockroach in the first place.

While Istio is super promising and hopefully will solve the problem in the next year, as far as I can tell, it's pretty far from being suitable for production use. It's under active development and it looks like it will improve quite a lot very soon, though.

Conclusion

Multi-cluster networking is pretty tricky. It's much more complicated than I would have expected or than most people would expect, and unfortunately not a whole lot has been done about it for years. This past year is the first I've seen a lot of good movement in the right direction. And even after you nail pod-to-pod connectivity, you still have to solve the problem of providing stable names that work across clusters.

I am really excited about the many different projects like Cilium, Istio, and Crossplane that have spun up that are starting to care about it. Although it's hard to recommend a single answer for everyone, out of these few solutions there's definitely at least one of them which would be very reliable for you if you're willing to spend some time upfront on setup.

If you want to learn more about Kubernetes, visit here, where we’ve aggregated tutorials, videos, and guides.

And if building and automating distributed systems puts a spring in your step, we're hiring! Check out our open positions here.