How to use Cluster Mesh for multi-region Kubernetes pod communication

Thanks to services provided by AWS, GCP, and Azure, it’s become relatively easy to develop applications that span multiple regions. This is great, because slow apps kill businesses. There is one common problem with these applications, though: they are often not backed by a multi-region database architecture.

CockroachDB is built to solve that problem, and we’re doing it in production for many applications today. But that’s not what this blog is about. In this blog, I will walk through a solution to the problem of getting Kubernetes pods to talk to each other in multi-region deployments.

The challenge of Kubernetes pod communication across regions

Let me back up for a second: CockroachDB is often deployed inside Kubernetes. This is because CockroachDB is a single binary, which makes it ideal to run inside a container and in Kubernetes (you can read more about that here). However, deploying CockroachDB across multiple regions or cloud providers presents some challenges. Kubernetes provides a pod network so that our containerized workloads can communicate with each other, but this network is typically not exposed outside of the Kubernetes cluster. This is a problem because all of the CockroachDB pods need to talk to each other.

A typical deployment pattern is for each region to have its own Kubernetes cluster. This makes perfect sense: it keeps latency to a minimum and prevents worker nodes from becoming ‘islanded’ if the network were to go down. It presents a problem for CockroachDB, however, because the pod network is not routable on the LAN and is certainly not visible across regions. How do we overcome this?

Using Cilium’s Cluster Mesh for cross-cluster pod communication

Kubernetes is a framework that allows for the integration of many different components. From a networking perspective, plugins can be developed to the Container Network Interface (CNI) standard. This allows us to use different network plugins to provide different capabilities.

In this scenario we need inter-cluster pod routing. The CNIs developed by the hyperscalers allow this by giving pods an IP address on the virtual network. That helps, but what if we are not in the public cloud, or want to standardize on a single CNI across cloud providers? Here we can use Cilium: this CNI has a capability called Cluster Mesh which allows us to ‘mesh’ different Kubernetes clusters together and enable cross-cluster pod-to-pod communication. Let’s look at how we can achieve this in much more detail.

Set up multi-region Azure network configuration and virtual machines

The first thing that we need to do is prepare the infrastructure. In this example we’ll use Azure, but you could use another cloud like AWS or GCP, or even a mix of them.
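
The commands in the rest of this post reference a handful of shell variables. The values below are just one possible setup (the resource group name is arbitrary; the regions and context names match the values used later in the CoreDNS and deployment script configuration):

rg="crdb-multi-region"        # resource group name (illustrative)
loc1="eastus"                 # region one
loc2="westus"                 # region two
loc3="northeurope"            # region three
clus1="crdb-k3s-$loc1"        # kubeconfig context for cluster one
clus2="crdb-k3s-$loc2"        # kubeconfig context for cluster two
clus3="crdb-k3s-$loc3"        # kubeconfig context for cluster three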

To make it repeatable I have used the Azure CLI to create all of my resources, the first of which is the networking. We start by creating a resource group, which is just a logical container for all of our resources. Although a resource group belongs to a specific region, it can contain resources from other regions, as in our case.

az group create --name $rg --location $loc1

We now create a virtual network in each of our chosen regions. We give each virtual network an address prefix along with a single subnet within that prefix range. None of the regions have overlapping address space; this is really important to ensure routing works as expected and all CockroachDB nodes are able to communicate.

az network vnet create -g $rg -l $loc1 -n crdb-$loc1 --address-prefix 10.1.0.0/16 \
    --subnet-name crdb-$loc1-sub1 --subnet-prefix 10.1.1.0/24
az network vnet create -g $rg -l $loc2 -n crdb-$loc2 --address-prefix 10.2.0.0/16 \
    --subnet-name crdb-$loc2-sub1 --subnet-prefix 10.2.1.0/24
az network vnet create -g $rg -l $loc3 -n crdb-$loc3 --address-prefix 10.3.0.0/16 \
    --subnet-name crdb-$loc3-sub1 --subnet-prefix 10.3.1.0/24

By default the virtual networks are unable to communicate with each other, so we need to create virtual network peerings, two per region.

az network vnet peering create -g $rg -n $loc1-$loc2-peer --vnet-name crdb-$loc1 \
    --remote-vnet crdb-$loc2 --allow-vnet-access --allow-forwarded-traffic --allow-gateway-transit
az network vnet peering create -g $rg -n $loc2-$loc3-peer --vnet-name crdb-$loc2 \
    --remote-vnet crdb-$loc3 --allow-vnet-access --allow-forwarded-traffic --allow-gateway-transit
az network vnet peering create -g $rg -n $loc1-$loc3-peer --vnet-name crdb-$loc1 \
    --remote-vnet crdb-$loc3 --allow-vnet-access --allow-forwarded-traffic --allow-gateway-transit
az network vnet peering create -g $rg -n $loc2-$loc1-peer --vnet-name crdb-$loc2 \
    --remote-vnet crdb-$loc1 --allow-vnet-access --allow-forwarded-traffic --allow-gateway-transit
az network vnet peering create -g $rg -n $loc3-$loc2-peer --vnet-name crdb-$loc3 \
    --remote-vnet crdb-$loc2 --allow-vnet-access --allow-forwarded-traffic --allow-gateway-transit
az network vnet peering create -g $rg -n $loc3-$loc1-peer --vnet-name crdb-$loc3 \
    --remote-vnet crdb-$loc1 --allow-vnet-access --allow-forwarded-traffic --allow-gateway-transit
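
To confirm the peerings have been created and are connected, the peering state on each virtual network can be listed:

az network vnet peering list -g $rg --vnet-name crdb-$loc1 --query "[].{Name:name, State:peeringState}" -o table
az network vnet peering list -g $rg --vnet-name crdb-$loc2 --query "[].{Name:name, State:peeringState}" -o table
az network vnet peering list -g $rg --vnet-name crdb-$loc3 --query "[].{Name:name, State:peeringState}" -o table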

In Azure, network interfaces (NICs) are resources in their own right. Each NIC is associated with a particular subnet, a public IP address, and a network security group, and is later attached to a virtual machine. Note that the NIC commands below reference the network security groups created in the next step, so create those first if you are running these commands in order.

az network public-ip create --resource-group $rg --location $loc1 --name crdb-$loc1-ip1 --sku standard
az network public-ip create --resource-group $rg --location $loc1 --name crdb-$loc1-ip2 --sku standard
az network public-ip create --resource-group $rg --location $loc1 --name crdb-$loc1-ip3 --sku standard
az network public-ip create --resource-group $rg --location $loc2 --name crdb-$loc2-ip1 --sku standard
az network public-ip create --resource-group $rg --location $loc2 --name crdb-$loc2-ip2 --sku standard
az network public-ip create --resource-group $rg --location $loc2 --name crdb-$loc2-ip3 --sku standard
az network public-ip create --resource-group $rg --location $loc3 --name crdb-$loc3-ip1 --sku standard
az network public-ip create --resource-group $rg --location $loc3 --name crdb-$loc3-ip2 --sku standard
az network public-ip create --resource-group $rg --location $loc3 --name crdb-$loc3-ip3 --sku standard
az network nic create --resource-group $rg -l $loc1 --name crdb-$loc1-nic1 --vnet-name crdb-$loc1 --subnet crdb-$loc1-sub1 --network-security-group crdb-$loc1-nsg --public-ip-address crdb-$loc1-ip1
az network nic create --resource-group $rg -l $loc1 --name crdb-$loc1-nic2 --vnet-name crdb-$loc1 --subnet crdb-$loc1-sub1 --network-security-group crdb-$loc1-nsg --public-ip-address crdb-$loc1-ip2
az network nic create --resource-group $rg -l $loc1 --name crdb-$loc1-nic3 --vnet-name crdb-$loc1 --subnet crdb-$loc1-sub1 --network-security-group crdb-$loc1-nsg --public-ip-address crdb-$loc1-ip3
az network nic create --resource-group $rg -l $loc2 --name crdb-$loc2-nic1 --vnet-name crdb-$loc2 --subnet crdb-$loc2-sub1 --network-security-group crdb-$loc2-nsg --public-ip-address crdb-$loc2-ip1
az network nic create --resource-group $rg -l $loc2 --name crdb-$loc2-nic2 --vnet-name crdb-$loc2 --subnet crdb-$loc2-sub1 --network-security-group crdb-$loc2-nsg --public-ip-address crdb-$loc2-ip2
az network nic create --resource-group $rg -l $loc2 --name crdb-$loc2-nic3 --vnet-name crdb-$loc2 --subnet crdb-$loc2-sub1 --network-security-group crdb-$loc2-nsg --public-ip-address crdb-$loc2-ip3
az network nic create --resource-group $rg -l $loc3 --name crdb-$loc3-nic1 --vnet-name crdb-$loc3 --subnet crdb-$loc3-sub1 --network-security-group crdb-$loc3-nsg --public-ip-address crdb-$loc3-ip1
az network nic create --resource-group $rg -l $loc3 --name crdb-$loc3-nic2 --vnet-name crdb-$loc3 --subnet crdb-$loc3-sub1 --network-security-group crdb-$loc3-nsg --public-ip-address crdb-$loc3-ip2
az network nic create --resource-group $rg -l $loc3 --name crdb-$loc3-nic3 --vnet-name crdb-$loc3 --subnet crdb-$loc3-sub1 --network-security-group crdb-$loc3-nsg --public-ip-address crdb-$loc3-ip3

The final part of the network configuration is the Network Security Groups (NSGs). These control network access to our resources. In this demo we are going to allow access on port 22 for SSH, port 6443 for the Kubernetes API, and the port range 30000-32767, which is the NodePort range of Kubernetes.

Step One: Create an NSG in each region.

az network nsg create --resource-group $rg --location $loc1 --name crdb-$loc1-nsg
az network nsg create --resource-group $rg --location $loc2 --name crdb-$loc2-nsg
az network nsg create --resource-group $rg --location $loc3 --name crdb-$loc3-nsg

Step Two: Allow SSH Access

Create a rule in each NSG to allow SSH access to the VMs, which enables us to deploy Kubernetes over SSH with k3sup.

az network nsg rule create -g $rg --nsg-name crdb-$loc1-nsg -n NsgRuleSSH --priority 100 \
    --source-address-prefixes '*' --source-port-ranges '*' \
    --destination-address-prefixes '*' --destination-port-ranges 22 --access Allow \
    --protocol Tcp --description "Allow SSH Access to all VMS."

az network nsg rule create -g $rg --nsg-name crdb-$loc2-nsg -n NsgRuleSSH --priority 100 \
    --source-address-prefixes '*' --source-port-ranges '*' \
    --destination-address-prefixes '*' --destination-port-ranges 22 --access Allow \
    --protocol Tcp --description "Allow SSH Access to all VMS."

az network nsg rule create -g $rg --nsg-name crdb-$loc3-nsg -n NsgRuleSSH --priority 100 \
    --source-address-prefixes '*' --source-port-ranges '*' \
    --destination-address-prefixes '*' --destination-port-ranges 22 --access Allow \
    --protocol Tcp --description "Allow SSH Access to all VMS."

Step Three: Allow Kubernetes API Access

Create a rule in all regions to allow access to the Kubernetes API of each cluster. This will allow us to create the required resources in each cluster to run CockroachDB.

az network nsg rule create -g $rg --nsg-name crdb-$loc1-nsg -n NsgRulek8sAPI --priority 200 \
    --source-address-prefixes '*' --source-port-ranges '*' \
    --destination-address-prefixes '*' --destination-port-ranges 6443 --access Allow \
    --protocol Tcp --description "Allow Kubernetes API Access to all VMS."

az network nsg rule create -g $rg --nsg-name crdb-$loc2-nsg -n NsgRulek8sAPI --priority 200 \
    --source-address-prefixes '*' --source-port-ranges '*' \
    --destination-address-prefixes '*' --destination-port-ranges 6443 --access Allow \
    --protocol Tcp --description "Allow Kubernetes API Access to all VMS."

az network nsg rule create -g $rg --nsg-name crdb-$loc3-nsg -n NsgRulek8sAPI --priority 200 \
    --source-address-prefixes '*' --source-port-ranges '*' \
    --destination-address-prefixes '*' --destination-port-ranges 6443 --access Allow \
    --protocol Tcp --description "Allow Kubernetes API Access to all VMS."

Step Four: Allow NodePort Access

Create a rule in each region to open the Kubernetes NodePort range so that any services we expose are reachable.

az network nsg rule create -g $rg --nsg-name crdb-$loc1-nsg -n NsgRuleNodePorts --priority 300 \
    --source-address-prefixes '*' --source-port-ranges '*' \
    --destination-address-prefixes '*' --destination-port-ranges 30000-32767 --access Allow \
    --protocol Tcp --description "Allow Kubernetes NodePort Access to all VMS."

az network nsg rule create -g $rg --nsg-name crdb-$loc2-nsg -n NsgRuleNodePorts --priority 300 \
    --source-address-prefixes '*' --source-port-ranges '*' \
    --destination-address-prefixes '*' --destination-port-ranges 30000-32767 --access Allow \
    --protocol Tcp --description "Allow Kubernetes NodePort Access to all VMS."

az network nsg rule create -g $rg --nsg-name crdb-$loc3-nsg -n NsgRuleNodePorts --priority 300 \
    --source-address-prefixes '*' --source-port-ranges '*' \
    --destination-address-prefixes '*' --destination-port-ranges 30000-32767 --access Allow \
    --protocol Tcp --description "Allow Kubernetes NodePort Access to all VMS."

Remember, this is a demo and not designed for production use, so consider more restrictive rules in a real-world scenario. Below is a diagram that depicts these resources.

NSG Rules for Azure Deployment

The final element of the infrastructure is the nine virtual machines that will support the installations of Kubernetes. Three virtual machines will be deployed into each region.

Region One:

az vm create \
  --resource-group $rg \
  --location $loc1 \
  --name crdb-$loc1-node1 \
  --image UbuntuLTS \
  --nics crdb-$loc1-nic1 \
  --admin-username ubuntu \
  --generate-ssh-keys

az vm create \
  --resource-group $rg \
  --location $loc1 \
  --name crdb-$loc1-node2 \
  --image UbuntuLTS \
  --nics crdb-$loc1-nic2 \
  --admin-username ubuntu \
  --generate-ssh-keys

az vm create \
  --resource-group $rg \
  --location $loc1 \
  --name crdb-$loc1-node3 \
  --image UbuntuLTS \
  --nics crdb-$loc1-nic3 \
  --admin-username ubuntu \
  --generate-ssh-keys

Region Two:

az vm create \
  --resource-group $rg \
  --location $loc2 \
  --name crdb-$loc2-node1 \
  --image UbuntuLTS \
  --nics crdb-$loc2-nic1 \
  --admin-username ubuntu \
  --generate-ssh-keys

az vm create \
  --resource-group $rg \
  --location $loc2 \
  --name crdb-$loc2-node2 \
  --image UbuntuLTS \
  --nics crdb-$loc2-nic2 \
  --admin-username ubuntu \
  --generate-ssh-keys

az vm create \
  --resource-group $rg \
  --location $loc2 \
  --name crdb-$loc2-node3 \
  --image UbuntuLTS \
  --nics crdb-$loc2-nic3 \
  --admin-username ubuntu \
  --generate-ssh-keys

Region Three:

az vm create \
  --resource-group $rg \
  --location $loc3 \
  --name crdb-$loc3-node1 \
  --image UbuntuLTS \
  --nics crdb-$loc3-nic1 \
  --admin-username ubuntu \
  --generate-ssh-keys

az vm create \
  --resource-group $rg \
  --location $loc3 \
  --name crdb-$loc3-node2 \
  --image UbuntuLTS \
  --nics crdb-$loc3-nic2 \
  --admin-username ubuntu \
  --generate-ssh-keys

az vm create \
  --resource-group $rg \
  --location $loc3 \
  --name crdb-$loc3-node3 \
  --image UbuntuLTS \
  --nics crdb-$loc3-nic3 \
  --admin-username ubuntu \
  --generate-ssh-keys
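
With all nine VMs created, it is worth listing their public IP addresses before moving on, since the Kubernetes installation in the next section connects to them over SSH:

az vm list -d -g $rg --query "[].{Name:name, PublicIP:publicIps, State:powerState}" -o table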

K3s Kubernetes Deployment

In the demo, we’ll be using k3s as the Kubernetes distribution. This is a lightweight, CNCF-certified version of Kubernetes that runs as a single binary, which makes it easy to deploy and quick to start. The deployment tool k3sup (said ‘ketchup’) can be used to deploy k3s to our virtual machines. k3s has a server/agent architecture; in this demo, one initial server and two joining nodes will be deployed in each region, and each of these nodes will run all of the Kubernetes roles (control plane, etcd, worker). In a production environment it is recommended to separate these roles out onto separate infrastructure.

As part of the install process the default CNI, Flannel, is disabled to allow for the deployment of Cilium. Cilium will be the network provider that connects the clusters together and allows cross-cluster network communication. We can download and install k3sup with the following commands.

curl -sLS https://get.k3sup.dev | sh
sudo install k3sup /usr/local/bin/
k3sup --help

Next, we store the public IP address of the server node in an environment variable using the Azure CLI and then deploy k3s.

MASTERR1=$(az vm show -d -g $rg  -n crdb-$loc1-node1 --query publicIps -o tsv)
k3sup install \
  --ip=$MASTERR1 \
  --user=ubuntu \
  --sudo \
  --cluster \
  --k3s-channel stable \
  --k3s-extra-args '--flannel-backend none --disable-network-policy' \
  --merge \
  --local-path $HOME/.kube/config \
  --context=$clus1

Then deploy k3s to the two remaining nodes using the same approach.

AGENT1R1=$(az vm show -d -g $rg  -n crdb-$loc1-node2 --query publicIps -o tsv)
k3sup join \
  --ip $AGENT1R1 \
  --user ubuntu \
  --sudo \
  --k3s-channel stable \
  --server \
  --server-ip $MASTERR1 \
  --server-user ubuntu \
  --sudo \
  --k3s-extra-args '--flannel-backend=none --disable-network-policy'
AGENT2R1=$(az vm show -d -g $rg  -n crdb-$loc1-node3 --query publicIps -o tsv)
k3sup join \
  --ip $AGENT2R1 \
  --user ubuntu \
  --sudo \
  --k3s-channel stable \
  --server \
  --server-ip $MASTERR1 \
  --server-user ubuntu \
  --sudo \
  --k3s-extra-args '--flannel-backend=none --disable-network-policy'

Now we just need to repeat the previous two steps for the other two regions; a sketch of the server install for region two is shown below.
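
For example, the server install for region two might look like the following, assuming the $clus2 context name defined earlier (MASTERR2 is just a variable name mirroring MASTERR1); the agent joins then follow the same pattern as region one:

MASTERR2=$(az vm show -d -g $rg -n crdb-$loc2-node1 --query publicIps -o tsv)
k3sup install \
  --ip=$MASTERR2 \
  --user=ubuntu \
  --sudo \
  --cluster \
  --k3s-channel stable \
  --k3s-extra-args '--flannel-backend none --disable-network-policy' \
  --merge \
  --local-path $HOME/.kube/config \
  --context=$clus2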

How to use Cilium Cluster Mesh

In this demo the Cilium CLI is used to deploy the CNI and configure Cluster Mesh. In a production environment a GitOps approach could be used to automate the deployment. The Cilium CLI can be downloaded and installed using the following commands.

curl -L --remote-name-all https://github.com/cilium/cilium-cli/releases/latest/download/cilium-linux-amd64.tar.gz{,.sha256sum}
sha256sum --check cilium-linux-amd64.tar.gz.sha256sum
sudo tar xzvfC cilium-linux-amd64.tar.gz /usr/local/bin
rm cilium-linux-amd64.tar.gz{,.sha256sum}
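
With the CLI in place, Cilium is installed into each cluster and the clusters are then connected to each other. The exact flags vary between cilium-cli versions, so treat the following as a minimal sketch rather than a definitive recipe; the cluster names, IDs, and pod CIDRs are illustrative (each cluster needs a unique name, a unique ID, and a non-overlapping pod CIDR):

# Install Cilium in each cluster with a unique name, ID, and pod CIDR.
cilium install --context $clus1 \
  --set cluster.name=crdb-k3s-eastus --set cluster.id=1 \
  --set ipam.mode=cluster-pool \
  --set ipam.operator.clusterPoolIPv4PodCIDRList='{10.11.0.0/16}'
cilium install --context $clus2 \
  --set cluster.name=crdb-k3s-westus --set cluster.id=2 \
  --set ipam.mode=cluster-pool \
  --set ipam.operator.clusterPoolIPv4PodCIDRList='{10.12.0.0/16}'
cilium install --context $clus3 \
  --set cluster.name=crdb-k3s-northeurope --set cluster.id=3 \
  --set ipam.mode=cluster-pool \
  --set ipam.operator.clusterPoolIPv4PodCIDRList='{10.13.0.0/16}'

# Enable the Cluster Mesh control plane in each cluster, exposed via NodePort.
cilium clustermesh enable --context $clus1 --service-type NodePort
cilium clustermesh enable --context $clus2 --service-type NodePort
cilium clustermesh enable --context $clus3 --service-type NodePort

# Connect the clusters together (each connection is established in both directions).
cilium clustermesh connect --context $clus1 --destination-context $clus2
cilium clustermesh connect --context $clus1 --destination-context $clus3
cilium clustermesh connect --context $clus2 --destination-context $clus3

# Wait for the mesh to report ready.
cilium clustermesh status --context $clus1 --wait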

The architecture of the Cilium Cluster Mesh control plane is based on etcd. Each cluster maintains its own instance of etcd containing the current state of that cluster; state from multiple clusters is never mixed in etcd. Each cluster exposes its own etcd via a set of proxies, and agents running in other clusters connect to these proxies to watch for changes in cluster state and replicate the multi-cluster changes into their own clusters. Access is protected with TLS certificates, and access to a cluster’s etcd from other clusters is always read-only. This ensures that failures in one cluster never propagate into other clusters. Configuration occurs via a simple Kubernetes secret resource that contains the addressing information of the remote etcd proxies along with the cluster name and the certificates required to access the proxies.

Cilium Cluster Mesh

Pod IP routing is the primary capability of the multi-cluster feature: it allows pods across clusters to reach each other via their pod IPs. Cilium can operate in several modes to perform pod IP routing, all of which are capable of multi-cluster pod IP routing.

Tunneling mode encapsulates all network packets emitted by pods in a so-called encapsulation header, which can be a VXLAN or Geneve frame. This encapsulated frame is then transmitted as a standard UDP packet. The concept is similar to a VPN tunnel.

Cilium Cluster Mesh

The pod IPs are never visible on the underlying network. The network only sees the IP addresses of the worker nodes. This can simplify installation and firewall rules.

The additional network headers required will reduce the theoretical maximum throughput of the network. The exact cost will depend on the configured MTU and will be more noticeable when using a traditional MTU of 1500 compared to the use of jumbo frames at MTU 9000.
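
Whichever routing mode is in use, a quick way to sanity-check cross-cluster pod IP routing once the mesh is up is to look up a pod IP in one cluster and ping it from a throwaway pod in another (the pod name and busybox image here are illustrative):

# Find a pod IP in cluster two (any running pod will do).
kubectl --context $clus2 get pods -A -o wide

# From cluster one, ping that pod IP directly (substitute the IP from the output above).
kubectl --context $clus1 run ping-test --rm -it --restart=Never \
  --image=busybox -- ping -c 3 <pod-ip-from-cluster-two>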

CockroachDB Deployment Across Multiple Regions

In the GitHub repository for CockroachDB there is a Python script to help automate the deployment of CockroachDB across multiple regions. We need to update the script with the names of the Kubernetes contexts and regions that were created in the previous steps.

contexts = {
    'eastus': 'crdb-k3s-eastus',
    'westus': 'crdb-k3s-westus',
    'northeurope': 'crdb-k3s-northeurope',
}
regions = {
    'eastus': 'eastus',
    'westus': 'westus',
    'northeurope': 'northeurope',
}

Once saved, the script can be run as shown below. It deploys all the resources required across the three regions to run CockroachDB; after it completes, there is one final step to perform.
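
Assuming the script in question is the multi-region setup.py from the CockroachDB repository (the name is an assumption on my part; adjust it to whatever helper you are using), running it looks like this:

python setup.py   # script name assumed; run from the directory containing the multi-region script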

In each Kubernetes cluster there is a deployment of CoreDNS, which is responsible for name resolution in the cluster where it is deployed. It cannot, however, resolve names from the other Kubernetes clusters without changes to the ConfigMap containing the CoreDNS configuration. This configuration can be updated to include forwarders for the two other clusters, so that pods in one cluster can resolve the names of pods in another cluster. This is required for CockroachDB because all nodes in the cluster must be able to communicate with each other. Below is an example of the changes required in each Kubernetes cluster:

westus.svc.cluster.local:53 {       # <---- Modify
    log
    errors
    ready
    cache 10
    forward . IP1 IP2 IP3 {      # <---- Modify
    }
}
northeurope.svc.cluster.local:53 {       # <---- Modify
    log
    errors
    ready
    cache 10
    forward . IP1 IP2 IP3 {      # <---- Modify
    }
}

Apply the updated ConfigMap in each of the clusters, restart CoreDNS so it picks up the change (a sketch is shown below), and the CockroachDB cluster should form.
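
On k3s, the CoreDNS ConfigMap and Deployment are both named coredns and live in the kube-system namespace (adjust the names if your distribution differs). One way to make the change and roll it out in each cluster:

# Edit the Corefile in place (or apply a saved manifest instead).
kubectl --context $clus1 -n kube-system edit configmap coredns

# Restart CoreDNS so it reloads the updated configuration.
kubectl --context $clus1 -n kube-system rollout restart deployment coredns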

Networking Principles for Multi-Region

When you embark on your CockroachDB journey to a multi-region cluster, remember some basic networking principles before you start your deployment:

• Connectivity: all the pods in your cluster need to be able to talk to each other, so pick a solution that allows this. Changing it later will be disruptive!

• DNS: connectivity is one thing, but name resolution can also be a tall order. Each cluster has its own DNS deployment and has no awareness of your other clusters, so make sure you plan for this as well.

Once you have these sorted out it is pretty much smooth sailing. Here is a link to a GitHub repo that contains more detailed step-by-step instructions with all the code required for you to deploy this solution in your own subscription. If you give this a try and have questions, you can ask them in our community Slack or just send us a note on Twitter.
