Database scaling strategies: A practical approach

In tech, we hear about the importance of “scale” all the time. People plan for it, try to work around not having it, and build companies to help others achieve it. But when it comes time to scale something yourself or integrate a scalable solution with your app, it’s difficult to find practical guides to help you understand what it takes.

Why’s that? Well, it’s kind of hard. Actually scaling a database beyond a single availability zone takes considerable planning and engineering investment. That being said, it’s an incredibly powerful way to delight your users with low latencies and high availability.

To help you understand some high-level considerations, this post will cover:

Understanding your database in the context of scale
Determining both performance & geographic strategies

If this is interesting to you, we also have a guide that covers more detail of the points above, as well as:

Deployment strategies for multi-region database deployments
Networking to handle complicated use cases

Interested? Download our guide, Scaling Databases with Multi-Region Deployments.

Using Your Distributed Database of Choice

Understanding Your Database in a Multi-Region Context

The best place to begin is in understanding how the database you want to use actually works once you’ve scaled it to a multi-region context, including any workarounds you might need to develop along the way.

Active-Active/Multi-Master

In an ideal world, your database would be able to handle both reads and writes everywhere in the world. For many databases, though, “active-active” deployment poses a lot of difficulties, largely due to conflicts that inevitably occur when multiple nodes accept writes (for a great primer, check out p. 168 of Martin Kleppmann’s book Designing Data-Intensive Applications).

CockroachDB is a notable exception to this paradigm. Because of its consensus replication, CockroachDB lets you read and write from all nodes in the cluster without generating conflicts (which we call “multi-active availability”). This represents the easiest way to have your database span multiple regions.

Replication Impact

When your database is deployed to multiple regions, it has to replicate data between nodes in your deployment. One of the first things to understand is how scaled-out deployments, especially multi-region ones, impact your particular database.

For example, if you use a NoSQL database with eventual consistency, how “eventual” does the consistency become when replicas span continents? It’s important to substantiate this not to dissuade a team from scaling out to multiple regions, but to ensure your SLAs and engineering efforts account for it. You'll also need to account for how your application deals with this replication delay, and whether it could lead to conflicting writes overwriting each other.

Another consideration is how replication impacts your ability to comply with regulations like the General Data Protection Regulation (GDPR). If you’re using Amazon RDS cross-region read replicas, for instance, you have to ensure that the data is not being replicated to locations outside of the EU (for customers who do not allow you to move their data outside of the EU).
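To make "conflicting writes overwriting each other" concrete, here is a minimal sketch of last-write-wins resolution, one common policy in eventually consistent systems. It is an illustration only and not a description of how any particular database resolves conflicts; the Write class, the replica names, and the timestamps are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    timestamp_ms: int  # wall-clock time at the replica that accepted the write
    replica: str       # hypothetical replica name, used only to break ties


def last_write_wins(a: Write, b: Write) -> Write:
    """Keep the write with the later timestamp; break ties by replica name."""
    return max(a, b, key=lambda w: (w.timestamp_ms, w.replica))


# Two regions accept a write to the same key before replication catches up.
eu = Write("shipping=Berlin", 1_000, "eu-west")
us = Write("shipping=Boston", 1_040, "us-east")

winner = last_write_wins(eu, us)
print(winner.value)
```

Under this policy the EU write is silently discarded once the replicas converge, which is exactly the kind of behavior your application needs to be able to tolerate or detect.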

Understand Data Domiciling

It’s crucial when you interact with user data to err on the side of caution. Failure to comply with GDPR (for example, by processing user data outside of the EU without consent) can result in crippling fines. So, even though many EU residents will allow their data to be stored and processed outside the EU, you will do better to account for those who insist on strict data domiciling.

In short, this means that your database must have the ability to partition data based on some row-level key, or your application must contain routing/gateway logic to ensure that writes reach the correct table in your database.
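As a rough illustration of the routing/gateway option, the sketch below picks a write endpoint from a user's region key. The endpoint hostnames, the region codes, and the choice of "us" as the fallback are all assumptions made for the example, not part of any real deployment.

```python
# Hypothetical mapping from a user's home region to the database endpoint
# that is allowed to store that user's rows.
REGION_ENDPOINTS = {
    "eu": "db.eu.example.internal",
    "us": "db.us.example.internal",
}

# Regions whose data must never leave the region (assumed for this sketch).
STRICT_REGIONS = {"eu"}

DEFAULT_REGION = "us"


def endpoint_for_write(user_region: str) -> str:
    """Return the endpoint that should receive writes for a user's data."""
    region = user_region.lower()
    if region in REGION_ENDPOINTS:
        return REGION_ENDPOINTS[region]
    if region in STRICT_REGIONS:
        # Fail closed: never fall back to another region for domiciled data.
        raise LookupError(f"no in-region endpoint configured for {region!r}")
    return REGION_ENDPOINTS[DEFAULT_REGION]


print(endpoint_for_write("EU"))
print(endpoint_for_write("apac"))
```

The design choice worth noting is the fail-closed branch: for a region with strict domiciling, refusing the write is safer than quietly sending it somewhere else.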

CockroachDB offers replication zone configuration to provide table-level control for distributing data, and our Enterprise version will offer row-level partitioning in 2.0, which can radically simplify GDPR compliance.

Geographic Strategy

The most important part of a multi-region deployment is the regions themselves. This is ultimately the factor that’s going to provide the bang for your buck. When you place data close to users, they get a better, faster experience because their requests travel shorter distances.

Know Your Audience

First things first: you need to know where your user base is. This can be as simple as polling your user database and finding the most granular piece of geographic data you can.

If we were to imagine this as a SQL query, it would look like:

SELECT region, COUNT(*) FROM users GROUP BY region;

From there, you have an idea of where you need to serve data from (and which regions to invest most in).
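If you want to try the query before pointing it at production data, a small sketch like the following runs it against a throwaway in-memory SQLite table. The table layout, the region names, and the row counts are made up for the example; the added ORDER BY simply lists the largest regions first.

```python
import sqlite3


def users_per_region(conn):
    """Return (region, user_count) rows, largest region first."""
    query = (
        "SELECT region, COUNT(*) AS user_count "
        "FROM users GROUP BY region "
        "ORDER BY user_count DESC, region"
    )
    return conn.execute(query).fetchall()


# Throwaway in-memory table standing in for a real user database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, region TEXT)")
conn.executemany(
    "INSERT INTO users (region) VALUES (?)",
    [("us-east",), ("eu-west",), ("us-east",),
     ("ap-south",), ("us-east",), ("eu-west",)],
)

for region, user_count in users_per_region(conn):
    print(region, user_count)
```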

Understand Regulations

When analyzing your customer base, you might identify users in the EU or China whose experience you’re concerned with.

Before committing to deploy your app to these regions, it’s crucial to have a clear understanding of what deployments in these places entail (lest you fail to comply with the regulation and incur a crippling penalty).

While we’re not lawyers, the general guidance here is that you must domicile users’ data from these regions in these regions. For example, Chinese users’ sensitive data must be kept in China. When dealing with the EU, the regulations can become even more stringent because you also cannot necessarily process user data outside of the EU without a user’s explicit consent.

If you did your due diligence in assessing how your database works when it’s scaled across regions, you should have a clear sense of your technological capabilities to work within a region.

Performance Strategy

Every undertaking benefits from metrics to understand its success, and setting those out beforehand will help you and your team substantiate the impact of multi-region deployments on your application.

Multi-region deployment will provide you two major upsides: speed and availability. To make sure you can substantiate your work’s impact, it’s important to develop clear strategies and benchmarks for both.

Latency Goals & Strategy

By moving data close to users, you’re removing network latency between them and your application. You can do some very rough back-of-the-envelope calculations by standing up a VM in a region you’re considering a deployment to and pinging it.

You will be able to shave approximately that much time off of the requests of users that are near the zone in the region where you deploy. This is a very rough approximation, though, and is useful only in establishing a notion of what's possible. Of course, your actual requests will still take a few milliseconds (2-10ms assuming high-speed internet is available) within the region.
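The back-of-the-envelope estimate can be written down as a few lines of arithmetic. This sketch assumes you have already collected some ping round-trip times by hand; the sample values are invented, and the 10 ms in-region default is just the upper end of the 2-10ms range mentioned above.

```python
from statistics import median


def estimated_savings_ms(ping_samples_ms, in_region_rtt_ms=10.0):
    """Rough per-round-trip saving from serving users inside their own region.

    ping_samples_ms: round-trip times measured from near your users to a VM
    in the region they are served from today.
    """
    if not ping_samples_ms:
        raise ValueError("need at least one ping sample")
    if in_region_rtt_ms < 0 or min(ping_samples_ms) < 0:
        raise ValueError("round-trip times must be non-negative")
    # The median is less sensitive to one slow ping than the mean.
    cross_region_rtt_ms = median(ping_samples_ms)
    return max(cross_region_rtt_ms - in_region_rtt_ms, 0.0)


# Hypothetical samples, in milliseconds.
print(estimated_savings_ms([138.0, 142.0, 140.0]))
```

With these made-up samples the median round trip is 140 ms, so the estimate is a saving of about 130 ms per round trip. Treat that as an upper bound on what is possible, not a forecast.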

With an understanding of the performance gains you can make, you should examine the services you’re connecting and determine the largest area of impact you can make by deploying those across regions. It’s also important to note that services that you don’t deploy to a region will begin incurring latency equal to the gains you’re making elsewhere; that doesn’t mean you need to replicate every service in every region, but it is an important factor in determining your SLAs.

Another factor to consider is the trade-off between consistency and latency. If consistency doesn’t matter and your application can tolerate potentially losing data, you can often speed up the request by making it asynchronous. If that’s not the case, synchronous requests can be somewhat slower, but can provide ACID (or ACID-like) guarantees.

Availability Goals & Strategy

By distributing a service among two machines with 90% availability (which is pretty lousy, equaling 3 days of downtime per month), you can achieve 99% availability, also known as “two nines.”

If you’re dealing with more robust services, it’s easy to start achieving three or four nines. These calculations, though, assume that the services’ availability is independent of one another. If they’re in the same datacenter and the entire datacenter goes down, so does your application.

To ensure availability, however, you also have to account for things well beyond your control, which might very well be why you’re considering a multi-region deployment in the first place. So it’s important to have a contingency plan in place assuming the entire region will go offline. How to accomplish this and what it means largely depends on the database you’re using and the level of consistency your application requires.
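The arithmetic behind these numbers is short enough to sketch. The function names are ours, and the calculation relies on the independence assumption: the service is down only when every replica is down at the same moment.

```python
def combined_availability(availabilities):
    """Availability of redundant replicas, assuming independent failures."""
    all_down = 1.0
    for a in availabilities:
        if not 0.0 <= a <= 1.0:
            raise ValueError("availability must be between 0 and 1")
        # Probability that this replica is down along with all earlier ones.
        all_down *= 1.0 - a
    return 1.0 - all_down


def downtime_days(availability, period_days=30):
    """Expected downtime over a period, in days."""
    return (1.0 - availability) * period_days


print(round(downtime_days(0.90), 2))                  # one 90% machine, per month
print(round(combined_availability([0.90, 0.90]), 4))  # two such machines
print(round(combined_availability([0.99, 0.99]), 4))  # two 99% machines
```

Two 90% machines come out at roughly 99%, and two 99% machines at roughly 99.99%, which is where the "three or four nines" come from. If the replicas share a failure domain, the true figure is lower than this formula suggests.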

Because CockroachDB automatically repairs and rebalances itself, you don’t need special strategies in place to handle failovers. By simply balancing load to other nodes, your application will continue serving requests.

Low Hanging Fruit: Keep Your Deployment Up

To ensure your services remain available, you also need to account for trivial and common problems: tedious things like VMs going down.

Fortunately, these are relatively simple problems to solve by leveraging configuration management tools like Chef, Puppet, or Ansible. Using these (coupled with monitoring), you can automatically spin up new replicas. With a distributed database, this can dramatically improve a service's uptime (assuming the provisioning process is tuned well).

...and what else?

Of course, there are many more considerations when you begin the work of scaling your database. To get more detail, including potential architectures for most common distributed databases, download our guide Scale Databases with Multi-Region Deployments.

Illustration by Rebekka Dunlap