Why database outages still happen: the limits of high availability

Disasters happen. (Most) outages shouldn’t.

When multi-billion dollar companies like Zoom, Slack, and FanDuel experience outages, the reaction from users tends to be anger and surprise. Anger, because people need to get their work done and set their fantasy football lineups. Surprise, because it’s 2022. Isn’t high availability the norm?

Not all outages are created equal. Though they have a similar impact on end users, they happen for all kinds of reasons. And some of these reasons are preventable.

While many outages look the same (applications down, murky company response about the cause), there’s a lot of nuance to what’s happening under the hood. If we take a closer look, we can see what is preventable today, and what (as an industry) we are still working on.

The evolution (and limits) of high availability

To understand why outages are still happening, it’s important to look at the history of high availability. Where a company sits on the evolutionary spectrum of highly available systems says a lot about its current capabilities and about the steps it can take to become more resilient.

Highly available systems used to mean machines with redundant, hot-swappable power supplies, disk drives, and even CPUs. Nowadays, we recognize that there are better pathways to high availability than making a single machine highly available. Instead, we make services highly available by running them on large clusters of machines, where any node in the cluster can fail without bringing down the service.
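As a rough back-of-the-envelope illustration (the 99% per-node figure below is an assumption for the example, not a vendor claim), suppose each machine is independently up 99% of the time and a three-node cluster keeps serving as long as a majority of its nodes are healthy. The difference in expected downtime is dramatic:

```go
// Back-of-the-envelope availability math. The 99% per-node figure is an
// assumption for illustration; the comparison is a single machine vs. a
// three-node cluster that needs a majority (two of three) to keep serving.
package main

import "fmt"

func main() {
	p := 0.99 // assumed availability of any one machine

	// A single machine is only as available as that machine.
	single := p

	// A three-node majority quorum is up if all three nodes are up, or if
	// exactly two are up (three ways to choose which node failed).
	cluster := p*p*p + 3*p*p*(1-p)

	hoursPerYear := 24.0 * 365
	fmt.Printf("single machine: %.4f available, ~%.1f hours down per year\n",
		single, (1-single)*hoursPerYear)
	fmt.Printf("3-node quorum:  %.6f available, ~%.1f hours down per year\n",
		cluster, (1-cluster)*hoursPerYear)
}
```

Real nodes don’t fail with perfect independence, of course, but this quorum math is the core reason clusters beat single hardened machines.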

This means that a disaster that affects one node–like a fire in the datacenter, or the proverbial backhoe–doesn’t cause the entire service to go dark. Resilient, distributed systems that are built to fail over in cloud-native ways are no longer prone to outages like this.

So why do even highly available systems at tech companies like Facebook and Google still fail? The nuance is that even with data storage systems that can survive datacenter failures, outages can still happen for other reasons. Take, for example, the network configuration error at Google that took down half the internet in Japan. Human error isn’t going to be eliminated by the next innovation in highly available services.

Outages caused by natural disasters, on the other hand, can be a thing of the past. Resilient systems that can handle a datacenter outage of this kind already exist, and are in place at tech-forward companies. The future is here. But as William Gibson said, “it’s just not very evenly distributed.”

Three ways to make data resilient to disaster

After Wells Fargo’s outage in 2019, the company spoke about routing traffic to backup datacenters, and about the failover datacenter itself failing. While we don’t know the details of their data storage configuration, that language is outdated, and it indicates that they weren’t using the cloud-native systems that provide the resiliency we see at contemporary tech companies (at this point I assume Wells Fargo has modernized their data storage configuration).

This is largely because the data solutions sold to older enterprise companies–like Oracle’s–aren’t cloud native, and therefore can’t provide the resiliency we’ve come to expect in 2022. Companies using this legacy tech are often locked into long contracts, or haven’t made the leap to the cloud because of the perceived cost of switching. In the meantime, they’re forced to focus on disaster recovery instead of disaster prevention. And they’ll stay that way until they start truly building for resiliency and get off legacy tech.

While some of the tech used at companies like Google and Netflix is internal to them, a lot of these solutions are available off the shelf. Here are three ways you can build for resiliency today:

1. Automate disaster testing

Simulating disasters (and recovering from them) shouldn’t be a manual process. Companies that build this into their everyday processes, like Netflix with its Simian Army, have a working model that makes emergency protocols not an edge case, but the norm.
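Here’s a minimal sketch of what an automated disaster test can look like. This is not Netflix’s actual tooling: the Docker container names and the health endpoint are hypothetical, and the point is the shape of the test, not the specifics.

```go
// A minimal chaos-test sketch: stop one node at random, verify the service
// keeps answering, then restore the node. Assumes a three-node cluster in
// Docker containers named node1..node3 behind a load balancer that exposes
// http://localhost:8080/health — all hypothetical names.
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"os/exec"
	"time"
)

func main() {
	nodes := []string{"node1", "node2", "node3"} // hypothetical container names
	victim := nodes[rand.Intn(len(nodes))]

	// Kill one node at random, the way a failed disk or a tripped breaker would.
	if err := exec.Command("docker", "stop", victim).Run(); err != nil {
		fmt.Println("could not stop node:", err)
		return
	}
	fmt.Println("stopped", victim)

	// Give the cluster a moment to notice, then check that the service is
	// still answering with one node down.
	time.Sleep(5 * time.Second)
	resp, err := http.Get("http://localhost:8080/health") // hypothetical endpoint
	if err != nil || resp.StatusCode != http.StatusOK {
		fmt.Println("FAIL: service did not keep serving after losing", victim)
	} else {
		resp.Body.Close()
		fmt.Println("OK: service survived losing", victim)
	}

	// Bring the node back so the next scheduled run starts from a healthy cluster.
	_ = exec.Command("docker", "start", victim).Run()
}
```

Run on a schedule against a staging environment (or, if you’re brave, production), a test like this turns node failure into a routine event rather than an emergency.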

2. Choose a self-healing database

Choosing a database that self-organizes, self-heals, and automatically rebalances is an important part of resilience. By making component failure an expected event that the system handles gracefully, you’ll be prepared when a datacenter goes down for whatever reason.
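From the application’s point of view, a self-healing cluster means a node failure surfaces as a brief, retryable error rather than hard downtime. The sketch below shows that client-side posture; the DSN, the accounts table, and the lib/pq driver choice are illustrative assumptions, not a prescribed setup.

```go
// A client-side sketch of treating node failure as a retryable event rather
// than an outage. The connection string and the "accounts" table are
// hypothetical; the retry pattern is the point.
package main

import (
	"database/sql"
	"fmt"
	"time"

	_ "github.com/lib/pq" // Postgres-wire driver; registers the "postgres" name
)

// withRetry runs op up to attempts times with a small backoff, on the
// assumption that a node failure shows up as a transient error while the
// cluster heals and rebalances underneath the application.
func withRetry(attempts int, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		time.Sleep(time.Duration(i+1) * 500 * time.Millisecond)
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

func main() {
	// A load-balanced address in front of the cluster, so the connection
	// itself has no single point of failure (hypothetical DSN).
	db, err := sql.Open("postgres", "postgresql://app@db-lb:26257/bank?sslmode=verify-full")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	err = withRetry(3, func() error {
		_, execErr := db.Exec("UPDATE accounts SET balance = balance - 10 WHERE id = 1")
		return execErr
	})
	fmt.Println("write result:", err)
}
```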

3. Replicate to have redundancy built into the system

To survive disasters, you need to replicate your data and build redundancy into the system so there is no single point of failure. Prepare for disaster by having data transactionally replicated to multiple datacenters as part of normal operation.
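To make the “a minority of replicas can fail” idea concrete, here is a toy model (purely illustrative, with made-up replica placements) in which a write commits only once a majority of replicas acknowledges it, so losing any single datacenter costs neither committed data nor the ability to accept new writes.

```go
// A toy illustration of quorum replication across datacenters: a write
// commits only if a majority of replicas acknowledge it, so any minority
// of replicas can fail without losing committed data or availability.
// Replica placements and the failure scenario are made up for the example.
package main

import "fmt"

type replica struct {
	datacenter string
	up         bool
	data       map[string]string
}

// write stores a key/value pair on every reachable replica and reports
// whether a majority acknowledged it (i.e. whether the write committed).
func write(replicas []*replica, key, value string) bool {
	acks := 0
	for _, r := range replicas {
		if r.up {
			r.data[key] = value
			acks++
		}
	}
	return acks > len(replicas)/2 // quorum: more than half must acknowledge
}

// read returns the value from any surviving replica that has it.
func read(replicas []*replica, key string) (string, bool) {
	for _, r := range replicas {
		if r.up {
			if v, ok := r.data[key]; ok {
				return v, true
			}
		}
	}
	return "", false
}

func main() {
	replicas := []*replica{
		{datacenter: "us-east", up: true, data: map[string]string{}},
		{datacenter: "us-west", up: true, data: map[string]string{}},
		{datacenter: "eu-west", up: true, data: map[string]string{}},
	}

	fmt.Println("committed:", write(replicas, "balance:1", "90")) // 3/3 acks

	// The proverbial backhoe takes out one datacenter.
	replicas[0].up = false

	v, ok := read(replicas, "balance:1")
	fmt.Println("after losing us-east, read:", v, ok) // committed data survives

	fmt.Println("new writes still commit:", write(replicas, "balance:2", "250")) // 2/3 acks
}
```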

The path towards resiliency already exists. But a lot of organizations need an initial push to get off the legacy tech they use and onto the cloud-native solutions they need.

If you want to learn more about surviving database disasters and the differences between legacy database failover and distributed database failover, watch this video.

About the author

Peter Mattis

Peter is the co-founder and CTO of Cockroach Labs where he works on a bit of everything, from low-level optimization of code to refining the overall design. He has worked on distributed systems for most of his career, designing and implementing the original Gmail backend search and storage system at Google and designing and implementing Colossus, the successor to Google's original distributed file system. In his university days, he was one of the original authors of the GIMP and is still amazed when people tell him they use it frequently.
