Getting to zero downtime: what's at the base of your app stack?

Reliability is your most important feature.

It’s easy to get bogged down in the minutiae of planning, building, and shipping new product features. But at the end of the day, if your application isn’t available (or customers are turned away by a sluggish experience), none of those features matter.

So how do you make sure that your application is always available and performant?

Because modern applications tend to be stateless, they are routinely load-balanced and scaled horizontally. If your application is built that way, why isn’t your database horizontally scaled and distributed too?

What is a zero downtime strategy?

A zero downtime strategy is a collection of the infrastructure, architecture, and design decisions organizations make to ensure that their applications remain online and performant.

Note: the goal of zero downtime should be considered an aspirational “north star” rather than a literal goal. It’s not possible to build a system that will remain online through any possible contingency. It is possible to build systems that can get pretty close, with four- or five-nines availability - or even better - with a bit of planning.
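To make "four- or five-nines" concrete, here is a minimal sketch that converts an availability target into a yearly downtime budget. The function name and the 365-day year are assumptions made for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year (no leap day)


def downtime_budget_seconds(nines: int, period_seconds: int = SECONDS_PER_YEAR) -> float:
    """Allowed downtime for an availability of N nines (e.g. 4 means 99.99%)."""
    unavailability = 10 ** -nines  # 4 nines -> 0.0001
    return period_seconds * unavailability


for nines in (3, 4, 5):
    minutes = downtime_budget_seconds(nines) / 60
    print(f"{nines} nines: about {minutes:.1f} minutes of downtime per year")
```

Each extra nine shrinks the budget by a factor of ten, which is why the jump from four to five nines usually requires architectural changes rather than faster incident response.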

Outages are common, costly, and (often) preventable

Outages happen a lot more than you might think, even for cloud-native companies. In 2022, for example, more than 60% of companies that use public clouds reported losses due to cloud outages. And those costs are significant: some estimates peg the average cost of downtime at $365,000 per hour.

Large enterprises can expect to lose significantly more than that if critical services go offline – and that’s not including the hidden costs of damage to their reputation, which can be significant. Users who even hear about a bad experience with your app may decide it’s safer to try a competitor next time.

Cloud outages can be longer than people expect, too. In 2021, nearly 30% of cloud outages caused service downtime of more than 24 hours. Per the average cited above, that unlucky 30% can expect their costs for a single outage to approach $10 million (and very possibly exceed it, if they’re a larger company or if the outage persists beyond the 24-hour mark).
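The arithmetic behind that estimate is simple enough to sketch. The hourly figure is the average quoted above; the function name is hypothetical, and real costs vary widely by company.

```python
AVERAGE_COST_PER_HOUR_USD = 365_000  # average estimate quoted in the text


def outage_cost_usd(duration_hours: float,
                    cost_per_hour: float = AVERAGE_COST_PER_HOUR_USD) -> float:
    """Linear cost model: every hour of downtime costs the same amount."""
    return duration_hours * cost_per_hour


# A 24-hour outage at the average hourly cost.
print(f"${outage_cost_usd(24):,.0f}")
```

A linear model is a simplification: reputational damage and churn tend to grow faster than linearly as an outage drags on.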

Outages can be caused by a wide variety of factors – software bugs, human error, sprinkler systems in your cloud provider’s data center, etc. But in the cloud era, these kinds of events don’t have to cause any downtime for your application. You can enhance your application’s survival by setting it and its database up in a third or fourth availability zone.

The role of database software in achieving zero downtime

In the past, a single data center going offline was considered a Black Swan event. Back then, most applications weren’t capable of spanning multiple data centers, so the redundancy required for businesses was initially handled as a series of engineering problems and solved at the hardware and data center level (space, power, and cooling).

Now that software is eating the world and has taken on these physical redundancies, organizations have embraced diversification and spread risk across multiple data centers. For databases, the most common approach is to replicate data from a primary to a secondary location, mitigating the risk of a single data center outage by allowing for failover to the secondary.

But that active-passive approach has significant downsides. In 2023, we can do better than what was state-of-the-art at the turn of the century: more modern software and algorithms abstract away more of a database’s underlying components, which reduces risk and improves both reliability and cost.

Outages are an inevitability; a when, not if. Your zero downtime strategy has to be centered around building resilient systems that can survive the most common and preventable forms of outage, including at the database layer.

Important considerations for a zero downtime database

In broad strokes, a zero downtime strategy generally calls for embracing distributed, loosely-coupled, self-healing systems. Many elements of the modern stack have already gone this way via CDNs, distributed stateless applications, etc. But there’s one part of the stack that often remains stuck in the past: the database.

Let’s take a closer look at how we can reduce the chances of database downtime by embracing a distributed, loosely-coupled, self-healing approach.

Distributed

First, a zero downtime database is necessarily a distributed service. The reasons for this are fairly self-evident: a single-instance database is a single point of failure. If the machine running your single-instance database – whether it’s PostgreSQL, MySQL, Mongo, Oracle, etc. – gets knocked offline, every part of your application that relies on that database goes offline with it because the database is a piece of shared infrastructure for all of its up-stack dependencies.

The extent to which you distribute your database across fault domains will depend on the realistic uptime goals you’ve set, your other business needs, and your distributed SQL technology of choice. Does your application need to be able to survive node failure? AZ failure? Region failure? A whole-cloud outage? The answers to these questions will help determine the type of distributed deployment you need. (But if you don’t already know you need a multi-region or multi-cloud setup, distributing across AZs within a single cloud region will probably be sufficient.)

Note: Using and operating distributed systems is inherently complex. There are no shortcuts around the speed of light. But architects of distributed systems should strive to reduce complexity – a hidden organizational cost – to the greatest extent possible. Some organizations do elect to divert engineering resources to implement a bespoke sharding layer, which can allow them to emulate distributed system capabilities using legacy relational databases. But this requires writing, maintaining, and scaling what can quickly become a rat’s nest of code to route queries to the correct shards, all of which is likely orthogonal to the business’s goal of providing a richer feature experience to its users. Every hour a team spends writing and maintaining a sharding layer, rather than shipping code that advances the business, gives competitors an edge in the marketplace.
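To give a sense of what such a layer involves, here is a deliberately naive sketch of hash-based shard routing. The shard map, connection strings, and function names are all hypothetical, and a production layer would also need cross-shard queries, transactions, and resharding logic.

```python
import hashlib

# Hypothetical shard map: shard index -> connection string of a legacy database.
SHARDS = {
    0: "postgres://db-shard-0.internal/app",
    1: "postgres://db-shard-1.internal/app",
    2: "postgres://db-shard-2.internal/app",
    3: "postgres://db-shard-3.internal/app",
}


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index with a stable hash (modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def route(customer_id: str) -> str:
    """Return the connection string of the shard that owns this customer."""
    return SHARDS[shard_for(customer_id, len(SHARDS))]


# The hidden cost: adding a fifth shard remaps most keys, so data must move.
keys = [f"customer-{i}" for i in range(1000)]
moved = sum(shard_for(k, 4) != shard_for(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys change shards when growing from 4 to 5")
```

Even this toy version shows why the approach gets expensive: the routing function is the easy part, and everything around it (rebalancing, failover, schema changes across shards) becomes application code the team has to own.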

Migrating to a distributed SQL database handles all of that sharding and operational work automatically, which will save you time, money, and a lot of headaches in the long run because you end up maintaining a much less complex application stack. Would your organization rather invest in a bespoke sharding layer, or implement a database-as-a-service platform that has inherent self-healing and horizontal scaling characteristics?

Loosely coupled

Second, a zero-downtime database should be loosely coupled. Database nodes must remain in sync with each other, but a multi-node system with synchronous replication, such as PostgreSQL’s synchronous_commit=remote_apply, that requires all nodes to be online is actually less resilient than a single-instance database in many cases.

For example, consider that (as of this writing) AWS’s monthly SLA is 99.5% uptime for instances within a single region. So, a single-instance database running on AWS could expect a monthly uptime of 99.5%.

However, if you’ve got a three-node database cluster using synchronous replication, then any node going offline is going to knock the database offline. Because all three instances are subject to that same 99.5% SLA and all three must be up at once, your expected uptime (assuming the nodes fail independently) actually decreases to about 98.5% (0.995 * 0.995 * 0.995 ≈ 0.985), which is over 10 hours of downtime per month.
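The calculation can be sketched in a few lines, assuming node failures are independent and a 30-day month. The function name is illustrative.

```python
HOURS_PER_MONTH = 30 * 24  # assumes a 30-day month


def all_nodes_required_availability(node_availability: float, node_count: int) -> float:
    """Availability of a cluster that is only up when every node is up."""
    return node_availability ** node_count


cluster = all_nodes_required_availability(0.995, 3)
downtime_hours = (1 - cluster) * HOURS_PER_MONTH
print(f"cluster availability: {cluster:.4%}")
print(f"expected downtime:    {downtime_hours:.1f} hours per month")
```

Adding nodes to this kind of cluster makes things worse, not better: every additional node is one more thing that has to stay up.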

Asynchronous replication and semi-synchronous replication exist as alternatives, but each comes with its own problems, including some sneaky ways you can end up losing data without realizing that’s possible until it’s too late:

How data gets lost: synchronous vs. asynchronous vs. semi-synchronous vs. consensus-based replication

A better approach is consensus-based replication, a type of synchronous replication which allows for synchronous writes but (in the context of a three-node setup, i.e., RF=3) also allows for data to be committed to two nodes even if the third is offline, so your data remains available even in the event of a node outage. In clusters with a replication factor of five, RF=5, it’s possible to survive a double-node failure.
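The relationship between replication factor and tolerated failures follows from the majority rule. A minimal sketch, with illustrative function names:

```python
def quorum_size(replication_factor: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be offline while a majority is still reachable."""
    return replication_factor - quorum_size(replication_factor)


for rf in (3, 4, 5):
    print(f"RF={rf}: quorum={quorum_size(rf)}, survives {tolerated_failures(rf)} failure(s)")
```

Note that an even replication factor buys nothing extra: RF=4 still tolerates only one failure, which is why odd values such as 3 and 5 are the usual choices.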

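In this sort of loosely-coupled system, the operational math changes. Returning to our AWS example with a replication factor of three, because we can survive the failure of a single node, the database is unavailable only when two or more nodes are down at the same time. Assuming node failures are independent, the probability of that is (3 * 0.005 * 0.005 * 0.995) + (0.005 * 0.005 * 0.005) ≈ 0.0075%, so we can now expect roughly 99.99% uptime, or about three minutes of downtime per month instead of more than 10 hours.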
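The figure you get here depends on the failure model. The sketch below, which assumes independent node failures and the 99.5% per-node figure used above, compares two models: a majority quorum (at least two of three nodes up, which is what consensus needs) and the looser "at least one replica is up" model described by 1 - (0.005 * 0.005 * 0.005). The function names are illustrative.

```python
from math import comb

SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month


def majority_availability(node_availability: float, replication_factor: int) -> float:
    """Probability that a strict majority of replicas is up at the same time."""
    quorum = replication_factor // 2 + 1
    p, q = node_availability, 1 - node_availability
    return sum(
        comb(replication_factor, up) * p**up * q ** (replication_factor - up)
        for up in range(quorum, replication_factor + 1)
    )


def any_replica_up(node_availability: float, replication_factor: int) -> float:
    """Probability that at least one replica is up (not enough for consensus)."""
    return 1 - (1 - node_availability) ** replication_factor


quorum_model = majority_availability(0.995, 3)
print(f"majority of 3 up:   {quorum_model:.6%}")
print(f"downtime per month: {(1 - quorum_model) * SECONDS_PER_MONTH:.0f} seconds")
print(f"at least 1 of 3 up: {any_replica_up(0.995, 3):.7%}")
```

Real-world failures are often correlated (a shared AZ, a bad deploy), so results from an independence model like this are better read as upper bounds than as promises.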

Self-healing

Finally, a zero-downtime database should be self-healing. It isn’t possible to remove entropy from complicated distributed systems, but it is possible to adapt to ever-changing application, environmental, or other runtime constraints (e.g., planned and unplanned maintenance, DDL, software version upgrades, unexpected internal or external load).

This is important because, in a consensus-based replication system – such as that used by CockroachDB – a majority of nodes must agree before a transaction is committed. In a replication-factor three setup, once a node has gone offline (for any reason, including a temporary network hiccup), its data will also be offline, and any additional node failures could result in a situation where there aren’t enough replicas of the data left online to achieve consensus, at which point the database becomes unavailable.

Self-healing in the database remediates this risk by automatically redistributing data. The extent to which that’s possible will depend on the specifics of your setup – node count, replication factor, etc. – but CockroachDB, for example, will automatically up-replicate data from unavailable nodes to restore your desired replication factor when possible, and automatically rebalance data when dead nodes come back online (or new nodes are added to a cluster).
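The idea of up-replication can be illustrated with a toy model. This is not CockroachDB’s actual algorithm; the data structures, names, and the "least-loaded live node" placement rule are assumptions made purely for illustration.

```python
from collections import Counter


def up_replicate(ranges: dict, live_nodes: set, replication_factor: int) -> dict:
    """Toy healing pass: replace replicas lost on dead nodes with new replicas
    on the least-loaded live nodes, wherever a spare live node exists."""
    # Count how many replicas each live node currently holds.
    load = Counter(n for replicas in ranges.values() for n in replicas if n in live_nodes)
    healed = {}
    for range_id, replicas in ranges.items():
        survivors = {n for n in replicas if n in live_nodes}
        # Live nodes that do not yet hold this range, least-loaded first.
        candidates = sorted(live_nodes - survivors, key=lambda n: (load[n], n))
        while len(survivors) < replication_factor and candidates:
            target = candidates.pop(0)
            survivors.add(target)
            load[target] += 1
        healed[range_id] = survivors
    return healed


# Four nodes, RF=3; node n3 goes offline.
placement = {
    "r1": {"n1", "n2", "n3"},
    "r2": {"n1", "n2", "n4"},
    "r3": {"n2", "n3", "n4"},
}
healed = up_replicate(placement, live_nodes={"n1", "n2", "n4"}, replication_factor=3)
for range_id, replicas in healed.items():
    print(range_id, sorted(replicas))
```

With only three nodes and RF=3 there would be no spare node to heal onto, so ranges would stay under-replicated (but still available) until the dead node returns or a new one joins. That is one reason node count matters as much as replication factor.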

Of course, this article is just a simple overview of a rather complex topic. Want to dig deeper? Check out our recent webinar on this topic, hosted by Sean Chittenden, Engineer at Cockroach Labs. Where are you on your zero-downtime journey? How are you helping the rest of your organization improve its resiliency?