Imagine this: you work in system architecture for a multibillion-dollar consumer-facing business, and it’s the middle of a busy weekend.
Suddenly, your database goes down.
Transactions aren’t processing. Customers are angry. Logistical issues are stacking up because inventory and warehouse tracking are down. Customer service reps are swamped, but half of the technical staff are twiddling their thumbs because they can’t do their work without a functional database.
And on top of it all, the burning question: what are we losing with every second this issue doesn’t get fixed? Broken transactions mean lost revenue, angry customers mean lost reputation, and blocked technical staff means lost time.
Outages can also result in something even more costly: lost data. When the database does come back online, how much will be missing? Gaps in data can impact the value of analytics and influence strategic decisions for years.
That’s a nightmare most people don’t even want to think about. But it happens. In fact, it happened twice to one major North American retailer, costing them millions.
Although companies don’t like to talk about them, database outages happen more often than you might think. Just last year, an outage in the AWS US-East-1 region knocked thousands of services offline for hours in the run-up to the busiest shopping weekend of the year.
And of course, those are just the unplanned outages. Many database systems will require you to shut down briefly to do things such as upgrade the database software or modify your schema. This downtime can be scheduled to minimize its business impact (often in the middle of the night) but it still comes with cost and risk. If something goes wrong – if there’s an unexpected bug in the software upgrade, for example – short planned outages quickly become lengthy, unplanned, costly ones.
But outages don’t have to be inevitable!
Let’s look at some real examples of database outages at three companies that were collectively responsible for more than $250 billion in revenue last year. We’ll dig into what went wrong, and how they ultimately solved their problems by building more resilient, high availability systems.
All three of our companies – we’re keeping it anonymous to avoid causing embarrassment – suffered database outages that caused real damage.
Company 1, a major North American retailer with thousands of locations across the US, Canada, and Mexico, was using Google Cloud Spanner to manage one of their major transaction databases. Being tied to a single cloud provider (GCP) left them exposed to that provider’s service outages, and they experienced two of them – both on Saturdays, their busiest shopping days.
Although both outages were resolved within hours, when you’re operating at scale, it doesn’t take long to do damage. Company 1 estimates that the two outages cost them millions of dollars. Concerned that outages could keep occurring if they stuck with their existing system, they began looking for something better – a cloud-neutral database system that would keep running even if one of their cloud providers went down.
Company 2, a popular delivery app worth nearly US$50 billion, had a similar experience. Like many delivery companies, they saw an unexpected spike in business with the onset of the COVID-19 pandemic. Their Amazon Aurora system struggled to scale quickly enough to meet rising demand, resulting in repeated outages, including one that spanned several hours on a Saturday.
These outages were expensive – the company estimates they cost several hundred thousand dollars per hour in lost revenue alone. But of course, the true cost of each outage was much higher. Fixing outages costs engineering time, blocks employees who can’t work effectively without the database, and damages the brand’s reputation with customers.
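To make that arithmetic concrete, here is a minimal Python sketch of a back-of-the-envelope outage cost estimate. The function name and every figure in it are hypothetical placeholders, not numbers reported by Company 2. It counts only lost revenue and the paid time of blocked staff, and it leaves out harder-to-measure costs such as reputational damage.

```python
def estimate_outage_cost(hours_down, revenue_per_hour, blocked_staff, staff_cost_per_hour):
    """Rough direct cost of an outage: lost revenue plus paid-but-blocked staff time."""
    lost_revenue = hours_down * revenue_per_hour
    lost_staff_time = hours_down * blocked_staff * staff_cost_per_hour
    return lost_revenue + lost_staff_time


# Hypothetical inputs, chosen only to illustrate the calculation.
cost = estimate_outage_cost(
    hours_down=3,
    revenue_per_hour=300_000,
    blocked_staff=200,
    staff_cost_per_hour=75,
)
print(f"Estimated direct cost: ${cost:,.0f}")
```

Even with modest assumptions for staff cost, the revenue term dominates, which is why every extra hour of downtime matters so much at this scale.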
Company 2 knew it needed to find a database system that could scale to meet fluctuating demand without the risk of demand spikes or cloud outages knocking its app offline.
Company 3, a major telecom provider worth more than $250 billion, also experienced a damaging outage. They had a customer service chatbot built on Amazon Aurora, and a loss of connectivity to their AWS region knocked the entire app offline, infuriating already-frustrated customers who had come to them for help.
Even before the outage, Company 3’s engineers had not been thrilled by the performance of their chatbot. Because their database system could not keep each user’s data close to that user, the app often took 5-10 seconds to load as it struggled to pull data from distant servers.
After the outage, Company 3 resolved to replace their Aurora system with a multi-cloud setup that wouldn’t force them to rely on a single cloud provider. They also wanted a system that could offer much faster local read performance to improve the app’s user experience.
All three companies ultimately found a solution to their outages: CockroachDB. CockroachDB helps them avoid the hassle of planned outages because its distributed nature means they can upgrade the database software, modify their schema, and perform similar maintenance without having to take their databases offline.
It has also helped all three achieve high availability and avoid costlier, scarier unplanned outages.
Company 1 is now running their logistics, online, and in-store payment databases on CockroachDB. This allows them to take advantage of CockroachDB’s ability to survive node, region, and even cloud provider failures. CockroachDB’s built-in support for multi-cloud means that they’re no longer reliant on a single company’s cloud infrastructure, so a GCP service outage can no longer knock their system offline.
Company 2 is migrating dozens of its applications to CockroachDB, making it far more resilient to fast spikes in demand thanks to CockroachDB’s automated elastic scaling. What’s more, it has been able to upgrade its database to solve its outage problem without having to completely overhaul its tech stack or abandon Postgres, because CockroachDB supports the PostgreSQL wire protocol and most Postgres syntax.
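In practice, wire-protocol compatibility means an application can usually keep its existing Postgres driver and change little more than its connection string. Here is a minimal sketch in Python, assuming a hypothetical host, database, and user. The port 26257 is CockroachDB’s default SQL port, and the commented-out call shows where an existing driver such as psycopg2 would take over.

```python
from urllib.parse import quote, urlencode


def build_postgres_url(host, database, user, port=26257, sslmode="verify-full"):
    """Build a standard postgresql:// URL; 26257 is CockroachDB's default SQL port."""
    query = urlencode({"sslmode": sslmode})
    return (
        f"postgresql://{quote(user, safe='')}@{host}:{port}/"
        f"{quote(database, safe='')}?{query}"
    )


# Hypothetical connection details, for illustration only.
url = build_postgres_url("db.example.internal", "orders", "app_user")
print(url)

# With a Postgres driver installed, the same URL is passed to it unchanged,
# e.g. (assumes the third-party psycopg2 package and a reachable cluster):
# import psycopg2
# conn = psycopg2.connect(url)
```

Compatibility covers most, not all, Postgres syntax, so it is still worth testing an application’s queries before migrating.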
Company 3 switched their app to operate using a hybrid cloud setup powered by CockroachDB, and it has completely solved their resiliency issues, allowing them to survive total region failures like the one that knocked their app offline. CockroachDB’s support for geo-partitioning also allowed them to store replicas of each user’s data in the data center closest to that user, so users now experience much faster app load times.
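A setup like this is typically expressed through CockroachDB’s multi-region SQL statements. The Python sketch below only generates those statements as strings. The database, table, and region names are hypothetical, this is not Company 3’s actual configuration, and the exact syntax should be checked against the documentation for your CockroachDB version.

```python
def multi_region_statements(database, table, primary_region, other_regions):
    """Return SQL that makes a database multi-region and homes each row near its user."""
    regions = [primary_region, *other_regions]
    # Surviving the loss of a whole region requires at least three regions.
    if len(set(regions)) < 3:
        raise ValueError("surviving a region failure needs at least three regions")
    statements = [f'ALTER DATABASE {database} SET PRIMARY REGION "{primary_region}";']
    statements += [f'ALTER DATABASE {database} ADD REGION "{r}";' for r in other_regions]
    # Keep the database available even if one entire region goes down.
    statements.append(f"ALTER DATABASE {database} SURVIVE REGION FAILURE;")
    # Store each row in the region it belongs to, for fast local reads.
    statements.append(f"ALTER TABLE {database}.{table} SET LOCALITY REGIONAL BY ROW;")
    return statements


# Hypothetical names, for illustration only.
for stmt in multi_region_statements(
    "support", "conversations", "us-east1", ["us-west1", "europe-west1"]
):
    print(stmt)
```

The region-failure survival goal and the row-level locality setting correspond to the two benefits described above: staying online through a regional outage and serving reads from a nearby data center.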
Company 3 was so pleased with having a highly available, outage-resistant app and an improved user experience that they have begun to replace some of their other mission-critical databases with CockroachDB, too.
The lesson from these companies is clear: no server is immortal. Even the major cloud providers have outages.
The only way to stay online is to invest in a resilient, distributed, cloud-neutral system like CockroachDB so that you can survive node, region, or even cloud provider outages without losing your revenue, your customers, or your data.