Ever been cruising along on some work that’s going really well when your laptop suddenly bricks? Or maybe you’re about to join an important meeting when Zoom announces it has to update right now and takes itself offline to install and restart?
We’ve all been there. Which means we have all experienced firsthand the disruption, the frustration, and occasionally the keyboard-pounding rage of downtime. Now imagine multiplying that pain 100,000 times over, up to full-scale organization level, where downtime happens for everyone, all at the same time. We’ve all seen the headlines: “Airline Cancels Thousands of Flights Due to Network Outages” or “Cloud Region Failure Makes Retailer Websites Go Dark on Black Friday, Busiest Shopping Day of the Year.”
The technical definition of downtime is “a period of time when technology services are unavailable to users.” This elegant simplicity, however, completely misses both the potentially serious business impacts and the deeply human pain that results whenever downtime disrupts work: the online sports betting service whose customers simply switch their wagers to a competing platform (and then never come back), the business traveler parent who misses their kid’s birthday party because their flight got canceled, or the small artisan counting on Black Friday sales to take their business into the black.
There are two types of downtime: planned and unplanned. The ultimate outcome and experience are the same, and they both come with costs. It’s important to understand the difference, though, since one kind of downtime is manageable, even avoidable, and the other is not.
Planned downtime happens on purpose, to implement updates, upgrades, and configuration changes: when your software suddenly takes itself away for that Zoom update, for example, or when you decide to take your laptop in for service because the cooling fan keeps randomly firing up like a typhoon over Tuvalu. The purpose of planned downtime is actually to prevent unplanned downtime, using preventive maintenance to keep your machines and applications at optimal functionality. Planned downtime is typically scheduled for off-peak times to minimize disruption.
Unplanned downtime, on the other hand, is unexpected. It can strike at any time due to an endless variety of disasters, from cloud provider outages to climate change to fat-fingered YAML files, and it does not care about your company schedule.
Both flavors of downtime, planned and unplanned, are common across every kind of organization in every type of sector. Both are disruptive and carry costs. Both lead to a loss in revenue and productivity, and both can potentially trash your business reputation. No matter what the cause, downtime makes a lot of people, users and IT teams alike, very unhappy.
Any kind of downtime, then, has significant potential impact on a business, including both direct and indirect costs. Whether planned or unplanned, the costs of downtime may include:
The shockwave of downtime can also ripple out across an entire organization. As companies increasingly use their IT stacks to knit together their operations, system downtime can now hamper the productivity of almost everyone in the organization and even completely sideline some teams.
The financial cost of downtime is where the differences between planned and unplanned outages begin to emerge: Unplanned outages tend to cost companies significantly more than planned downtime for maintenance and updates.
This makes sense, because organizations are often unprepared for unplanned downtime. When it strikes, the reaction and recovery time required to come back from an outage represents a loss of productivity and money on top of the other potential downtime costs listed above. On average, studies show, unplanned downtime costs 35% more per minute than planned downtime.
The direct costs of unplanned downtime can be staggering: an IBM Global Services study puts the average revenue cost of an unplanned application outage at over $400,000 per hour for large enterprises in any sector. And these outages add up. The same study shows one in three organizations experiences unplanned downtime on a monthly basis. Similar research from Dun & Bradstreet shows that 59% of Fortune 500 companies experience a minimum of 1.6 hours of unplanned downtime per week. That’s a lot of productivity lost, and a lot of money down the drain.
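To get a feel for how quickly those numbers compound, here is a back-of-the-envelope sketch in Python. It simply multiplies the two figures quoted above. Note the assumptions: the hourly cost and the weekly downtime come from different studies, the $400,000 figure is treated as an exact value rather than a floor, and the function and constant names are made up for this illustration. Treat the result as an order-of-magnitude estimate, not a finding from either study.

```python
# Illustrative arithmetic only: combines two figures quoted in the text
# that come from different studies.
HOURLY_COST_USD = 400_000      # quoted per-hour cost, treated here as exact
WEEKLY_DOWNTIME_HOURS = 1.6    # quoted weekly minimum of unplanned downtime
WEEKS_PER_YEAR = 52


def annual_downtime_cost(hourly_cost, weekly_hours, weeks=WEEKS_PER_YEAR):
    """Return the yearly cost of downtime that recurs every week."""
    return hourly_cost * weekly_hours * weeks


annual_hours = WEEKLY_DOWNTIME_HOURS * WEEKS_PER_YEAR
annual_cost = annual_downtime_cost(HOURLY_COST_USD, WEEKLY_DOWNTIME_HOURS)

print(f"{annual_hours:.1f} hours per year, about ${annual_cost:,.0f}")
```

Under these assumptions, 1.6 hours a week works out to roughly 83 hours of downtime and about $33 million a year.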
Unlike unplanned downtime, tech teams can schedule, monitor, and control planned downtime. It’s almost always scheduled for off-peak times like weekends or holidays to minimize the impact. Also, when the service interruption is expected, additional options are available to further minimize the impact, like implementing temporary workarounds or deferring tasks that are not time-sensitive. Because businesses have this level of control, the common belief is that planned downtime doesn’t result in significant business impact. But, in fact, planned downtime also can incur significant if under-recognized financial costs.
The IBM study quoted above estimates that, on average, the costs of planned downtime for enterprise organizations add up to $5.6 million per year. But maintenance must be done: avoiding scheduled downtime that handles security updates, bug patches, and version upgrades results in a greater risk of unplanned downtime.
This is likely why many organizations view planned downtime as simply inevitable, a cost of doing business to be factored into any application production schedule. This, however, is mistaken thinking because planned downtime is not inevitable.
Regular scheduled maintenance is inevitable. Planned downtime to implement these updates does not have to be.
Updates and patching for applications and their underlying infrastructure are essential practices, since they help minimize the risk of some types of unplanned downtime (like outages due to outdated components, software bugs, or security vulnerabilities). Choosing a resilient application architecture makes it possible to do all of this during weekday business hours and with zero downtime.
Each architecture design will be different, according to the organization’s business needs and the application itself. However, each component of a resilient stack can, and should, be evaluated on its ability to allow live updates and changes without ever taking the application offline. The database in particular needs to have this live update capability, both because of the risk of data loss or corruption that comes with taking a database offline and because changing the database schema is a frequent occurrence for many applications.
Unfortunately, most traditional relational databases such as MySQL require developers to lock tables during schema changes, effectively taking the database offline. Some relational databases can handle some live schema changes – Postgres, for example, offers more live update functionality than some others – but this functionality is still pretty limited.
The reality is that, for most relational databases, the only way to update the schema without downtime is by creating an exact duplicate or replica of the data and then operating both instances in parallel. This allows for updating the schema in the replica, gradually migrating your application over to it, and then killing off the original. But doubling your database, even temporarily, also means doubling your costs just to prevent downtime during schema changes.
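The copy-then-cut-over approach described above can be sketched in miniature. The example below is only an illustration under loud assumptions: it uses Python’s built-in `sqlite3` module with a single in-memory database standing in for two separate instances, the `users` table and `email` column are hypothetical, and it ignores the hard part of a real migration, which is keeping the copy in sync with writes that arrive while the copy is running.

```python
import sqlite3


def migrate_with_shadow_copy(conn):
    """Copy `users` into a table with the new schema, then swap the names."""
    cur = conn.cursor()
    # 1. Build the duplicate with the new schema (adds an `email` column).
    cur.execute(
        "CREATE TABLE users_new ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT)"
    )
    # 2. Copy the existing rows; the new column starts out NULL.
    #    From here until step 3, the data is stored twice.
    cur.execute("INSERT INTO users_new (id, name) SELECT id, name FROM users")
    # 3. Cut over: retire the original and promote the copy.
    cur.execute("ALTER TABLE users RENAME TO users_old")
    cur.execute("ALTER TABLE users_new RENAME TO users")
    cur.execute("DROP TABLE users_old")
    conn.commit()


# Hypothetical starting point: a two-column table with a couple of rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("ada",), ("grace",)])
conn.commit()

migrate_with_shadow_copy(conn)

columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
rows = conn.execute("SELECT id, name, email FROM users ORDER BY id").fetchall()
print(columns)
print(rows)
```

Even in this toy version, the cost problem is visible: between the copy and the cutover, every row exists twice.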
Schema updates on a legacy RDBMS get even more complicated when the database has been sharded to scale it horizontally. At that point it has effectively become an ad hoc distributed relational database, and its inherent complexity makes it difficult to apply schema changes while keeping data consistent, much less to do so live and online.
Fortunately, a handful of cloud native relational databases now allow for truly live and online updates — meaning zero planned downtime. CockroachDB handles schema changes in progressive states, keeping the cluster free from table locks by ensuring applications only see the old schema version until the new one is fully deployed. Schema changes run as a scheduled background job while the database remains fully active and can service application requests without interruption.
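To make the idea of “progressive states” more concrete, here is a toy in-memory model in Python. It is not CockroachDB’s implementation: the `Table` class, the state names, and the backfill step are assumptions invented for this sketch. It shows only the general principle that a new column can be phased in while reads and writes keep flowing, with readers seeing the old schema until the change is complete.

```python
from enum import Enum


class ColumnState(Enum):
    # State names are labels for this sketch, not a real database's internals.
    DELETE_ONLY = 1   # column is known, but inserts do not populate it yet
    WRITE_ONLY = 2    # inserts populate it; reads still ignore it
    PUBLIC = 3        # column is fully visible


class Table:
    def __init__(self, columns):
        self.columns = list(columns)   # columns visible to readers
        self.pending = {}              # column name -> (state, default)
        self.rows = []

    def insert(self, **values):
        row = {c: values.get(c) for c in self.columns}
        for name, (state, default) in self.pending.items():
            if state is ColumnState.WRITE_ONLY:
                row[name] = values.get(name, default)
        self.rows.append(row)

    def select(self):
        # Readers only ever see public columns.
        return [{c: r.get(c) for c in self.columns} for r in self.rows]

    def add_column_online(self, name, default=None):
        """Advance one state per next() call so traffic can interleave."""
        self.pending[name] = (ColumnState.DELETE_ONLY, default)
        yield ColumnState.DELETE_ONLY
        self.pending[name] = (ColumnState.WRITE_ONLY, default)
        yield ColumnState.WRITE_ONLY
        for row in self.rows:          # backfill rows written earlier
            row.setdefault(name, default)
        del self.pending[name]
        self.columns.append(name)
        yield ColumnState.PUBLIC


table = Table(["id", "name"])
table.insert(id=1, name="ada")

change = table.add_column_online("email", default="unknown")
next(change)                                   # DELETE_ONLY
table.insert(id=2, name="grace")
next(change)                                   # WRITE_ONLY
table.insert(id=3, name="alan", email="alan@example.com")
seen_before_public = table.select()            # old schema only
next(change)                                   # backfill, then PUBLIC
seen_after = table.select()                    # new schema, all rows filled
```

In this sketch the table never stops accepting inserts or serving reads; `seen_before_public` contains only `id` and `name`, while `seen_after` includes `email` for every row, including the ones written before the change began.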
This capability for updating a CockroachDB cluster schema fully live and online, even for massive production databases, is a built-in part of CockroachDB’s inherent architecture. Check out this demonstration of a live database schema update to see it in action.
Downtime, whether planned or not, is fundamentally challenging because of the costs and the real-world problems it causes. Be it email, chat, payment processing, or even customer support, downtime brings business to a grinding halt. Revenue is lost due to inaction or delay. Fraud becomes possible when critical data is unavailable in the moments that matter. Unplanned downtime can cause data losses and corruption that require an enormous amount of time, effort, and resources spent in data recovery. Finally, let’s not forget the legal and regulatory ramifications of mission-critical services being even temporarily unavailable for customers, employees and other stakeholders.
Downtime is never desirable. But it is now possible to build an application on an architecture that greatly reduces the risk of unplanned downtime while making planned downtime a thing of the past.