What is operational resilience and how to achieve it

Major cloud platform outages used to be rare events. However, as the amount of global data increases exponentially (90% of the world’s data was generated in the last two years alone!), significant outages are becoming increasingly common.

The potential user impact of cloud service provider (CSP) outages hits both deep and wide. For example, GCP’s europe-west9 full-region outage in April 2023 and AWS’s us-east-1 outage in June 2023 each temporarily disrupted operations for the businesses, schools, hospitals, and even government agencies relying on their services in those regions. And these are but two among many recent CSP downtime events: Data shows a steady rise in total observed global network outages in 2023 so far.

The growing frequency of outages like these keeps the world’s government officials up at night. One of their deepest concerns is for the potentially catastrophic impact a major CSP failure could have on financial institutions — and the very real-world damage this could do to their economies.

Concern is increasingly turning to action as different countries propose technical requirements aimed at ensuring operational resilience for their financial institutions (as well as other critical services like utilities, transportation, and healthcare). These technical requirements are already becoming formal government regulations in some countries, with many more on the horizon. What does this mean for all organizations right now, and what can they do to prepare?

What is operational resilience?

Operational resilience refers to an organization’s ability to adapt and respond to disruptions or unexpected events while maintaining continuous operations, delivering products and services to customers without interruption (and, ideally, without them even noticing).

Achieving operational resilience involves identifying, analyzing, and managing operational risks, like cyber attacks, natural disasters, supply chain disruptions, and (most of all) technical failures. Google Cloud’s recent europe-west9 outage (April 2023) took down GCP’s entire europe-west9 region for a full day, with zones and services coming back gradually over the course of several days. The incident – the result of a fire and subsequent water damage in a Paris co-location data center – also triggered a four-hour outage of Google’s Cloud Console and GCE Global Control plane services worldwide.

Entire region outages are rare, but they can – and do – happen to every cloud provider, with potentially disastrous consequences. Let’s examine another full region outage in recent history to understand just what happens when an entire cloud platform region goes down.

The December 7, 2021 incident that took out AWS us-east-1 for eight hours illustrates the classic sysadmin haiku: “It’s not DNS. / There is no way it’s DNS. / It was DNS.” An internal DNS and monitoring systems failure, triggered by traffic congestion after an automatic network scaling operation went awry, caused a fatal cascade of connection errors and retries. The impact was immediate and widespread, affecting millions of users in us-east-1, AWS’s largest (with six zones) and most heavily trafficked region. Scores of websites and services, from granular to grandiose, went down: Roombas sat idle, and schools were forced to call off classes and exams after losing access to educational platforms. Food deliveries were suddenly canceled, and even some of Amazon’s own operations ground to a halt.

That particular Tuesday was tough for those of us who had our Whole Foods orders evaporate – but what if that outage had also taken down one of America’s biggest banks for those eight hours? Millions of customers would suddenly be unable to access their money or use their credit/debit cards. All the businesses relying on that bank to process their transactions could have been paralyzed. The dent in the US economy that day would have been deep indeed.

The thing is, this could really happen. This particular outage was noteworthy for its far-reaching and very public impact, but the truth is that cloud providers experience outages and service failures all the time. Most are limited in scope and quickly resolved, but major outages eventually, and inevitably, do happen.

Why operational resilience is important

Outages like this one don’t just anger users. They underscore the severity of the disaster that can arise from having so much economic activity reliant on technology from just a few vendors. They also are a vivid illustration of why operational resilience is important.

There is one business sector where a major service failure would have a particularly disastrous impact: the financial sector. If a banking institution were to experience a service disruption, conducting transactions would become impossible and life would grind to a halt for every customer, consumer and business alike. Both banks and the governing bodies in the countries where they operate understand that while serious cloud provider outages are uncommon, they are also basically inevitable.

Early real-world evidence of this understanding came from financial services organizations in particular: banks began posting job listings for “Never Down Engineers,” and articles about never down architecture began to appear. The awareness that detailed planning for surviving the worst outages, however rare, is a business essential then spread across all sectors, giving rise to the concept of operational resilience. Operational resilience was initially viewed as part of business continuity planning, to be handled privately by individual companies. However, the stakes are now high enough in some sectors that governments have begun stepping in.

“Financial market infrastructure firms are becoming increasingly dependent on third-party technology providers for services that could impact the financial stability of the UK if they were to fail or experience disruption,” said UK Deputy Governor for Financial Stability Jon Cunliffe in a joint announcement made by the Bank of England, the Prudential Regulation Authority (PRA), and the Financial Conduct Authority (FCA) describing potential resilience measures for critical third party services.

Case in point: Data collected for a 2020 Bank of England report indicate that two-thirds of UK financial institutions rely on one of two major cloud providers. Financial institutions are some of the most technologically savvy organizations in the world, adopting risk mitigation strategies that include running hybrid cloud infrastructure and multi-region infrastructure; in the case of most outages, service to their customers would be unaffected. However, the black swan event of a full regional outage for a cloud provider (due to extreme physical climate conditions, a cyberattack, or a fire at a major data center) can potentially disrupt the entire country’s financial system.

In other words, a cloud region outage taking out London’s eu-west-2 / europe-west2 / UK South region (depending on the provider), even temporarily, could have a catastrophic impact on the UK’s economy. This nightmare scenario is the driving force behind operational resiliency laws.

Operational resilience vs business continuity

Operational resilience and business continuity are closely intertwined, but they are not the same. Operational resilience is about ensuring your mission-critical application can handle whatever digital chaos comes its way, up to and including a cloud service provider experiencing a global outage. The point of operational resilience is that, if an outage strikes, your end users will never even know.

Business continuity, on the other hand, is a subset of operational resilience. Business continuity primarily focuses on ensuring that critical business functions can continue during and after a disruptive event.

Think about these in terms of playing a video game. You’re in the final boss battle (talk about mission critical!) and suddenly the game crashes. Business continuity is the equivalent of being able to go back to the last save point and pick up close to where you left off. Operational resilience is where the game experiences the same glitch but, since the software was architected for zero downtime, you never noticed there was any kind of issue because your gameplay was never interrupted.
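To make the analogy concrete, here is a toy sketch of the two recovery models. Everything in it is hypothetical and invented for illustration (the class names, the two-replica setup, the integer “progress” counter); real systems replicate far richer state and must handle replication failures, which this deliberately ignores.

```python
from dataclasses import dataclass, field


@dataclass
class CheckpointedGame:
    """Business continuity: progress survives only up to the last save point."""

    progress: int = 0
    saved: int = 0

    def play(self, steps: int) -> None:
        self.progress += steps

    def save(self) -> None:
        self.saved = self.progress

    def crash_and_recover(self) -> int:
        # Everything since the last save is lost, and the player notices.
        self.progress = self.saved
        return self.progress


@dataclass
class ReplicatedGame:
    """Operational resilience: every update is applied to all replicas."""

    replicas: list[int] = field(default_factory=lambda: [0, 0])
    active: int = 0

    def play(self, steps: int) -> None:
        # Simplified synchronous replication: all replicas get the update.
        self.replicas = [p + steps for p in self.replicas]

    def crash_and_recover(self) -> int:
        # The active replica fails; another takes over with the full state.
        self.active = (self.active + 1) % len(self.replicas)
        return self.replicas[self.active]


continuity = CheckpointedGame()
continuity.play(10)
continuity.save()
continuity.play(5)  # unsaved progress

resilient = ReplicatedGame()
resilient.play(10)
resilient.play(5)

print(continuity.crash_and_recover())  # back at the save point
print(resilient.crash_and_recover())   # nothing lost
```

In the first model the crash costs the player everything since the last save; in the second, the surviving replica already holds the latest state, so the failure is invisible from the player’s side.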

Operational resilience regulations

Governments around the globe are taking legislative steps to impose technical requirements for financial institutions to reduce risk, but the UK is leading the way in holding financial firms responsible and accountable for their operational resiliency. One of the keystone legislative requirements: regulators have instructed financial firms to meet operational resilience requirements, overlaying governmental oversight on top of internal decision-making. So long as the results meet the required minimum level of operational resilience, CIOs are able to choose from scenarios that best suit their needs. Hybrid cloud (operating an additional physical data center to supplement their primary cloud infrastructure) and multi-cloud (running on multiple cloud provider platforms) are two of the options for satisfying these requirements.

Other countries are also pursuing legislation and regulatory initiatives to improve operational resilience in their financial sectors. One of the most significant is the European Union’s Digital Operational Resilience Act (DORA), which seeks to ensure that all financial market participants have effective strategies and capabilities in place to manage operational resilience. DORA is expected to apply to all digital service providers, including cloud service providers, search engines, e-commerce platforms, and online marketplaces, regardless of whether they are based within or outside the EU. DORA entered into force in January 2023. With an implementation period of two years, financial entities will be expected to be compliant with the regulation by early 2025.

Operators of essential services (like utilities, transportation and logistics companies, and healthcare providers) have already been required to meet standardized security and network regulations. Further increasing operational resilience by requiring additional data centers, deploying in multiple regions, and even deploying across multiple clouds is widely seen as the logical next step on the UK’s regulatory agenda. It is a step other countries are likely to pursue, following in the UK’s wider-sector footprints here as they have when it comes to financial services companies.

In the United States, the Federal Reserve, the Office of the Comptroller of the Currency (OCC), and the Federal Deposit Insurance Corporation (FDIC) have also released a whitepaper, Sound Practices to Strengthen Operational Resilience. The paper provides detailed guidance on operational resilience, emphasizing the need for financial institutions to identify and manage risks associated with their critical operations, but so far it is just that: guidance. More recently, however, the FTC has issued a Request for Information seeking public commentary “about the competitive dynamics of cloud computing, the extent to which certain segments of the economy are reliant on cloud service providers, and the security risks associated with the industry’s business practices” (emphasis added). Analysts see this as early evidence that the US is also beginning to consider technical regulation and requirements for critical services to ensure operational resiliency.

The pattern seems quite evident: such regulations will only expand in number and scope over the coming months and years, and are almost certain to involve most sectors and localities eventually. What can be done now to increase operational resilience, even before they become the law of the land?

How to achieve operational resilience

The unexpected truth of operational resilience is that it doesn’t matter how many cloud providers you run on, or which one(s), or even that one of your potential backups runs on bare metal. What truly matters is hardwiring operational resilience into an application architecture.

Most orgs are built to run on a single cloud provider, because that has generally been the most economical and straightforward path for almost every use case. Until operational resiliency regulations began to look increasingly inevitable, most had no compelling reason to ponder a multi-cloud strategy, which means they have not needed to deeply consider, or even consciously recognize, the technical implications of a single provider strategy.

The common belief is that your cloud provider is just giving you a platform, and you’re just building on top of it. The logical conclusion, then, is that it would not be a big deal to basically lift and shift your application onto a second cloud provider: it should just require rewiring some network connections and APIs. The reality, though, is that every service you use, every piece of your application (whether it’s native or third-party, custom-coded or open source), also has to talk with this new platform. And each cloud provider has its own proprietary way for each service within your app to communicate with it.

Instead of a straightforward lift and shift, your entire application must be rewritten to this new platform’s own unique proprietary standards. This reality of needing to validate different standards for each cloud provider makes achieving operational resilience through a multi-cloud approach achingly complex.
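One common way to contain that complexity is to put a thin, provider-neutral interface between your application logic and each provider’s SDK. The sketch below shows the idea for object storage. Note that `VendorAClient` and `VendorBClient` are hypothetical in-memory stand-ins, not real SDKs, and the method names are invented purely to show two different call shapes.

```python
from typing import Protocol


class ObjectStore(Protocol):
    """The only storage interface application code is allowed to depend on."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class VendorAClient:
    """Hypothetical in-memory stand-in for one provider's SDK."""

    def __init__(self) -> None:
        self._blobs: dict[tuple[str, str], bytes] = {}

    def upload_blob(self, bucket: str, name: str, body: bytes) -> None:
        self._blobs[(bucket, name)] = body

    def download_blob(self, bucket: str, name: str) -> bytes:
        return self._blobs[(bucket, name)]


class VendorBClient:
    """Hypothetical stand-in for a second provider with a different call shape."""

    def __init__(self) -> None:
        self._files: dict[str, bytes] = {}

    def write(self, path: str, payload: bytes) -> None:
        self._files[path] = payload

    def read(self, path: str) -> bytes:
        return self._files[path]


class VendorAStore:
    """Adapter: translates ObjectStore calls into VendorAClient calls."""

    def __init__(self, client: VendorAClient, bucket: str) -> None:
        self._client = client
        self._bucket = bucket

    def put(self, key: str, data: bytes) -> None:
        self._client.upload_blob(self._bucket, key, data)

    def get(self, key: str) -> bytes:
        return self._client.download_blob(self._bucket, key)


class VendorBStore:
    """Adapter: translates ObjectStore calls into VendorBClient calls."""

    def __init__(self, client: VendorBClient, prefix: str) -> None:
        self._client = client
        self._prefix = prefix

    def put(self, key: str, data: bytes) -> None:
        self._client.write(f"{self._prefix}/{key}", data)

    def get(self, key: str) -> bytes:
        return self._client.read(f"{self._prefix}/{key}")


def save_report(store: ObjectStore, report_id: str, text: str) -> None:
    """Application logic: knows nothing about which provider is underneath."""
    store.put(f"reports/{report_id}", text.encode("utf-8"))


def load_report(store: ObjectStore, report_id: str) -> str:
    return store.get(f"reports/{report_id}").decode("utf-8")


stores = [
    VendorAStore(VendorAClient(), bucket="app-data"),
    VendorBStore(VendorBClient(), prefix="app-data"),
]
for store in stores:
    save_report(store, "q1", "all systems nominal")
    print(type(store).__name__, load_report(store, "q1"))
```

The provider-specific code is confined to the two small adapter classes; `save_report` and `load_report` run unchanged against either one. The hard part in practice is that every service your application touches needs an equivalent seam.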

For example, using Kubernetes for workload portability is a commonly accepted application architecture best practice. But consider that your Kubernetes operator needs to be equally portable. When initially architecting your app, the path of least resistance and greatest convenience is GKE or EKS or whatever your provider’s native solution is — right up until the moment it’s time to change to or add a different cloud provider, or even to go hybrid with a physical data center. Then, suddenly, K8s can seem anything but portable.

The same is true for every component of your application. As another example, you may think, “We use open source PostgreSQL on Amazon.” But, well, actually you don’t. You use RDS PostgreSQL, which has a proprietary API and has a proprietary way for you to talk to it. And that is true for every service on Amazon. You might think you are using open source, but no — you’re using the proprietary Amazon API in front of those open source projects.
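One way to limit that coupling on the application side is to depend only on the standard PostgreSQL connection URI, so that where the database runs becomes a configuration detail. The sketch below is a minimal illustration using only the Python standard library; the `DATABASE_URL` variable name, the `PostgresTarget` type, and the example hostnames are assumptions made for this example. It does nothing about provisioning, backups, or failover, which is where provider-specific APIs usually live.

```python
import os
from dataclasses import dataclass
from urllib.parse import unquote, urlsplit


@dataclass(frozen=True)
class PostgresTarget:
    """Where the application should connect; nothing provider-specific."""

    host: str
    port: int
    database: str
    user: str


def parse_database_url(url: str) -> PostgresTarget:
    """Parse a standard postgresql:// connection URI into its parts."""
    parts = urlsplit(url)
    if parts.scheme not in ("postgresql", "postgres"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("database URL must include a host")
    return PostgresTarget(
        host=parts.hostname,
        port=parts.port or 5432,  # 5432 is PostgreSQL's default port
        database=unquote(parts.path.lstrip("/")),
        user=unquote(parts.username or ""),
    )


def current_target() -> PostgresTarget:
    # DATABASE_URL is an assumed variable name for this sketch.
    return parse_database_url(os.environ["DATABASE_URL"])


# The same application code accepts a managed cloud endpoint or a
# self-hosted server; only the configured URL differs (hosts are made up).
managed = parse_database_url("postgresql://app@db.provider-a.example:6543/orders")
self_hosted = parse_database_url("postgresql://app@10.0.0.12/orders")
print(managed)
print(self_hosted)
```

If the only thing your code knows about the database is this URI, moving it to another provider or to your own hardware is a configuration change rather than a rewrite, at least for the query path.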

Hardwiring operational resilience into your application, then, means making every piece of your application architecture platform agnostic. A cloud-agnostic application architecture also allows for easier scalability and flexibility: as the application grows, different services or platforms can be added or replaced without the need for major code changes. And once your applications are interoperable across different cloud service providers, satisfying any operational resilience regulations that eventually arise will be straightforward.

We are reaching the end of the “single cloud provider as automatic best practice” era. Centering your application architecture on cloud platform-agnostic services and tools has become an essential survival strategy. Beyond avoiding vendor lock-in, doing this is also the key to inherent operational resilience. The time is coming, and maybe sooner than we think, when legislators and regulators will act to regulate the way individual companies approach operational resilience, all in the name of public good.