This year, as every year, Black Friday and Cyber Monday stressed e-commerce systems to their breaking points. Major retailers, including H&M and Nordstrom Rack, experienced the kinds of costly outages that keep SREs up at night. Multi-cloud infrastructure is sometimes offered as a panacea for these kinds of outages. But multi-cloud deployments are not a band-aid. In fact, they often introduce new complexities into the system that need to be sniffed out.
But sniffing out bugs in multi-cloud environments is, by nature, complicated. Ana Medina, a chaos engineer from Gremlin, spoke at ESCAPE/19 about how to do it, including a detailed list of the kinds of errors to search for and checklists of questions to ask.
Chaos engineers create tests to proactively expose bugs. In a multi-cloud environment, this means presuming that your cloud vendors are going to fail. A standard course of chaos engineering tests for multi-cloud architectures starts by stress-testing your cloud providers to their breaking points, and moves up the hierarchy of needs to things that are less crucial. Here's what Ana suggests you pay attention to in your chaos tests, in order of importance:
At the very core, you need to test whether the system functions. Ana suggests spinning up clusters to see how each cloud you use (including your bare metal machines, if applicable!) behaves in stressful situations. Make sure to measure the following outcomes:
That way, when a Black Friday traffic overload incident inevitably occurs, you’ll already have run a fire drill for what happens when your provider goes down.
Another important chaos test: measure consistency between clouds. In Ana’s words, if a company can see ahead of time that data loss is a possibility in their current failover plan, then they can adjust accordingly before actual customer data is lost.
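As a rough illustration of what a consistency check could look like, the sketch below compares snapshots of the same records taken from two clouds after a failover drill and reports keys that are missing or that differ. The function names, the snapshot shape (a mapping of key to JSON-serializable record), and the sample data are assumptions made for illustration; they do not come from Ana's talk or from Gremlin's tooling.

```python
import hashlib
import json


def record_digest(record):
    """Return a stable SHA-256 digest for a JSON-serializable record."""
    # Sorting keys makes the digest independent of dictionary ordering.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare_replicas(primary, secondary):
    """Compare two key -> record mappings and report where they diverge."""
    missing = sorted(k for k in primary if k not in secondary)
    extra = sorted(k for k in secondary if k not in primary)
    mismatched = sorted(
        k
        for k in primary
        if k in secondary
        and record_digest(primary[k]) != record_digest(secondary[k])
    )
    return {"missing": missing, "extra": extra, "mismatched": mismatched}


if __name__ == "__main__":
    # Hypothetical snapshots taken from each cloud after a failover drill.
    cloud_a = {
        "order-1": {"total": 40},
        "order-2": {"total": 15},
        "order-3": {"total": 9},
    }
    cloud_b = {
        "order-1": {"total": 40},
        "order-2": {"total": 51},
    }
    print(compare_replicas(cloud_a, cloud_b))
```

A non-empty "missing" or "mismatched" list after a drill is the kind of early warning the paragraph above describes: evidence that the current failover plan can lose or corrupt data before any real customer is affected.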
You might be able to survive a data center outage by failing over to AWS, but get hit with a surprising bill after the fact.
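One way to avoid that surprise is to estimate the bill before running the drill. The sketch below is a back-of-the-envelope calculation only: the function name, the cost model (compute hours plus data egress), and every rate in the example are hypothetical placeholders, not real AWS prices. Substitute figures from your own provider's pricing pages.

```python
def estimate_failover_cost(instances, hours, hourly_rate, egress_gb, egress_rate_per_gb):
    """Estimate the extra spend of running a workload on the failover cloud.

    All inputs are assumptions supplied by the caller; this simplified model
    covers only compute time and data egress.
    """
    compute = instances * hours * hourly_rate
    egress = egress_gb * egress_rate_per_gb
    return {"compute": compute, "egress": egress, "total": compute + egress}


if __name__ == "__main__":
    # Placeholder numbers: 20 instances for a 6-hour outage at a made-up
    # rate of 0.50 per instance-hour, plus 500 GB of egress at 0.25 per GB.
    estimate = estimate_failover_cost(
        instances=20,
        hours=6,
        hourly_rate=0.50,
        egress_gb=500,
        egress_rate_per_gb=0.25,
    )
    print(estimate)  # {'compute': 60.0, 'egress': 125.0, 'total': 185.0}
```

Comparing an estimate like this with the actual invoice after a chaos experiment shows whether the cost model is missing something, such as storage, load balancers, or support tiers.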
Once the foundation of a multi-cloud infrastructure has been laid and tested, further chaos engineering can help you probe into the pricing during a failover. For example, Ana suggests you ask your team:
Failures are going to happen, whether it’s the fault of a vendor, a natural disaster, or some code glitch that got overlooked. In a multi-cloud environment, chaos engineering best practices will not only help you prepare for these failures but also help confirm whether your multi-cloud strategy is working.
Ana’s whole presentation, including a live demo of what chaos tests look like in Gremlin, is available here.
ESCAPE/19 hosted other speakers with expertise on securing services in multi-cloud infrastructure, including Dan Papandrea, who spoke about security in multi-cloud, and Spencer Kimball, who spoke about CockroachDB and the challenge of application data in global, multi-cloud deployments.
Visit this page to watch all the talks from ESCAPE/19.