How to run chaos tests in a multi-cloud environment

Last edited on December 9, 2019

    This year, as every year, Black Friday and Cyber Monday stressed e-commerce systems to their breaking points. Major retailers like H&M and Nordstrom Rack experienced the kinds of costly outages that keep SREs up at night. Multi-cloud infrastructure is sometimes offered as a panacea for these kinds of outages. But multi-cloud deployments are not a cure-all. In fact, they often introduce new complexities into the system that need to be sniffed out.

    But sniffing out bugs in multi-cloud environments is, by nature, complicated. Ana Medina, a chaos engineer from Gremlin, spoke at ESCAPE/19 about how to do it, including a detailed list of the kinds of errors to search for and checklists of questions to ask.

    Maslow’s Hierarchy of Multi-Cloud Chaos Tests

    Chaos engineers create tests to proactively expose bugs. In a multi-cloud environment, this means presuming that your cloud vendors are going to fail. A standard course of chaos engineering tests for multi-cloud architectures starts by stress-testing your cloud providers to their breaking points, and moves up the hierarchy of needs to things that are less crucial. Here's what Ana suggests you pay attention to in your chaos tests, in order of importance:

    1. Basic Functionality: Does your failover system work?

    At the very core, you need to test whether the system functions. Ana suggests spinning up clusters to see how each cloud you use (including your bare metal machines, if applicable!) behaves in stressful situations. Make sure to measure the following outcomes:

    • How does each cloud handle failure?

    • If a host is shut down, how long does it take to spin up another?

    • How long does it take for the monitoring to catch this exchange?

    • Are the clouds located somewhere that makes sense for the applications in question?

    • Is the multi-cloud control plane working?

    That way, when a Black Friday traffic overload inevitably occurs, you will already have run a fire drill for what happens when a provider goes down.
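
    As a rough illustration of the timing questions in this section, the sketch below computes time-to-detect and time-to-recover from a timeline of events recorded during an experiment. The `Event` type, the event names (`host_killed`, `alert_fired`, `replacement_healthy`), and the example timestamps are all hypothetical and not taken from Ana's talk or from Gremlin; in practice the timeline would come from your own chaos tooling and monitoring.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Event:
    """One timestamped observation recorded during a chaos experiment."""
    t: float    # seconds since the experiment started
    kind: str   # hypothetical names: "host_killed", "alert_fired", "replacement_healthy"


def first_time(events: List[Event], kind: str, after: float = 0.0) -> Optional[float]:
    """Return the earliest timestamp of `kind` at or after `after`, if any."""
    times = [e.t for e in events if e.kind == kind and e.t >= after]
    return min(times) if times else None


def failover_metrics(events: List[Event]) -> Dict[str, Optional[float]]:
    """Compute detection and recovery delays relative to the host shutdown."""
    killed = first_time(events, "host_killed")
    if killed is None:
        raise ValueError("no host_killed event recorded")
    alert = first_time(events, "alert_fired", after=killed)
    healthy = first_time(events, "replacement_healthy", after=killed)
    return {
        # None means the event never happened, which is itself a finding.
        "time_to_detect_s": None if alert is None else alert - killed,
        "time_to_recover_s": None if healthy is None else healthy - killed,
    }


if __name__ == "__main__":
    # Made-up timeline for one cloud; run the same experiment per provider.
    timeline = [
        Event(10.0, "host_killed"),
        Event(42.5, "alert_fired"),
        Event(95.0, "replacement_healthy"),
    ]
    print(failover_metrics(timeline))
```

    Running the same experiment against each cloud (and any bare metal machines) gives comparable numbers for the first three questions in the list.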

    2. Data Consistency: Is your data correct across clouds?

    Another important chaos test: measure consistency between clouds. As Ana put it, if a company can see ahead of time that data loss is a possibility in its current failover plan, then it can adjust accordingly before actual customer data is lost.

    • If your primary is shut down on the first cloud provider and its replica takes over as primary on the second cloud provider, is the Cloud 1 primary consistent with the Cloud 2 primary?

    • If you shut down the primary in one of them, does the replica come back as primary without suffering any data loss?

    • How does latency affect the connection between clouds?

    • Is your cache layer working properly?
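
    One way to approach the consistency questions above is to snapshot the same table on both clouds around a failover and compare the results. The sketch below is a minimal example under stated assumptions: rows are represented as hypothetical `(primary key, serialized value)` pairs, and the function names are illustrative rather than part of any real database or chaos tool.

```python
import hashlib
from typing import Dict, List, Tuple

# Hypothetical row shape: (primary key, serialized value).
Row = Tuple[str, str]


def fingerprint(rows: List[Row]) -> Dict[str, str]:
    """Map each primary key to a SHA-256 digest of its serialized value."""
    return {
        key: hashlib.sha256(value.encode("utf-8")).hexdigest()
        for key, value in rows
    }


def diff_replicas(cloud1: List[Row], cloud2: List[Row]) -> Dict[str, List[str]]:
    """Report keys that are missing or differ between two snapshots."""
    a, b = fingerprint(cloud1), fingerprint(cloud2)
    return {
        "missing_in_cloud2": sorted(a.keys() - b.keys()),
        "missing_in_cloud1": sorted(b.keys() - a.keys()),
        "mismatched": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }


if __name__ == "__main__":
    # Made-up snapshots: Cloud 1 before shutdown, Cloud 2 after promotion.
    before = [("order-1", "paid"), ("order-2", "paid"), ("order-3", "pending")]
    after = [("order-1", "paid"), ("order-2", "pending")]
    print(diff_replicas(before, after))
```

    Any non-empty list in the result points at writes that did not survive the failover intact, which is the data loss the experiment is meant to surface before customers do.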

    3. Cost: What does your failover multi-cloud pricing look like?

    You might be able to survive a data center outage by failing over to AWS, but get hit with a surprising bill after the fact.

    Once the foundation of a multi-cloud infrastructure has been laid and tested, further chaos engineering can help you probe into the pricing during a failover. For example, Ana suggests you ask your team:

    • What are our compute costs in different failover scenarios?

    • Are any SLAs in danger, and what might those cost in fees?

    • Are we hitting limits on our cloud providers?
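
    The compute-cost and provider-limit questions lend themselves to simple arithmetic once a chaos test tells you how many instances a failover actually needs. The sketch below is illustrative only: the `FailoverScenario` fields, the prices, and the quota are placeholder values, not real quotes or limits from any provider, and SLA fees are left out because they depend on your contracts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailoverScenario:
    """Inputs for one failover scenario; every value here is a placeholder."""
    name: str
    instances_needed: int             # observed during the chaos test
    hourly_price_per_instance: float  # placeholder price, not a real quote
    duration_hours: float             # how long you expect to stay failed over
    instance_quota: int               # provider-side limit on instances


def compute_cost(s: FailoverScenario) -> float:
    """Estimated compute cost: instances x hourly price x hours."""
    return s.instances_needed * s.hourly_price_per_instance * s.duration_hours


def exceeds_quota(s: FailoverScenario) -> bool:
    """True if the failover would need more instances than the quota allows."""
    return s.instances_needed > s.instance_quota


def report(scenarios: List[FailoverScenario]) -> List[str]:
    """One summary line per scenario, flagging quota problems."""
    lines = []
    for s in scenarios:
        flag = "OVER QUOTA" if exceeds_quota(s) else "within quota"
        lines.append(f"{s.name}: estimated compute cost {compute_cost(s):.2f} ({flag})")
    return lines


if __name__ == "__main__":
    # Made-up scenario: a data center outage absorbed by one cloud provider.
    scenarios = [
        FailoverScenario("datacenter-outage", 40, 0.5, 6.0, 32),
    ]
    for line in report(scenarios):
        print(line)
```

    A scenario that comes back over quota is worth raising with the provider before the outage, since quota increases are rarely instant.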

    Use Chaos Engineering to Verify Your Multi-Cloud Strategy

    Failures are going to happen, whether the cause is a vendor, a natural disaster, or some code glitch that got overlooked. In a multi-cloud environment, chaos engineering best practices will not only help you prepare for these failures, they will also help confirm whether your multi-cloud strategy is working.

    Ana’s whole presentation, including a live demo of what chaos tests look like in Gremlin, is available here.

    ESCAPE/19 hosted other speakers with expertise in multi-cloud infrastructure, including Dan Papandrea, who spoke about security in multi-cloud, and Spencer Kimball, who spoke about CockroachDB and the challenge of application data in global, multi-cloud deployments.

    Visit this page to watch all the talks from ESCAPE/19.