Santander: Hacking human error to achieve operational resilience

If there is one inevitable truth in technology, it is that disasters happen. Calamities like fires, climate change-driven disasters, or major cloud provider outages have one thing in common: you can’t control them. What you can control, however, is how you handle them when they happen. You can practice resiliency.

At RoachFest23, CockroachDB’s most recent customer conference, Thomas Boltze (Head of Cloud and Engineering Excellence at Santander) shared the core tenets of resilient systems, and how to practice them in the real world. These are his words.

Root causes

When we talk about failures and outages, we talk about root causes. But what are we not talking about? That the number one root cause of system failures is human error.

When something goes wrong, it’s not your hard drive failing. Somebody did something they should not have done. Someone decided it was a good idea to build the data centers for two different regions right next to each other, which was fine until there was a fire. Or your hard drive fills up, your server stops working, but you didn’t know because you didn’t have alerting in place. Maybe you had monitoring, but your monitoring wasn’t good enough to pick up the fact that during a rolling deployment, blue-green, the deployment just destroyed your service and kept deploying until your entire service was down. What do all of these disasters have in common? Humans are at the root. And humans are fallible.

Outages happen all the time, for all kinds of reasons, and often there’s very little you can do about them when they are happening. But you, as a human, can do something about how you handle them when they happen: you can practice resiliency.
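The rolling-deployment failure described above is the kind of human error that a health gate can catch. What follows is only a minimal sketch in Python, not Santander's tooling: the names rolling_deploy, deploy, is_healthy, and max_failures are hypothetical, and a real pipeline would call its own deployment and monitoring systems in their place.

```python
from typing import Callable, Iterable, List


def rolling_deploy(
    instances: Iterable[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_failures: int = 0,
) -> List[str]:
    """Deploy one instance at a time and halt once too many fail their check.

    Returns the instances that were updated and passed the health check.
    Raises RuntimeError as soon as the failure budget is exceeded, so the
    remaining instances keep running the previous version.
    """
    updated: List[str] = []
    failures = 0
    for instance in instances:
        deploy(instance)  # hypothetical hook: push the new version
        if is_healthy(instance):  # hypothetical hook: probe the instance
            updated.append(instance)
            continue
        failures += 1
        if failures > max_failures:
            raise RuntimeError(
                f"halting rollout: {instance} unhealthy after deploy "
                f"({failures} failure(s), budget {max_failures})"
            )
    return updated
```

With the default budget of zero, the first unhealthy instance stops the rollout, so a bad release takes down one instance rather than the entire service.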

Resilience in the real world

Resiliency is the capacity to withstand or to recover quickly from difficulty and to overcome adversity. But what does “good” resiliency look like in the real world?

When evaluating your own systems for their ability to withstand outages and failures, these are the questions to ask yourself:

Can your systems withstand AZ / datacenter / region failures?
Can your systems withstand loss of core services?
Do you run game days?
Do you run Chaos Monkey – in production?
Do you stress test your systems on a regular basis?
Are your change sets small?

Being able to answer these questions in the affirmative is how we achieve resiliency in computer systems. But getting there requires (you guessed it) lots of human work.
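The first question on that list can be made concrete by checking every availability zone in turn rather than just one. The Python sketch below is an illustration under stated assumptions, not a description of the Payments Hub: it assumes a cluster that needs a strict majority of its nodes to stay available, and the node names, zone names, and function names are all hypothetical.

```python
from typing import Dict, List


def surviving_nodes(placement: Dict[str, str], failed_az: str) -> List[str]:
    """Return the nodes that are still up when failed_az goes down."""
    return [node for node, az in placement.items() if az != failed_az]


def has_quorum(placement: Dict[str, str], failed_az: str) -> bool:
    """Assumed rule: a strict majority of all nodes must survive."""
    alive = len(surviving_nodes(placement, failed_az))
    return alive > len(placement) // 2


def drill_every_az(placement: Dict[str, str]) -> Dict[str, bool]:
    """Simulate losing each zone in turn and report whether quorum holds."""
    zones = sorted(set(placement.values()))
    return {az: has_quorum(placement, az) for az in zones}


if __name__ == "__main__":
    # Hypothetical placement: two of three nodes share one zone.
    lopsided = {"n1": "az-a", "n2": "az-a", "n3": "az-b"}
    for az, ok in drill_every_az(lopsided).items():
        print(f"lose {az}: {'quorum kept' if ok else 'quorum lost'}")
```

In the lopsided placement, a drill that only ever takes down az-b would pass while losing az-a breaks the cluster, which is why the sketch loops over every zone.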

Hacking humans to build resilient systems

Fortunately there are human resiliency hacks that are available to everyone: Mindset, culture, observability, curiosity, everything as code, focus, and shared responsibility. Humans love stories, right? This is the story of how all of these factors came together, over time, to build a truly resilient payments system at Santander.

When I first joined the Payments Hub, there were three availability zones. There was Pivotal Cloud Foundry. There were RDS instances in multiple regions, and there were Redis MQ clusters across multiple AZs. And people thought it was good, but it wasn’t. They thought, oh, if a failure happens, the system will recover on its own.

So I said, “OK, then, let’s test this,” and we did our first disaster recovery test – not in production, of course. We ran load through it and took down an AZ. Then we spent the next six hours cleaning up the mess that we made. The team said, like, “Holy shit, how could this happen? What happened here?” They found a few things. They fixed a few things.

Then we ran the same test again. You know what happened? Same thing. Because this time we took down a different AZ, and one of the master nodes that happened to be in a clean AZ the first time was now in the failing AZ. And so we had another massive cleanup. People started to realize that this is not going to be easy. This system does not heal itself. Our infrastructure does not heal itself. It doesn’t recover on its own. It’s a lot of work.

Then we spent the next six months or so testing this every few weeks, and every few weeks we found a new thing that failed, and we found a new thing that failed, and we found a new thing that failed. And we fixed them and fixed them and fixed them. Then, Amazon decided to disaster test for us in production. They had what they called a thermal event – a fire – in the data center, and that took down an AZ, boom.

If that had happened six months before, we would’ve been in a bad place. This time the systems kept processing the payments. Now we can survive an AZ-wide outage. But that’s not good enough! We are processing payments. They’re essential, systemic services that cannot fail. People need to pay for their taxi, their coffee, their shopping bill. They can’t wait five hours until we restore service. That’s not an option.

So we’ve launched an MVP for multi-region, so we can more or less survive region-wide outages now. Once we have that in place, the next logical step is to go to multiple clouds. Yes, it’s work, but we know how to do this. So, if Amazon has a two-region outage… Never happens to Amazon. Right? Never ever. Apart from the time that they took down two regions simultaneously with a DNS failure, of course. Events like these are just going to happen, and this is why our systems are designed explicitly to be resilient to these sorts of failures. Multi-cloud is your answer there, running active-active across as many clouds as you can. It’s hard work, but it’s necessary.

Check out Thomas Boltze’s full RoachFest presentation below to hear the rest of the story: how shifting mindset and cultivating curiosity at Santander (not to mention automating everything possible) led to a culture shift. This culture shift led to shared responsibility, partnerships, and focus to create a truly resilient team that is able to deploy a fully resilient system. One where nobody ever has to dig in their pockets for cash to pay for their coffee because the payments system has gone offline — and, really, who carries cash anymore?