Payments Processing: High Availability Without the Asterisk

Last edited on June 6, 2024


    The information in this article was current and accurate as of the publication date, June 6, 2024. Check with the cloud vendors directly for the most up-to-date information on their current uptime SLAs.

    When any large enterprise application processes millions of events per day, it is inevitable that something will eventually break. This is why Murphy’s law is so often quoted in software development: whatever can go wrong will, at some point, actually go wrong.

    Software engineering’s strategy for mitigating Murphy’s law is design for errors: build your application with the recognition that downtime is just going to happen, and include robust disaster recovery capabilities. That is a comfortingly pragmatic philosophy, but when your application’s main job is to process financial transactions and payments, you have zero tolerance for any kind of downtime or data loss. Ensuring the reliability and availability of the system is job one (also job two, job three, job n…) when software architects design payments platforms.

    Availability is so crucial for financial and payments services in particular that one of the world’s largest banks created a “Never-down architecture” team to ensure that its systems and applications, both internal and customer-facing, maintain the highest possible availability at all times.

    What is high availability in a database?

    High availability refers to the ability of a system to remain operational and accessible even in the event of hardware or software failures, network outages, or other disruptions. In the context of a payments processing system, a highly available database ensures that transactions can be processed and recorded reliably, without interruption or data loss.

    When almost every RDBMS on the market claims to offer high availability, though, it’s important to understand exactly what their version of high availability really means in the context of your specific use case and business needs.

    Distributed database SLAs: Availability

    “Never-down” should be something that is actually measurable, not just sales speak. One way to put data behind a database’s reliability claims is with SLAs: service level agreements from the vendor, which define the level of service they guarantee to supply your application at all times. SLAs cover many different types of services but, for any type of payments processing application, a database vendor’s availability SLA is by far the most crucial.

    An availability SLA specifies the percentage of time the database is guaranteed to be operational. While the notion of a database that offers 100% availability is a pleasant daydream, even large and critical systems (like the VISA card payments network or Amazon Web Services, for example) don't promise 100% availability. CockroachDB, for instance, offers a 99.999% uptime SLA. This is recognition that when an application processes thousands (or even tens of thousands) of queries per second, adverse events inevitably occur and can take a system down for seconds, minutes or even hours.
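
    To make the percentage concrete, here is a minimal sketch of how measured availability could be computed from a list of outage durations. The function name `measured_availability`, the 30-day month, and the two outages are illustrative assumptions, not any vendor's method of measurement.

```python
from datetime import timedelta

def measured_availability(outages, period):
    """Return the percentage of `period` that was not spent in an outage."""
    downtime = sum(outages, timedelta())  # total downtime as a timedelta
    return 100.0 * (1 - downtime / period)

# Hypothetical month: two outages totalling 4 minutes 30 seconds.
outages = [timedelta(minutes=3), timedelta(seconds=90)]
month = timedelta(days=30)  # assumed billing period

availability = measured_availability(outages, month)
print(f"{availability:.4f}%")  # 99.9896%
```

    Under these assumptions, less than five minutes of downtime in a month is already enough to fall just short of four nines.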

    When 100% uptime ≠ zero downtime

    If a database promises you 100% uptime, that 100% SLA is almost certainly not a guarantee of zero downtime. It simply means that, in the event of any downtime, they will refund (or, most often, issue a service credit for) a percentage of your monthly bill.

    This is not to say that database providers put less than 100% of their effort into keeping their RDBMS service running like (atomic) clockwork. To be fair, detecting a database outage itself can be almost impossible until it’s too late. Deploying a distributed database in production involves dozens of dependencies, on everything from networks and intermediary devices to cloud service providers, all of which can and will experience their own downtime. When this happens, how do you identify the point of failure? Does the issue come from a network connection dropping data, a service provider outage, or any of the many other possible intermediaries in between? Realistically, it’s often impossible to identify the exact cause of an outage.

    To cover exactly this situation, SLAs define a minimum outage level that must be reached before their “guarantee” kicks in. There are SaaS providers on the market today promising extremely high levels of uptime, but read the fine print: that uptime SLA actually promises only to refund you for downtime that occurs, and you qualify for refunds or credits only once downtime reaches a set percentage of the month or more.

    Such a threshold can amount to over 20 minutes of outage time per month! If that service were down for 19 minutes, you would be unable to process payments or transactions for that time, resulting in lost revenue and damage to your reputation. But because the outage (or the cumulative outages over that one-month billing period) did not add up to a 20-minute threshold, the vendor wouldn’t be responsible for compensating you for any of that downtime, SLA or no SLA.
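
    The threshold logic described above can be sketched in a few lines. The 20-minute default comes from the example in this article; the function name `qualifies_for_credit` and the outage lists are hypothetical, and real SLA terms vary by vendor.

```python
def qualifies_for_credit(outage_minutes, threshold_minutes=20.0):
    """True only if cumulative monthly downtime reaches the credit threshold.

    `threshold_minutes` is an assumed value taken from the article's example,
    not the terms of any specific vendor's SLA.
    """
    return sum(outage_minutes) >= threshold_minutes

# 19 minutes of cumulative downtime in one billing month: no credit.
print(qualifies_for_credit([12, 7]))       # False
# One more 90-second outage pushes the month over the threshold.
print(qualifies_for_credit([12, 7, 1.5]))  # True
```

    Note that the check only decides whether compensation is owed. The 19 minutes of lost transactions happen either way.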

    This is why you need a database where the availability SLA really means high availability. What does this mean, by the numbers?

    High availability that is really highly available

    The high availability benchmark offered by most distributed SQL RDBMSs is 99.99%, or four nines of uptime. That looks like a lot of nines, but over a 365-day year it equates to roughly 52 minutes, 34 seconds of downtime. Can your payments processing system survive that much time offline?

    You still need to read a four-nines SLA closely, though, because for many distributed SQL database solutions, even those four nines of uptime come with an asterisk. As of the date of publication, Amazon Aurora, for example, offers a 99.99% SLA, but only for multi-AZ deployments. If your app is a single-AZ deployment, the SLA drops to three nines of uptime: 99.9%, or roughly 8 hours, 45 minutes, and 36 seconds of downtime per year. It’s a dramatic difference!

    Fortunately, there are at least two distributed SQL databases currently offering 99.999% uptime (roughly 5 minutes, 15 seconds of downtime per year): Google Spanner and CockroachDB.

    Both are globally scalable, synchronously replicated, cloud-native distributed relational databases that offer extremely high uptime SLAs out of the box. Spanner, however, can only run on Google Cloud, which introduces a single point of failure in the event of a serious GCP outage. CockroachDB, on the other hand, is cloud agnostic and can be deployed on any of the major cloud providers (or all of them, in the case of self-hosted multi-cloud deployments), as well as in self-hosted on-premises and hybrid cloud/on-prem installations.
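
    The yearly downtime allowed by each SLA level discussed in this section is simple arithmetic, sketched below. The helper name `downtime_budget` and the 365-day year are assumptions; results shift by a few seconds with a different year length or rounding convention, so they may not match figures quoted elsewhere to the second.

```python
from datetime import timedelta

def downtime_budget(sla_percent, period=timedelta(days=365)):
    """Maximum downtime over `period` that still meets an availability SLA."""
    return period * (1 - sla_percent / 100)

# Three, four, and five nines of uptime.
for sla in (99.9, 99.99, 99.999):
    budget = downtime_budget(sla)
    # Round to whole seconds, then split into hours, minutes, seconds.
    minutes, seconds = divmod(round(budget.total_seconds()), 60)
    hours, minutes = divmod(minutes, 60)
    print(f"{sla}% -> {hours} h {minutes} min {seconds} s per year")
```

    With these assumptions the budgets come out to roughly 8 hours 45 minutes for three nines, about 52 and a half minutes for four nines, and a little over 5 minutes for five nines.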

    High availability by the numbers

    How distributed SQL database availability SLAs stack up (as of June 6, 2024).

    High availability without the asterisk

    A high-scale, distributed, mission-critical system powering payments needs a high availability database for the maximum possible uptime. The best option to deliver five nines of uptime in multi-region deployments, with guaranteed ACID transactions and the ability to run anywhere and everywhere, is CockroachDB.

    The numbers tell the true SLA tale…until those pesky asterisks pop up. CockroachDB is the no-asterisk, distributed SQL database solution for always-on applications and always-accurate payments processing applications and services.

    Learn more about how CockroachDB delivers secure and always-available payment experiences. See how our customer Shipt, a grocery ecommerce company owned by Target, maintains a crucial suite of payment services in our guide, "From Resilience to Scalability: 12 Mission-Critical CockroachDB Use Cases."

    Want a webinar? Watch "How to build a scalable payment system," in which payment expert Andy Kimball (ex-Square), Engineering Operations Lead at Cockroach Labs, joins Cockroach Labs Sales Engineer Jim Hatcher to discuss payment system requirements.
