Businesses accept higher customer and economic risk when they live with disaster recovery plans that don’t meet a zero recovery point objective. With new advances in technology and products like CockroachDB, businesses no longer need to choose higher risk in order to manage complexity.
Why RPO and RTO Are So Important to Business Success
Businesses care deeply about ensuring the resiliency and uptime of applications. Any downtime could directly impact top-line revenue, hurt brand perception, and divert valuable resource hours to failure recovery processes. As a result, CEOs, CIOs, and top-level technology executives are focused on meeting application uptime goals and minimizing the cost of infrastructure-level failures. For DBA teams, this means defining and meeting recovery point objectives (RPOs) and recovery time objectives (RTOs) for different tiers of applications.
A Recovery Point Objective (RPO) marks how much data can be lost when a failure occurs. A non-zero RPO means that any transactions committed between the last recoverable point and the time of failure could be lost. A Recovery Time Objective (RTO) defines how much time it should take to recover from a failure. A non-zero RTO results directly in application downtime. On an ecommerce website, for example, non-zero objectives could mean losing minutes (if not hours) of customer transactions, resulting in lost revenue.
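To make the two definitions concrete, here is a minimal sketch that measures an incident against its objectives. The function names and the sample timestamps are hypothetical, chosen only for illustration; they are not taken from any product.

```python
from datetime import datetime, timedelta


def data_loss_window(last_recoverable: datetime, failure: datetime) -> timedelta:
    """Span of committed work that is lost: everything after the last recoverable point."""
    return max(failure - last_recoverable, timedelta(0))


def downtime(failure: datetime, service_restored: datetime) -> timedelta:
    """Span during which the application could not serve traffic."""
    return max(service_restored - failure, timedelta(0))


def meets_objectives(loss: timedelta, outage: timedelta,
                     rpo: timedelta, rto: timedelta) -> bool:
    """An incident meets its targets only if both the RPO and the RTO hold."""
    return loss <= rpo and outage <= rto


# Hypothetical incident: last usable copy is 15 minutes old, recovery takes 40 minutes.
failure = datetime(2024, 1, 1, 12, 0, 0)
loss = data_loss_window(failure - timedelta(minutes=15), failure)
outage = downtime(failure, failure + timedelta(minutes=40))

print(loss, outage)
print(meets_objectives(loss, outage, rpo=timedelta(0), rto=timedelta(minutes=5)))
```

In this made-up incident both objectives are missed: fifteen minutes of transactions are gone and the application was down for forty minutes.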
It’s important to note that RPO and RTO work in tandem. In the case of a fast recovery with non-zero RPO, businesses have to either try to manually reconcile their accounts or live with data loss. On the other hand, a zero RPO solution that recovers slowly would result in significant application downtime. An optimal disaster recovery plan needs to take into account the required RPO and RTO for a given application.
For mission critical applications, businesses need to get as close to zero RPO and RTO as possible to minimize the overall risk to both the business and their customers. An application that handles financial transactions with a non-zero RPO could lose deposits or transactions. A reservation system could lose customer reservations. Even worse, losing patient data in real-time healthcare systems could directly impact patient safety.
Businesses Settle for Non-Zero RPO and RTO
Meeting zero RPO and RTO is incredibly complicated. Several architectural layers contribute to RPO and RTO including database systems, clustering technology, data replication solutions, and storage replication. Each layer is a separate product that must be integrated, configured, and set up by the customer. This means that each layer needs a team of experts to set up, manage, and maintain the system. Often, the combination of products and the way the products are configured become unique to a customer.
This entire discussion assumes that existing databases can actually achieve zero RPO and low RTO, but in many cases, this is not true. Active-active setups are supposed to continue serving traffic without data loss in the case of datacenter-level failures, but in practice, messages can be lost in transit between datacenters. Further, they rely on timely detection of the failure to trigger recovery (see figure 1). Standby setups have a similar problem with lost messages during detection of failure and recovery.
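The effect of messages lost in transit can be sketched with a toy model. It assumes, purely for illustration, that the primary acknowledges a commit immediately and ships it to the other datacenter with a fixed delay; `lost_commits` and its parameters are hypothetical names, and real replication pipelines are considerably more involved.

```python
def lost_commits(commit_times, replication_lag, failure_time):
    """Commits acknowledged before the failure that had not yet reached the other site."""
    return [
        t for t in commit_times
        if t <= failure_time and t + replication_lag > failure_time
    ]


# Times in seconds; the primary datacenter fails at t = 3.0.
commits = [0.0, 1.0, 2.0, 2.8, 2.95]
print(lost_commits(commits, replication_lag=0.25, failure_time=3.0))
```

Under this model, any commit acknowledged within one replication delay of the failure is missing on the surviving side, which is exactly a non-zero RPO.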
In contrast to active-active and standby setups, NoSQL solutions can run on more than two servers, providing higher availability and scalability. NoSQL has built-in replication, which means that businesses don’t need a separate solution to support replication or clustering. However, NoSQL comes with its own hidden cost. Although it can survive failures, eventual consistency means that stale data and split-brain situations could occur, leaving DBAs with inconsistent data that they have to reconcile. Even though downtime is reduced, the data contained in the database may be stale or incorrect. Further, recovery from disaster scenarios in NoSQL can sometimes take days, since data needs to be bootstrapped and repaired to ensure that it is usable and up to date.
CockroachDB Provides Zero RPO
CockroachDB makes meeting required SLA targets viable and cost-effective by wrapping the complexity of building a highly resilient infrastructure into a single product. It reduces the component complexity of IT resilience by 75% by eliminating the need for separate replication, clustering, and storage solutions in order to achieve fault tolerance. Instead, everything comes built into the database system software, reducing the cost and complexity associated with purchasing, deploying, and managing multiple solutions from multiple vendors. Unlike NoSQL, it provides ACID guarantees through consensus-based replication, so that data is always consistent and committed transactions are guaranteed to persist. CockroachDB can also be deployed on commodity hardware, since it has built-in resiliency for storage-level failures at the software layer. A more detailed description of each of the layers is available here.
Under the covers, CockroachDB intelligently replicates data across the cluster, spreading copies out across different availability zones to provide the highest level of fault tolerance based on the available infrastructure. This means that for any hardware failure, ranging from disk-level to datacenter-level disasters, CockroachDB can continue to serve client traffic while recovery takes place. CockroachDB is also architected to support an average RTO of 4.5 seconds. It is important to note that this includes both the time it takes to detect a failure and the time it takes to recover from it. No other database vendor can provide these guarantees along with the ease of use and operational simplicity of CockroachDB.
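The reasoning behind spreading copies across availability zones can be illustrated with generic majority-quorum arithmetic, the rule that consensus-based replication relies on. This is a minimal sketch, not CockroachDB's placement logic; the function names and zone labels are hypothetical.

```python
from collections import Counter


def quorum_size(replicas: int) -> int:
    """Smallest majority of a replica group."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Replicas that can be lost while a majority still remains."""
    return replicas - quorum_size(replicas)


def survives_zone_loss(placement: list[str]) -> bool:
    """True if losing any single zone still leaves a majority of replicas."""
    if not placement:
        return False
    needed = quorum_size(len(placement))
    per_zone = Counter(placement)
    return all(len(placement) - count >= needed for count in per_zone.values())


print(tolerated_failures(3), tolerated_failures(5))
print(survives_zone_loss(["zone-a", "zone-b", "zone-c"]))
print(survives_zone_loss(["zone-a", "zone-a", "zone-b"]))
```

Three replicas in three zones keep a majority after any single zone outage, while placing two of the three replicas in one zone does not. A write acknowledged by a majority is still present on at least one survivor, which is what makes a zero RPO possible under this model.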
Businesses Should Stop Accepting Anything Less than Zero RPO
IT leaders tasked with the difficult mission of shipping products faster while managing cost and risk have historically had to make trade-offs between protecting their data and the cost of doing so. With new database technologies like CockroachDB, IT leaders are empowered to make zero RPO a baseline requirement for all core business applications given the high cost and risk associated with data loss. Finally, IT leaders can reduce the complexity of their data architectures while reducing risk, freeing them up to build reliable and innovative products quickly.
Illustration by Tsjisse Talsma