As organizations migrate to the cloud, they need a cloud-native, relational database to help them move all their applications to this new environment.
Over the last ten years, the infrastructure that runs our applications has fundamentally changed. As we move to the cloud, we now have to think about managing workloads in an environment where we don’t have tight control over the infrastructure that hosts our applications. Questions like “how can I recover my data?” and “what happens if my instance fails?” have very different answers when you host applications in the cloud. And we’re doing it at a scale that was once only considered by a handful of companies worldwide.
Largely, these changes have been positive. Launching a new technology no longer requires months of planning, procuring, and commissioning hardware. Our individual apps have become a combination of micro-services that work together and can be set up to scale dynamically as traffic demands. We can try out something new without having to convince management that it’s worth a capital investment. However, when it comes to data, it hasn’t been easy to make the transition.
The large companies that were first to move to the cloud made the difficult decision to sacrifice functionality in order to create resilient, scalable databases. NoSQL databases like Apache Cassandra(™) are an example of this - extremely performant, scalable, and resilient to failure. But to make that possible, these NoSQL databases are limited to very simple query operations and to a model that can guarantee consistency only in the simplest of cases. To work around that lack of functionality, those “micro”-services tend to grow in scope and complexity because they have to implement data management logic in the service layer.
If you’re rewriting your app from the ground up, and you have an army of developers who are familiar with complex data management principles, that might be okay, but it could also prove costly. Worse yet, data has gravity - it’s hard to move, so you need a solution that allows you to transition to and from an on-premises environment, as well as between clouds. If you wait until you have to move to think about this, your data may be too heavy to lift.
So, what exactly do you need to bring the next evolution of the transactional database into the cloud? Number one is full SQL support - it’s the lingua franca of databases everywhere, and the way we’ve all been trained to ask questions of our data. From there, we need to guarantee that data will always be in a reliable and correct state, no matter the circumstance - that means full ACID compliance. Beyond that, we need to be able to scale the database without adding operational complexity. We also need to keep in mind where that data should live - for example, European data needs to stay in the EU, and information about someone who lives in NYC should primarily be located on the east coast for easy access. And since we can have data spread across multiple regions (and even all over the globe), geo-locating data to where it needs to be accessed or written is also critical for optimal performance. We need all of this while staying prepared for nodes, availability zones, or even entire regions from a cloud provider going down at times for reasons outside of our control.
Full SQL support is the first thing that we lost as we shifted data management to the cloud. To make it easier for people who were developing these first cloud-native data management systems, we reverted to earlier design patterns that are easier to scale - namely, what we now call key-value stores. This pattern uses a “key” to file a piece of data away and then retrieve it using that same key in some predictable fashion. This makes it easy to scale because based on the key I can predict which machine I squirreled the data away on, but it makes it difficult to answer higher-order questions with that data. That logic has to get applied at the application level. Which means every time I need to answer a new question, I have to write the code that can do it - something I haven’t had to do since the late 1970s.
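To make that trade-off concrete, here is a minimal sketch of the key-value pattern. It assumes a hypothetical three-node cluster modeled as in-memory dictionaries; the names `node_for`, `put`, `get`, and `total_spend_by_city` are mine for illustration and are not taken from any particular database. Lookups by key go straight to one predictable node, but a question like “how much did customers in NYC spend?” has to be written by hand and has to visit every node.

```python
import hashlib

NODES = 3  # hypothetical cluster size, chosen only for illustration

# One dictionary per "machine" stands in for a real storage node.
cluster = [dict() for _ in range(NODES)]


def node_for(key: str) -> int:
    """Deterministically map a key to a node, so reads know where to look."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NODES


def put(key: str, value: dict) -> None:
    cluster[node_for(key)][key] = value


def get(key: str):
    # Only the one node that owns this key is consulted.
    return cluster[node_for(key)].get(key)


def total_spend_by_city(city: str) -> int:
    """A 'higher-order question': no key helps, so scan every node in app code."""
    return sum(
        order["amount"]
        for node in cluster
        for order in node.values()
        if order["city"] == city
    )


put("order:1", {"city": "NYC", "amount": 30})
put("order:2", {"city": "Berlin", "amount": 45})
put("order:3", {"city": "NYC", "amount": 25})

nyc_total = total_spend_by_city("NYC")
```

With SQL, that last function would be a one-line `SELECT SUM(amount) ... WHERE city = 'NYC'`, and the database, not my application, would decide how to execute it.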
Databases like Apache Cassandra(™) try to bridge the gap by bringing some SQL-like syntax to the mix, but you run into a problem. It “feels” like SQL, but it doesn’t act like SQL. And SQL is what you need to take an application that includes business logic, but no data management logic, and migrate it to the cloud. Without that SQL layer, you’re engaging in a costly rewrite of your application that in many cases isn’t worth the investment.
Over the past 40 years, we’ve loosened the ACID standard to get better performance on single-node relational databases. Largely, we’ve eased up on the consistency (different copies of the same data are always the same) and isolation (multiple queries aren’t allowed to mutate or read the same data at the same time) requirements to allow databases to process more queries at once. We could get away with this in most cases because individual queries run extremely quickly, and local replication of data is blisteringly fast on a single node (or in some cases in a single datacenter). But as we try to move those same systems to the cloud and distribute them across environments, the time windows in which a system is vulnerable to those worst-case scenarios increase from microseconds to seconds (or even in some cases, minutes), all because you might have multiple servers trying to field the same queries.
NoSQL databases handle this by simply not guaranteeing that you’ll get consistent results - leaving your application to handle cases of dirty reads and conflicting writes. Once again, this is logic you now have to write in your application, complicating your migration. A true RDBMS handles these cases for you with full, ACID-compliant transactions. This allows you to have the confidence in your data that you need to run your business - financial records and all.
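As a small illustration of what “handles these cases for you” means, the sketch below uses Python’s built-in `sqlite3` module as a stand-in. SQLite is a single-node, embedded database, so this only demonstrates the atomicity side of a transaction, not distributed behavior; the `accounts` table and the `transfer` helper are hypothetical names I made up for the example. A transfer that would overdraw an account is rejected by a `CHECK` constraint, and the database undoes the half-finished work without any cleanup logic in the application.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " id TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint fired; the credit to dst was rolled back too.
        return False


def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))


failed = transfer(conn, "alice", "bob", 500)  # alice only has 100
succeeded = transfer(conn, "alice", "bob", 30)
```

Notice that the failed transfer credits bob before the debit fails, yet bob’s balance is untouched afterwards. Without transactions, detecting and reversing that partial write would be my application’s job.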
A true, cloud-native solution should take advantage of the inherent strengths of the cloud - namely, the ability to scale, not just in failure scenarios but under different load scenarios as well. If you’ve ever tried to scale a traditional RDBMS, you know it’s a complicated endeavor. In the best case, you spend an enormous amount of time and effort manually sharding your data, or you rely on some form of read-ahead cache that isn’t guaranteed to be consistent with the master database. And while the best NoSQL solutions like MongoDB(™) and Apache Cassandra(™) can definitely be considered scale-out systems, both are difficult to scale back if you decide you no longer need some of that capacity.
On top of that, when your data isn’t guaranteed to be consistent, as is the case with NoSQL solutions, you don’t know if the node being removed from your cluster is the only one with the latest version of a record. That not only impacts scale, but it also makes your writes less durable in cases of failure. Losing a single database node should never cause data to be lost. A well-designed cloud-native environment should never lose data under any failure scenario. But it’s easy to imagine a case where that could happen with a NoSQL database because we simply don’t know the state of the local data.
Moving your data to the cloud doesn’t mean it can live just anywhere. Not only do regulations like GDPR come into effect when talking about customer data, but from a purely practical standpoint, it’s important to keep data close to where it’s being accessed. That means being able to logically define rules for where a record should and shouldn’t live. And this shouldn’t stop at the object level: the database should be able to tie individual records to a location while still treating the entire corpus as a single table, keeping that complexity out of our applications.
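Here is a rough sketch of what such a placement rule could look like, written as plain Python rather than as any real database’s configuration. Everything in it is an assumption made for illustration: the region names, the deliberately partial `EU_COUNTRIES` set, and the `home_region`, `insert`, and `select_all` helpers. The point is only that each record is pinned to a region by a rule, while the reader still sees one logical table.

```python
from collections import defaultdict

# Illustrative and deliberately incomplete; a real rule set would be exhaustive.
EU_COUNTRIES = {"DE", "FR", "IE"}

# Region name -> rows stored there. Each list stands in for a regional cluster.
regions = defaultdict(list)


def home_region(customer: dict) -> str:
    """Decide where a record must live based on the record itself."""
    if customer["country"] in EU_COUNTRIES:
        return "eu-west"  # European data stays in the EU
    if customer["country"] == "US" and customer.get("state") == "NY":
        return "us-east"  # keep NYC customers close to the east coast
    return "us-west"  # hypothetical default for everything else


def insert(customer: dict) -> None:
    regions[home_region(customer)].append(customer)


def select_all() -> list:
    """The application still sees a single table, wherever the rows live."""
    return [row for rows in regions.values() for row in rows]


insert({"name": "Anna", "country": "DE"})
insert({"name": "Sam", "country": "US", "state": "NY"})
insert({"name": "Lee", "country": "US", "state": "CA"})
```

In this toy version my application owns the rule. The argument above is that the database should own it, so that the same declaration drives both where rows are stored and how queries are routed.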
Luckily, there have been technological advancements over the past decade that finally allow you to bring that rich set of RDBMS functionality to this cloud-native environment. CockroachDB is the culmination of that research - a transactional, scalable, cloud-native relational database with full SQL support and a consistency model that guarantees you will always get the correct answer. CockroachDB is the last piece needed to help the next wave of applications transition to the cloud.