What is distributed SQL? An evolution of the database

What is distributed SQL? An evolution of the database

SQL for App Devs

Learn SQL fast and focus on what app devs need to know, from schema design to writing transactions that perform at scale.

Start learning

Originally, SQL databases were designed to be system-of-record workhorses, built for data consistency, vertical scalability, and tight integration. They did this very well for the long era of monolithic applications running on dedicated, single-server environments. When the paradigm shifted to distributed applications in the cloud, however, businesses began to find that the relational databases behind some of their most critical applications were holding them back. Why?

The architecture of these original relational databases frequently conflicts with the architectural needs of modern cloud applications: distributed computing, elasticity, and microservices. Now, applications need to scale horizontally across multiple servers or containers, adapt to ephemeral infrastructure, and support rapid deployment cycles. Relational databases, with their fixed schemas, licensing constraints, and lack of support for distributed data models, simply are not suited to the cloud. 

For a time, this was a real conundrum: organizations need the power and reliability of a tested relational data store, such as Oracle, SQL Server, Postgres, and MySQL, but with the benefits of inherent scale and global coverage that comes with the cloud.

Some have turned to NoSQL stores to try to meet these requirements. These alternatives can typically meet the scale requirements but then fall short as a transactional database because they were not designed from the ground up to provide true consistency. Some NoSQL solutions have begun to offer “ACID transactions” but they’re full of caveats and fail at delivering isolation levels necessary for mission-critical workloads like a financial ledger, inventory control, and identity management. For many critical workloads, though, “eventual consistency” does not suffice.

Distributed SQL is a new breed of database

Some of the world’s most successful companies were the first to tackle the problem, creating purpose-built databases able to function at global scale. The most public example of this is Google Cloud Spanner. In 2012, Google published a paper that demonstrated a new way of looking at the relational database, one that was rooted in distributed systems and global scale.

“Spanner is Google’s scalable, multi-version, globally distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions.”
Spanner: Google’s Globally-Distributed Database

There is a lot wrapped up in the description and also a 14-page paper that goes into explicit detail of how they were able to build a consistent AND horizontally scalable database. The paper is pure genius and outlines the foundation of the next evolution of the database: Distributed SQL.

Distributed SQL is a single logical database deployed across multiple physical nodes, whether in a single data center or across many data centers distributed around the globe, that delivers inherently elastic scale and bulletproof resilience.

What is a distributed SQL database?

Several attempts have been made to deliver truly scalable SQL in a distributed environment. Some have tried to retrofit existing databases to meet their needs but this ultimately does not deliver on the promise of a truly distributed SQL database. So then, what makes up a distributed SQL database? There are five core conditions:

1. Scale

A distributed SQL database must seamlessly scale in order to mirror the capabilities of cloud environments without introducing operational complexity. Just as we can scale up compute without heavy lifting, the database should be able to scale as well. This includes an ability to evenly distribute data across multiple distributed participants in the database.

2. Consistency

A distributed SQL database must deliver a high level of isolation in a distributed environment. In a cloud-based world where distributed systems and microservices are the default architectures, transactional consistency becomes difficult when multiple operators may be trying to work on the same data. A true distributed SQL database should mediate contention and deliver the same level of isolation for transactions expected in a single instance database.

3. Resiliency

A distributed SQL database must naturally deliver the highest level of resiliency without any need of external tooling to accomplish this. The cloud presents an always-on environment for our workloads; a database should have the same properties. With a distributed database we can reduce the time it takes to recover from a failure down to near zero and replicate data naturally without any external configuration.

4. Geo-replication

A distributed SQL database should allow for distribution of data throughout a complex, widely dispersed geographic environment. The cloud presents an ability to reach every corner of the globe with an acceptable quality of service and the database should not restrict your applications from doing so. It should perform to meet your expectations

5. SQL

And while these four technical requirements are paramount, there is one key prerequisite above all. The database must speak SQL. It is the language of data and the default for all application logic. We should not have to retrain developers to use the database. They should be able to use the SQL dialect they are already familiar with.

There are a few databases that meet these requirements. The list includes Spanner, of course, but you could also consider Amazon Aurora, Yugabyte, FaunaDB and CockroachDB as members of this new category. All of these members meet the requirements in some form, some better than others.

Noticeably missing from this list are Oracle, Postgres, MySQL and all of the NoSQL options. While each may meet some of the requirements, none of them meet all of the requirements.

Distributed SQL databases must offer data locality

Once you live in a distributed world, it becomes apparent that the database itself could actually take care of domiciling data. With participants located in various regions or data centers, it becomes possible to understand the location of each and then tie the data that it stores to a location. Some application architects have implemented this as part of an application but this approach is error-prone and brittle. Using the database to geo-partition data based on some field in a table is a new requirement for distributed SQL. This allows you to use the database to address data sovereignty concerns. It can also be used to have data follow a user so you can ensure low latency access to their information or to tie data to an explicit cloud so you can minimize egress charges.

Truly distributed SQL databases must be multi-cloud

A unique trait of a distributed SQL database is that it has semi-autonomous units that all participate in a larger system. Each unit should be able to be deployed by itself and then join the larger system, the CockroachDB cluster. This is an inherent trait that fuels the first five requirements listed above. However, this can also be used to extend the database to be truly multi-cloud. The database should not rely on a single network to accomplish distribution. It should be divorced from these limits so that a participant can be located anywhere, from any public cloud, a private cloud and even a single on-premise instance. This requirement is important for the future of compute where we live in a distributed hybrid and multi-cloud world.

Foundational requirements for distributed SQL

While the aforementioned seven requirements are unique to distributed SQL (well, all of them except the SQL thing), it is important to note that it is still a database and, of course, is required to deliver on the baseline requirements to be a database. There are a set of expectations around the following:

  1. Administration: You should be able to easily install and configure the database with a set of command line and graphical based tools. This includes the abilities to control the environment and control data lifecycle for backup/restore. As well as the ability to create tables, define and implement schema, set indexes/partitions and recreate the DDL.
  2. Optimization: The database should allow a DBA to gain insight into the performance of queries and the ability to optimize their execution. This includes such advanced features as a cost-based optimizer, which in a distributed world becomes a tricky and new concept.
  3. Security: As with any enterprise software package, security is critical and the database should provide the key AAA capabilities of authentication, authorization, and accountability. It should not stand on its own and should integrate with a central source of truth for identity management and governance so that consistent policy can be set for the data (at the table, row and column levels) it contains.
  4. Integration: A database does not function alone and should integrate with your existing applications using well tested or known drivers. It should integrate well with existing ORMs and also provide the ability for bulk ingest or export. It should also provide key capabilities that allow it to work with ETL tools and change data capture capabilities to integrate with more advanced services such as streaming analytics or cloud storage.

These “foundational” requirements are critical and signal a more mature, enterprise-ready database. They may not be the most exciting features but are critical for adoption and success within any project.

How to evaluate distributed SQL databases

CockroachDB is a wonderful option for your cloud-native distributed SQL database. It has helped hundreds of organizations transition both the very most mundane workloads and some of the most mission-critical to the cloud. It has been the foundation of a cloud-native strategy within more advanced orchestrated environments. We are quite proud of what we have built.

We are also advocates of this emerging category and believe that distributed SQL is a proper evolution of the database and the future of the way we manage data in the cloud. To this end, we feel strongly that our solution and every other should be held to the highest regard when it comes to these core requirements:

  1. Scale
  2. Consistency
  3. Resilience
  4. Geo-replication
  5. SQL
  6. Locality
  7. Multi-cloud
  8. Administration
  9. Optimization
  10. Security
  11. Integration

A database that meets all of these requirements has sufficiently matured to be trusted for the mission-critical (and event not-so-mission-critical) applications. And some of these requirements are not simple. These are advanced topics that take time to get right.

When you discuss these items with your vendor, we encourage you to go deep on concepts like consistency and locality. While everyone has read the same papers, ultimately it comes down to implementation and (more importantly) production use. Because it is only in production where the complex corner cases and issues can be identified and fixed. So, I guess then, the final (and 12th) requirement is maturity.

Michelle Gienow github link linkedin link

Michelle Gienow is a recovering journalist turned front end developer based in Baltimore, MD. She creates content around her central obsessions: Jamstack, distributed architecture and developing a cloud native mindset.

Keep Reading

How we built a serverless SQL database

We recently announced general availability (GA) for Serverless, with support for change data capture (CDC), backup and …

Read more
What is a UUID, and what is it used for?

When working with a database, it’s common practice to use some kind of id field to provide a unique identifier for each …

Read more