How to leverage geo-partitioning

Introduction

As we’ve written about previously, geographically distributed databases like CockroachDB offer a number of benefits including reliability, security, and cost-effective deployments. We believe you shouldn’t have to sacrifice these upsides to realize impressive throughput and low latencies. That’s why we created geo-partitioning.

This blog post defines two new features, geo-partitioning and archival-partitioning, as well as explains when you might want to leverage these features. We previously provided a sneak-peak walkthrough of geo-partitioning that can be found here.

Geo-partitioning & Archival Partitioning Defined

Geo-partitioning allows you to keep user data close to the user, which reduces the distance that the data needs to travel, thereby reducing latency and improving user experience. To geo-partition a table, you should define location-based partitions while creating a table, create location-specific zone configurations, and apply the zone configurations to the corresponding partitions.

Archival-partitioning allows you to store infrequently-accessed data on slower and cheaper storage. To archival-partition a table, you should define frequency-based partitions while creating a table, creating frequency-specific zone configurations with appropriate storage devices constraints, and applying the zone configurations to the corresponding partitions.

For a more detailed description of both geo-partitioning and archival-partitioning please consult our documentation.

Why should you use geo-partitioning or archival partitioning?

Reduce latency

Companies increasingly insist on low latencies to minimize delays which cause a direct decline in revenue. For example, 100ms of latency costs Amazon 1% in sales and an extra 500 ms in search page generation time dropped Google’s traffic by 20%.

With geo-partitioning, developers can use what they know about a user’s location to keep data close to them while taking advantage of CockroachDB’s durability guarantees of consistency in the face of machine (or even datacenter) failure. Keeping data close to users reduces the distance data needs to travel, thereby reducing latency and improving end-user experience.

Governments continue to introduce new and increasingly tough regulations that put significant financial and organizational pressure on companies. Regulations like the EU GDPR and China’s Cyber Security Law impose restrictions on where data can reside, which undermine some of the core operational and economic benefits of the cloud.

Geo-partitioning restricts the location of data to specific regions (i.e., data-domicile), such as EU countries. Regulations, in some cases (e.g., China), directly require data domiciling. In other cases (e.g., GDPR) restrictions on data location aim to provide data privacy as a fundamental right and allowing companies greater control for managing user data.

Save money with colder storage

Keeping all corporate data in the same storage class is costly and inefficient. This is especially pernicious when the benefits of a solid state drives are not needed for data that’s infrequently accessed or only being preserved for compliance purposes.

By using archival-partitioning data based on storage time, developers can use CockroachDB to move certain types of data to colder, slower, and cheaper storage.

How do other databases use partitioning?

At this point, you may be wondering, hasn’t some form of partitioning always existed in other databases? How is CRDB different?

CockroachDB automatically shards data

We built CockroachDB from the ground up as a distributed database. This means that some of the partitioning use cases you may be familiar with in other DBMSs are solved by superior features within CockroachDB. CockroachDB keeps your data in 64MB ranges that the system constantly keeps balanced in response to machine downtime and changes in load. Instead of manually sharding your data into partitions when it's too big or too high traffic to fit on one machine, we transparently (and automatically) shard all data (while allowing you to retain control over where it lives to maintain optimum performance). In the same vein, a simple replication zone metadata change moves data from one machine to another (such as when moving data from high-end storage tier to low-cost storage tier).

CRDB’s SQL planner automatically tunes performance

Other databases often use partitioning for performance tuning. "Partition pruning" will avoid sending requests to machines containing data that couldn't be in a partition (SELECT … FROM table WHERE date > '2018-01-01'). Our SQL planner does this for every query, whether the tables are partitioned or not, only contacting the relevant 64MB ranges.

We only contact the shards that are relevant for each query, addressing the smallest amount of potentially relevant data. Other databases use partitioning to exclude ranges of data that are not germane to the query in question; if a partition is a-b and the query >c, it will avoid the a-b.

Management/maintenance of tables

Partitioning can also be used to complete rolling upgrades or schema changes. By contrast, CockroachDB doesn’t need to use partitioning for this feature because we already have online schema changes. Online schema changes allow DBAs to run database changes without interrupting mission critical workloads.

Bulk load with import CSV

Finally, some databases use partitioning for bulk loading to avoid interrupting business as usual. CockroachDB doesn’t need to use partitioning for bulk import because CRDB’s architecture and automatic replication make it easy to bulk import data using the Import statement.

Geo-partitioning demo

To get a better understanding of how geo-partitioning works, you should check out the "Geo Replication" section of this blog post (though the whole thing is worth a read).

Future work + enhancements

We don’t yet support using partitioning to import data to or bulk delete data from an existing table. We view these as logical enhancements of our current work and we plan to support them in the future.

We also aspire to provide a set of features aimed at making regulatory compliance even easier for our users to follow. This includes allowing you to query only from partitioned tables based upon the regulatory conditions faced by their business.

We built CockroachDB as a high-performance distributed database. We hope that by using features like geo-partitioning and archival-partitioning we will continue to “make data easy” for you.

To learn more and try it out yourself, click here.

Illustration by Lea Heinrich