Primary Keys in Legacy SQL Databases vs CockroachDB
The common pattern for primary keys in legacy SQL databases like MySQL or PostgreSQL includes a primary key that's just a simple ID column. This ID column, is just a sequence of integers. So, the primary key is a sequence of integers that are just increasing monotonically.
For example if you have a table that stores names of users in an application and new users need to be inserted into the table the new rows would have have numerically sequential IDs of five, six, seven, et cetera. Now, in a traditional database, this is a common pattern and completely fine. However, this pattern is not sufficient in CockroachDB.
Let's talk a little bit about why that is. So, when you're thinking about CockroachDB, one thing to keep in mind is all of the primary keys on a table are actually stored in a sorted order. The first row here, with an ID of one, comes before the next row with an ID of two, et cetera. So, when I'm inserting new data into this table, what's actually happening is all of this data is being appended to the end. So you can imagine, as I'm inserting new rows, this is all being added to the end of the table. This also interacts with another architectural consideration in Cockroach, which is "ranges".
Now, what are ranges? Inside of Cockroach, we actually chunk your data so that we can replicate it and then distribute it throughout your cluster. The way we do that is we chunk data into 64 megabyte ranges. Each range is 64 megabytes maximum. When you're inserting data into a table, what happens when you hit this max is it splits into two ranges. So you can imagine, as you continue inserting data into this table, you're going to continue increasing the number of ranges in this table.
Another concept that's important to keep in mind is for a given range, you'll have three replicas, however, only one of those replicas is the leaseholder replica, and the leaseholder replica is actually coordinating all of the writes and reads for that particular range. This becomes interesting when you combine it with the fact that we're storing all of your primary keys in a sorted order. So, when you're inserting into a table, by just increasing this integer ID column and adding rows to the end, what's actually happening is you're inserting all of your data into a single range and that range has a single leaseholder which only lives on a single node. The issue with this structure is that all of your inserts are going into a single point of failure, essentially, a single bottleneck. You want to avoid this, as much as, possible.
The standard best practice for handling this, is to switch from using integers to using universally unique identifiers. When you use UUIDs, you're generating random primary keys and, therefore, when you insert data into this table, it's actually hitting all of these ranges, roughly equally. And because of that, you're actually able to take advantage of the parallelism on this table by hitting all of the leaseholders equally and distributing your requests throughout your entire cluster instead of concentrating them in a hotspot on a single node.
That was just a little bit about primary keys in CockroachDB. I hope that's useful if you're considering transitioning from a legacy database. If you have additional questions, join our community slack to chat with other CockroachDB users.