In any database, choosing which fields of a table to index is one of the most important jobs of the database designer. This is especially true for a distributed database like CockroachDB, where the cost of a poorly-chosen index can be higher. In this post I’ll discuss some guidelines for choosing good keys and pitfalls to avoid.
3 Basic Rules for Choosing Indexes
There are a few basic rules to keep in mind when choosing indexes for a database. A good index should have these three properties:
- Usefulness: Speed up the execution of some queries (or enforce a constraint)
- Clustering: Keep records that are likely to be accessed together near each other
- Scattering: Keep records that are unlikely to be accessed together far apart
The first rule is simply a reminder that indexes aren’t free, and if it’s not helping the application somehow you’re better off without it. Each additional index makes all writes to the table slower, but they can make some reads much faster. The best case for write performance is a table with a primary key but no secondary indexes. The first secondary index has a high cost because it means any insert to the table becomes a distributed transaction, but the cost of each additional secondary index is smaller. Note that it’s never helpful to remove the primary key–CockroachDB will create a hidden primary key for any table that doesn’t have one, and this hidden key cannot be used for by any queries, so it’s always better to have a real primary key that you choose. Do not let this scare you away from secondary indexes, however. They’re not free, but they have such a transformative impact on read performance that it’s nearly always worthwhile to ensure that every query you run has a suitable index.
The second rule, clustering, is a little more subtle. When an application needs to load multiple records at once (which could happen, for example, due to a
JOIN or the use of the
IN operator), it’s best for performance if those records are near each other. Originally, this advice came about to minimize the number of seeks that must be performed on spinning HDDs. In a distributed database like CockroachDB, the same guideline serves to minimize the number of network operations to access data remotely. For instance, in a social network news feed, most page views only need data from the current day, so organizing the data by time may provide the best data clustering and cache efficiency (or maybe not, as we’ll see below). CockroachDB offers a few SQL extensions that can further improve data clustering, including interleaved tables and storing indexes.
The third rule, scattering, is in some sense the converse of the second: when similar records are near each other, different records naturally have to go somewhere else. However, it’s not always the case that improving clustering also increases scattering. In the social network news feed example, organizing records by time maximizes clustering, but it also creates a hotspot because all the posts happening right now are trying to write to the same place. This is a severe limitation on the application’s ability to scale – if a hotspot like this exists, it’s not necessarily possible to serve more users by adding more nodes to a CockroachDB cluster. In practice, clustering and scattering are more often in tension with each other than they are mutually reinforcing.
Options for Selecting Unique IDs
If a table doesn’t have a natural primary key, you’ll probably want to synthesize some sort of unique identifier for each record. For this, you have a few options:
The simplest (but often the least performant) way to generate unique IDs is to start with 1 and count up. This is what you get from the
SEQUENCE feature in CockroachDB and PostgreSQL, or the
AUTO_INCREMENT keyword in MySQL. This produces IDs that roughly correspond to insertion order, and many users like the fact that the IDs produced are small integers (on the other hand, sequential IDs can give away the details of how many users/photos/etc your application has).
Unfortunately, sequential IDs are not ideal for distributed databases. The sequence becomes a bottleneck that all insertions must wait for, so throughput is limited by the nodes responsible for the sequence counter, and adding more nodes to the cluster won’t necessarily improve performance.
Note: Sequences are non-transactional
Why do IDs only roughly correspond to insertion order? (This is true in most databases, not just CockroachDB) Because when a transaction inserts a record, it gets an ID when it reaches the INSERT statement, but that record doesn’t become visible to other readers until the transaction COMMITs. That means that it’s possible to see records appear to be out of order, like this:
- Transaction A inserts a record which is assigned ID 1
- Transaction B inserts a record which is assigned ID 2
- Transaction B commits
- Transaction C reads from the table and sees record 2
- Transaction A commits
- Transaction C reads from the table and sees records 1 and 2
If your application requires IDs to strictly correspond to insertion order, you can do something like
INSERT INTO tbl (id, …) VALUES ((SELECT max(id)+1 FROM tbl), …). However, this has a very high performance cost since it makes all insert transactions wait for their turn to insert the next ID, so only do this if your application really requires strict ID ordering. Using change data capture (CDC) can help avoid the requirement for strict ID ordering in many applications, letting you use higher-performance ID strategies.
Timestamps are roughly ordered and nearly unique (and collisions can be handled by adding random bits or using UNIQUE constraints in the database). They’re more scalable than sequences since they don’t require a single-key bottleneck to maintain the sequence counter, and therefore they’re a good fit for distributed databases.
CockroachDB uses timestamps plus random bits as the default for the SERIAL column type and the
unique_rowid() function. When insertion-order clustering is more important than scattering for your application, we recommend timestamp-based IDs.
The third major option for ID generation is to use large random numbers, usually via the 128-bit UUID type. As you might expect, random IDs maximize scattering but don’t give any clustering. Random IDs usually give the best raw INSERT performance in CockroachDB because they allow all the nodes in the cluster to be fully utilized, although the lack of clustering can hurt the performance of some queries.
Even though timestamps avoid the worst bottlenecks of sequential IDs, they still tend to create a bottleneck because all insertions are happening at around the current time, so only a small number of nodes are able to participate in handling these writes. If you need more write throughput than timestamp IDs offer but more clustering than random UUIDs, you can use sharded keys to spread the load out across the cluster and reduce hotspots.
There are many variations on the idea of sharded keys, but for this article we’ll focus on a composite key that combines the real key with a shard ID derived from a hash. This is a static sharding scheme, so we’ll need to choose a number of shards in advance. This number should be somewhat higher than the number of nodes in your cluster (or the number of nodes you expect to grow into in the near future). For this example, we’ll use 16 shards, and represent it as a string containing a single hex character (leaving room for expansion in the future).
Consider this table containing posts on a social media site:
CREATE TABLE posts ( id SERIAL PRIMARY KEY, author_id INT8, ts TIMESTAMP, content TEXT, INDEX (author_id, ts));
The SERIAL primary key is a write bottleneck: it forces all inserts to this table to go to a small number of ranges. We’d like to change this to allow more write throughput, but going all the way to UUIDs could hurt the performance of our most important read query (which looks something like
SELECT * FROM posts WHERE author_id IN ? ORDER BY ts DESC LIMIT 50) by reducing clustering. Instead, we’ll shard the primary key 16 ways, which will let us use up to 16 times as many nodes for increased write throughput.
The revised table definition looks like this:
CREATE TABLE posts ( shard STRING AS (substr(sha256(id::string), 64)) STORED, id SERIAL, author_id INT8, ts TIMESTAMP, content TEXT, PRIMARY KEY (shard, id), INDEX (author_id, ts));
substr(sha256(id::string), 64)) expression gives us a shard key that A) has 16 distinct values (one hex digit), B) is uniformly distributed, and C) is easily and deterministically computed from the generated id (it’s a little more expensive than necessary, but it’s unlikely to be a big deal in practice and using well-known functions like
sha256 makes it easy to reproduce this computation in other languages if needed).
Note that only the primary key changed – the index by author and time stayed the same because one user can’t post so often that the per-author index becomes a bottleneck. In other applications, it might be necessary to apply sharding to secondary indexes as well.
Our table is now more complicated: how do we use this new
shard column? The
STORED column feature means that this column’s value will be generated automatically when the row is inserted or updated, so we don’t need to change our
INSERT statements. However, any query that specifies a value for the
id column must now specify a
shard as well. This includes updates, deletes, and queries by id (but not queries by other indexes like the (author_id, ts) index). The shard should be computed as a function of the
id column. You can do this however you want, in SQL or in your application code, but the important thing is to be consistent. Queries that used to be
SELECT * FROM posts WHERE id=? become
SELECT * FROM posts WHERE id=? AND shard=substr(sha256(?::string), 64) (or you could save the shard value from a previous access to this row.
The trickiest thing about sharding is when we want to do a query that touches all shards. For example, in the original schema, it was easy to generate a “firehose” of recent posts with
SELECT * FROM posts ORDER BY id DESC LIMIT 10. This is a very expensive query on our new schema because it requires a sort of the entire table! Instead, we must ask the database to search each shard separately:
SELECT * FROM posts WHERE shard IN ('0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f') ORDER BY id DESC LIMIT 10.
This is admittedly tricky and intrudes into application code. We’re exploring ways to make this process more automatic. In our 19.2 release, we’ll have a feature that can infer
IN clauses like the one above automatically from
CHECK constraints on the schema, and we may introduce ways to abbreviate some of the tricks shown here.
We’ve focused on timestamps here, but the principles of key clustering and scattering apply to other domains too. For example, a web crawler may be concerned about the distribution of URLs, and may want to use a similar sharding scheme to the one described here.
Interleaved tables (a CockroachDB extension to SQL) change the cost/benefit analysis of the different key options. A top-level table with interleaved children cannot perform scans that take advantage of ordered keys, while the children inherit any scattering from their parents’ keys. Therefore, when using interleaving, it’s usually best to use random (UUID) keys on the top-level table, and ordered keys (SERIAL or timestamp) on all child tables.
If we reconsider our social network table above and interleave the
posts table into a
users table, the schema could look like this:
CREATE TABLE users ( id UUID PRIMARY KEY, name STRING, ); CREATE TABLE posts ( author_id UUID, ts TIMESTAMP, content TEXT, PRIMARY KEY (author_id, ts), FOREIGN KEY (author_id) REFERENCES users (id) ON DELETE CASCADE, ) INTERLEAVE IN PARENT users (author_id);
Compared to the sharded schema above, the interleaved version would be more efficient for reading a single user’s posts, but less efficient for gathering posts from a large number of users.
If optimizing indexes for distributed environments makes your brain tingle, come join the Cockroach Labs team!