Index selection in CockroachDB

In an earlier post we discussed how CockroachDB maps SQL table data and table indexes to key-value storage. In this post, we will go over some of the factors involved in choosing the best index to use for running a certain query.

Introduction to indexes

Tables are internally organized according to a certain column (or group of columns); this makes searching for rows according to values in that column or column group very efficient, even if the table contains a large number of rows. But what if sometimes we need to search according to values for a different column? To find such results the system would have to go through all the rows in the table, which can be painfully slow for large tables. We can avoid this by telling the system to maintain an index: an additional structure which organizes the data according to a different column (or group of columns).

We illustrate this with a simple example. Consider that we are maintaining a database of songs. We may have a Songs table as follows:

CREATE TABLE Songs ( ID INT PRIMARY KEY, Title STRING, Artist STRING, Album STRING, Year INT )

The ID is a number used to uniquely identify a certain song. The rows of this table are internally stored in order according to the ID, which allows us to retrieve the details of a certain song very efficiently:

EXPLAIN SELECT * FROM Songs WHERE ID = 123

The EXPLAIN statement is used to look at the details of how the SQL engine is going to run the query (known as a plan). In this case, the range /123-/124 is showing that we will scan rows with IDs at least 123 and strictly less than 124 – so just the row with ID 123. Ignore Level for now, we’ll get to that later.

Unfortunately, if we want to find the songs of a certain artist, CockroachDB will need to scan the entire table and compare each row against the artist name:

EXPLAIN SELECT ID FROM Songs WHERE Artist = 'Rick Astley'

The unbounded - range means we have to scan the entire table. This is no good if we have a sizable database of songs. The solution is to use an index on the artist:

CREATE INDEX ByArtist ON Songs (Artist)

This index allows us to efficiently find things by the artist name, at the cost of keeping the index updated whenever we make a change to the Songs table.

The details of how the index is stored internally are discussed in another blog post. For the purposes of this discussion, we can think of the index as a list of entries, one for each song in the original table. Each entry contains the artist and the ID (“primary key”) of the song. The entries are available in sorted order. With this, the query above looks better:

In English: we will directly retrieve the entries that exactly match the given artist name.

Let’s take a closer look at this example. We mentioned that each entry in the ByArtist index contains the artist and the song ID; and our sample query only asks for the ID of the song we are looking for – how (suspiciously) convenient!

But what if we want more information, like the title of the song?

Covering vs non-covering indexes

Let’s see what happens when our query requests a column that is not part of the index:

EXPLAIN SELECT ID, Title FROM Songs WHERE Artist = 'Rick Astley'

What just happened? Because the index only contains the song IDs for each artist, we cannot retrieve the song title from the index. We need to “join” the data from the index with the data in the main (primary) table, which involves getting the song IDs from the index and then retrieving the corresponding titles for those IDs from the main table. The implementation of this query is divided into reusable pieces of functionality that form a tree. The root of the tree (level 0) is the index-join node, which coordinates the retrieval of rows performed by the two scan nodes (leaves) on level 1.

Retrieving the titles from the main table happens in batches: each batch corresponds to a set of song IDs from the ByArtist index. In general, the rows we are looking for in the main table are at arbitrary positions (and not sequential), making this retrieval more expensive than a regular scan.

For this query, ByArtist is what we call a non-covering index, in that it doesn’t “cover” all the columns we need to retrieve. For the previous query (where we only wanted the IDs), the same index was covering. A covering index is preferable to a non-covering index in terms of efficiency.

What if the query above is frequent and we want to make it more efficient? We can change the index to make it store the title as well:

CREATE INDEX ByArtist ON Songs (Artist) STORING (Title)

Alternatively we could index both the artist and the title, with a similar result:

CREATE INDEX ByArtist ON Songs (Artist, Title)

The difference is that this index not only makes the song titles available, but it makes the songs of any given artist available in sorted order – which can be useful for some queries (as we will see). Of course, regardless of the method, changing the index to store an extra column is not free: the space used by the index will increase accordingly. One needs to be careful when adding columns to indexes, especially if the amount of data stored in those columns can be large.

Because using a non-covering index is more expensive, this is a factor that we care about during index selection – but it is not the most important factor.

Restricting the search space

When considering different indexes, the main objective is to minimize the amount of row data we have to scan; the more restrictive our search ranges are, the better.

Suppose we want to find all albums starting with the letter W that came out either in 1987 or 1989 and we have an index where the songs are primarily ordered by album:

CREATE INDEX ByAlbum ON Songs (Album, Year, Title) EXPLAIN SELECT Title FROM Songs WHERE Year IN (1987, 1989) AND Album >= 'W' AND Album < 'X'

We can narrow down the search space to those albums with the correct starting letter – all strings that are lexicographically ordered after W and before X.

But if we have an index where songs are primarily ordered by year, and within the same year ordered by the album:

CREATE INDEX Discographies ON Songs (Year, Album, Title)

Then the search can be much more focused; the same query yields the following plan:

The first range maps to albums released in the year 1987 which are lexicographically ordered between W and X. This index is preferable (for this query), and it will be chosen by the system when both indexes in this example are available.

Index ordering

Another factor involved in choosing an index is the desired result ordering.

Suppose we add another index:

CREATE INDEX ByYear ON Songs (Year, Title)

Say we want to look for all songs with a certain substring in the title and we want the results ordered by artist:

EXPLAIN SELECT ID, Title FROM Songs WHERE Title LIKE '%Give You Up%' ORDER BY Artist

Or by year:

EXPLAIN SELECT ID, Title FROM Songs WHERE Title LIKE '%Give You Up%' ORDER BY Year

Unfortunately we can’t do anything better than going through all the songs and checking the title. But we can at least choose the index that gives us the results in the order we want, avoiding any resorting (as signified by the nosort node).

Seems simple enough conceptually, but there are some subtle nuances to matching an ordering. For example, the ByArtist index can be used to provide an order by title if we know all results have the same artist:

EXPLAIN SELECT ID FROM Songs WHERE Artist = 'Rick Astley' ORDER BY Title

This is because ByArtist is an index on (Artist, Title) so it not only stores entries in artist order, but within the same artists they are in title order. Of course, we cannot avoid sorting if our query can potentially involve multiple artists:

EXPLAIN SELECT ID FROM Songs WHERE Artist >= 'R' AND Artist < 'S' ORDER BY Title

A particular ordering might also be desired when using aggregate functions MIN and MAX. If we want to get the first artist name (in lexicographical order), using the ByArtist index is immensely useful: we only need to get first entry in the index:

EXPLAIN SELECT MIN(Artist) FROM Songs

The 1:- signifies that within the otherwise unbounded range -, we only need the first entry. As above, the same index is as useful if we want the first song of a certain artist:

Limits

Given the multitude of factors to take into consideration, it is not surprising that many times we have to make a non-trivial choice between indexes. We may have an index that provides the ordering we are looking for, but might be less restrictive in terms of search space, or might be non-covering. Sometimes the LIMIT clause of a SELECT statement can tip the scale and change the outcome.

Getting back to the first index we looked at:

CREATE INDEX ByArtist ON Songs (Artist)

As discussed, If our query must return more than the song ID and the artist, this index is non-covering and we must cross-reference each result with the row in the original table. So if we want song titles, even if we want them in the order that matches this index, the system will choose not to use it:

EXPLAIN SELECT Title FROM Songs ORDER BY Artist

Because we have to go through all the songs anyway, it is cheaper to sort the results ourselves rather than using the non-covering index.

But a LIMIT can throw a wrench in the works of this reasoning: if we only want the first few songs, it will be very wasteful to scan the entire table, sort the rows, and then throw away everything except the top results. In this case, using the index is preferable – despite the overhead associated with non-covering indexes – as we will only have to go through a handful of rows:

EXPLAIN SELECT Title FROM Songs ORDER BY Artist LIMIT 10

Because we don’t have to sort the results – they just flow through the nosort node – we can stop scanning for rows as soon as we get the 10 results we are looking for (as orchestrated by the limit node).

Index hints

What if we hit a case where the index algorithms don’t correctly identify the best index? Hopefully that is rare, but CockroachDB provides a way to force the use of a specific index by appending @index_name or, equivalently, @{FORCE_INDEX=index_name} to the table name.

Take the following indexes and query:

CREATE INDEX ByArtist ON Songs (Artist) CREATE INDEX ByTitle ON Songs (Title) EXPLAIN SELECT Title FROM Songs WHERE Artist >= 'A' AND Artist < 'Z' AND Title >= 'A' AND Title < 'F'

This is a query where both indexes would work well. If the ByTitle index is more efficient, we can force it using either one of these statements:

EXPLAIN SELECT Title FROM Songs@ByTitle WHERE Artist >= 'A' AND Artist < 'Z' AND Title >= 'A' AND Title < 'F' EXPLAIN SELECT Title FROM Songs@{FORCE_INDEX=ByTitle}; WHERE Artist >= 'A' AND Artist < 'Z' AND Title >= 'A' AND Title < 'F'

If we want to disallow use of non-covering indexes, we can use @{NO_INDEX_JOIN}. We can also disallow using any index by specifying @primary:

EXPLAIN SELECT Title FROM Songs@{NO_INDEX_JOIN} WHERE Artist >= 'A' AND Artist < 'Z' AND Title >= 'A' AND Title < 'F' EXPLAIN SELECT Title FROM Songs@primary WHERE Artist >= 'A' AND Artist < 'Z' AND Title >= 'A' AND Title < 'F'

Future directions

CockroachDB is starting out with a solid index selection implementation that takes into account all the factors we discussed. But the choices we make are “static” in the sense that they don’t depend on the data itself – only on its structure. In the future we plan to keep track of various metrics like table sizes and distribution of values to make more intelligent decisions. We will also consider feedback mechanisms for frequent queries, where we sometimes try out a different plan, analyze its efficiency, and potentially switch to that plan in subsequent instances of similar queries.

Do all things distributed SQL and indexing make you giddy? Then good news — we're hiring! Check out our open positions here.