Implementing column families in CockroachDB

CockroachDB is a distributed SQL database built on top of a transactional key value store. We don’t (yet) expose the kv layer but it’s general purpose enough that we’ve used it to implement SQL without any special trickery.

The particulars of how we represent data in a SQL table as well as the table metadata are internally called the “format version”. Our first format version was deliberately simple, causing some performance inefficiencies. We recently improved performance with a technique called column families, which pack multiple columns in one kv entry.

Once implemented, column families produced dramatic improvements in our benchmarks. A table with more columns benefits more from this optimization, so we added a benchmark of INSERTs, UPDATEs, and DELETEs against a table with 20 INT columns and it ran 5 times faster.

Press on, dear reader, and I’ll explain the details of how we did it and how they work.

Format Version 1: CockroachDB Before Column Families

CockroachDB requires every SQL table to have a primary index; one is generated if it was not provided by the user. Our first format version stored the table data as kv entries with keys prefixed by the columns in the primary index. The remaining columns were each encoded as the value in a kv entry. Additionally, a sentinel key with an empty value was always written and used to indicate the existence of a row. This resulted in N+1 entries for a table with N non-primary index columns. Secondary indexes work a little differently, but we don’t need them for today.

This all results in something like:

/<tableID>/<indexID>/<primaryKeyColumns...>/<columnID> -> <4 byte CRC><encoded value>

And more concretely:

CREATE TABLE users (id INT PRIMARY KEY, name STRING, email STRING); INSERT INTO users (11, "Hal", "hal@cockroachlabs.com"); INSERT INTO users (13, "Orin", "orin@cockroachlabs.com"); /<tableid>/0/11/0 -> <empty> /<tableid>/0/11/1 -> "Hal" /<tableid>/0/11/2 -> "hal@cockroachlabs.com" /<tableid>/0/13/0 -> <empty> /<tableid>/0/13/1 -> "Orin" /<tableid>/0/13/2 -> "orin@cockroachlabs.com"

Note that columns never use ID 0 because it’s reserved for use as the sentinel. This is all described in much more detail in the original SQL in CockroachDB: Mapping Table Data to Key-Value Storage blog post. If you haven’t read it, I highly recommend you do.

The Trouble with Format Version 1

Everything has to start somewhere, and while our first format version worked, it was a little inefficient. The encoded primary index data in the key was repetitive, and there is an MVCC timestamp and checksum for each entry, collectively wasting disk space and network bandwidth.

Perhaps worse was that there was per-key overhead at the transaction level. Every key written within a transaction has a “write intent” associated with it. These intents need to be resolved when the transaction is committed, taxing performance.

While our disk format avoids the key repetition with an incremental prefix encoding, the timestamp and the checksum still create ~12 bytes of overhead per key, not to mention the intents.

Since the problem was using one kv entry per column in the table, the natural solution was to group multiple columns into one value. Several NoSQL databases use a similar technique and call each group a “column family”.

Implementing Column Families in CockroachDB

When we set out to implement column families, the first wrinkle was deciding whether to support get and set on individual columns in a family or to load and store an entire family to change one column. The former would allow us to make every table’s primary data one key value entry. Unfortunately, it would also require the kv layer to understand the encoding that packs multiple columns in one value. If we later decided to change the encoding, it would be much more difficult to migrate if it were baked into the key value layer. Plus, the tidy separation they’ve enjoyed so far has been a big help to testability and moving quickly. We felt this wasn’t a worthwhile tradeoff.

As a result, we support multiple column families per table, so that setting a small field doesn’t necessitate roundtripping any large fields in the same table.

Side note:
A common question we get is whether we support use of the key value layer directly. We don’t right now, but by using one entry instead of two, we’ve gotten much closer to eliminating the overhead of using the CockroachDB key value store via a two column key and value SQL table.

How Do Column Families in CockroachDB Work?

Before column families, the value of an encoded table column was structured as:

<crc><typetag><encodedvalue>

With column families, this is now:

<crc><columnid0><typetag0><encodedvalue0>...<columnidn><typetagn><encodedvaluen>

or for our example above

/<tableid>/0/11/0 -> <crc>/1/string/"Hal"/1/string/"hal@cockroachlabs.com" /<tableid>/0/13/0 -> <crc>/1/string/"Orin"/1/string/"orin@cockroachlabs.com"

Notably, the column IDs in the keys have been replaced by family IDs. The first family ID is 0, doubling as the sentinel, and is always present. We use a variable length encoding for integers, including column IDs. This encoding is shorter for smaller numbers, so instead of storing the column ID directly, we store the difference to keep them smaller. NULLs are omitted to save space.

A couple of the existing data encodings (DECIMAL and BYTES) didn’t self-delimit their length. It’s desirable if we can extract the data for some of the columns without decoding them all, so we added variants of these two encodings that are length prefixed.

A constant concern of working in any system that persists data is how to read old data with new code. We made column families backward compatible by special casing a family that’s only ever had one column; it’s encoded exactly as it was before (with no column ID). This also happens to have the side benefit of being a nice space optimization.

All this and more is detailed in the Column Families RFC if you’re interested.

Using Column Families

When a table is created, some simple heuristics are used to determine which columns get grouped together. You can see these assignments in the output of the `SHOW CREATE TABLE` command.

CockroachDB can’t know the query patterns of a table when it’s created, but the way a table is queried has a big impact on the optimal column family mapping. So, we decided to allow a user to manually tune these assignments when necessary. A small extension (`FAMILY`) was added to our SQL dialect to allow for user tuning of the assignments. The various tradeoffs are detailed in our column families documentation.

Building a SQL database after the rise of NoSQL means that CockroachDB gets to pick the best parts of both. In this case, we were able to use column families, an optimization commonly found in NoSQL databases, to speed up our SQL implementation. The resulting performance improvement moves us one step closer to our 1.0 release.

Does jamming on distributed SQL put a spring in your step? Then good news — we're hiring! Check out our open positions here.