Adventures in performance debugging

As we’ve built CockroachDB, correctness has been our primary concern. But as we’ve drawn closer to our beta launch, we’ve had to start paying significantly more attention to performance. The design of CockroachDB always kept performance and scalability in mind, but when you start measuring performance, there are inevitably surprises. This is the story of the detection, investigation, and fix of just one performance bug.

First, a little context about CockroachDB for those new to the project. CockroachDB is a distributed SQL database which employs RocksDB to store data locally at each node. A basic performance test is to write random data into a table and test how many operations per second can be achieved. This is exactly what the block_writer example program does. It uses a simple table schema:

CREATE TABLE IF NOT EXISTS blocks ( block_id BIGINT NOT NULL, writer_id STRING NOT NULL, block_num BIGINT NOT NULL, raw_bytes BYTES NOT NULL, PRIMARY KEY (block_id, writer_id, block_num) )

And then spawns a number of workers to insert data into the table:

INSERT INTO blocks (block_id, writer_id, block_num, raw_bytes) VALUES ($1, $2, $3, $4)

The block_id is randomly chosen and writer_id is uniquely assigned to each worker. The block_num field is monotonically increasing and ensures that there will never be duplicate rows inserted into the table. The effect is that we’re inserting random rows into the table and never experiencing contention. What could go wrong?

1. The Bug: Rapid performance deterioration

A few weeks ago my colleague Matt Tracy ran the block_writer and discovered rapid performance deterioration:

1s: 1944.7/sec 2s: 1067.3/sec 3s: 788.8/sec 4s: 632.8/sec 5s: 551.5/sec … 1m0s: 105.2/sec

Oh my, that isn’t good. Performance starts out at a reasonable 2000 ops/sec, but quickly falls by a factor of 20x. Matt suspected there was some scalability limitation with tables. He noted that once a table fell into this bad performance regime it stayed there. But if he dropped the blocks table and created a new one, performance reset only to degrade rapidly again.

Like any good engineer, Matt turned to cpu profiling to try and determine what was going on. Was there some loop with horrible algorithmic complexity based on the table size? Unfortunately, the profiles didn’t reveal any culprits. Most of the cpu time was being spent inside RocksDB, both during the good performance regime and the bad performance regime. The builtin Go profiling tools are quite good, but they are unable to cross the cgo boundary (RocksDB is written in C++).

2. Snowball tracing to the rescue

Matt was a bit stumped for how to proceed at this point. Conveniently, another engineer, Tobias Grieger, was experimenting with adding “snowball” tracing to SQL queries. Unlike sampling-based profilers which periodically stop a program and determine what code is running, a tracing system records implicit or explicit events associated with a specific request. The support Tobias was adding was a new EXPLAIN (TRACE) mode. After the addition of some more tracing events, here is what Matt saw:

I’ve edited the output for clarity and highlighted two lines which shed light on the problem. It should be clear that you can’t achieve 2000 ops/sec, or 1 op every 0.5 ms, if part of the time to perform an operation takes >7ms. It is interesting that this time is being consumed in writing the transaction record at the start of a transaction.

Matt continued to add more instrumentation until the problem was narrowed down to a single RocksDB operation. At this point Matt tagged me in since I’ve had the most experience with our usage of RocksDB. I came onto the field swinging and wrote a micro-benchmark that duplicated the behavior of the BeginTransaction and utterly failed to find any performance degradation. Hmm, curious. I decided to verify I could reproduce the block_writer performance degradation (trust, but verify) and, of course, the problem reproduced immediately. I also verified that checking to see if the transaction record was present at the start of a transaction was the time consuming operation.

3. RocksDB, CockroachDB, and MVCC

Now to provide a bit of background on CockroachDB’s MVCC (multi-version concurrency control) layout and transactions. In an MVCC system data is never overwritten; instead it is tagged with a version number (in CockroachDB this is a timestamp). Newer data “shadows” older data, but that older data is still present and is deleted via a background garbage collection process. What does this look like? Consider we have a key a at the timestamps 1 and 2 and key b at timestamp 3:

a@2 a@1 b@3

Notice that the keys are sorted by descending timestamp. This ordering plays nicely with the RocksDB iterator abstraction. There is also a special timestamp used for keys which hold info about an in-process write to a key. These “write intents” have a timestamp which sorts before any other timestamp. Let’s say there’s a write intent on key b:

a@2 a@1 b@intent b@3

If we want to find the latest version of key b we can seek the iterator to b@intent. Iterator seeking is defined to return the next key greater than or equal to the seek key. So by seeking to b@intent we’ll either find b@intent (if it exists) or b@3 or possibly c@ if there are no versions of b. Lastly, when a key is deleted a deletion marker is added:

a@4(deleted) a@2 a@1 b@intent b@3

4. Dead ends: bloom filters and skip lists

Ok, back to the performance degradation. We know that a RocksDB iterator seek of txn-key@intent is taking milliseconds to return. My mental model says this should be a very quick operation. Let’s figure out why it isn’t.

The first thing I checked was whether the RocksDB bloom filters were being utilized. Bloom filters are a probabilistic data structure that can determine whether an element is a member of a set. When starting a transaction, the transaction key has not been seen before, so the bloom filters should be able to quickly reject any more expensive operation.

When I turned on the logging of the bloom filter hit rate, I saw that the bloom filters were not being used at all. Huh? I scratched my head for a bit before turning on the logging for sstable creation, which is the structure where the bloom filter resides. Hmm, no sstables are being generated. This meant that the performance degradation was happening while all the data still resided in memory and before anything had been flushed to disk.

This points the finger at the RocksDB MemTable which holds recent writes and periodically flushes them to disk. Lookups in the MemTable should be very fast. The MemTable is implemented using a skip list – a neat data structure similar to a b-tree but that can be implemented in a lock-free manner. When you see “skip list” you should be thinking “fast, ordered data structure”. How can the seeks within the skip list be taking milliseconds? A quick bit of instrumentation in SkipListRep::Iterator::Seek absolved it of blame. Individual seeks within the skip list were taking a handful of microseconds which was in line with my expectations.

5. Deletion tombstones

A quick recap of where we are at:

We have only a single memtable containing all of the data.
An iterator into this memtable is sometimes taking milliseconds to perform a seek operation.
The memtable is implemented on top of a skip list and seeking inside the skip list is taking microseconds.
The memtable and skip list are both concurrent for readers and no locks are being taken.

So where is the time going?

This is the easy part of performance debugging. We have a reproducible test case and we have timing instrumentation which shows the problem at one level of the code and not at another. Ideally we’d perform a binary search of the code in between, narrowing our timing instrumentation until we identified by the culprit. Unfortunately, we have no easy tools to perform such a search so we have to do it manually. I won’t bore you with the specifics of the binary search and instead take you directly to the culprit: deletion tombstones.

What is a deletion tombstone? RocksDB is a log-structured merge-tree. Unlike a b-tree where mutations are performed “in-place”, a log-structured merge-tree performs edits similar to the way an accountant maintains finances: by adding records to indicate changes to the past instead of rewriting history. In RocksDB, when you overwrite a key you don’t actually delete the old key but instead add a record for the new key that has a newer timestamp (sequence number) that the old key. When you delete a key you add a record (a deletion tombstone) that shadows the existence of the old key.

The processing of deletion tombstones in RocksDB happens during the seek within the memtable. In pseudo-code this looks like:

for iter = skiplist.seek(key); !iter.done(); iter.next() { if !isDeletionTombstone(iter) { break } }

Instrumentation revealed that the above loop in RocksDB (located in DBIter::FindNextUserEntryInternal) was iterating upwards of 20,000 times when performance deterioration became noticeable.

6. The full story and the fix

The block_writer tool writes random rows in a table. The performance of writing the transaction record associated with each row write degrades rapidly. Transaction records are written at a key derived from the first key written within the transaction and that key is deleted when the transaction is committed. So during random insertion into a table we have a segregated part of the key space where we’re randomly inserting keys and almost immediately deleting them creating a large expanse of deletion tombstones. Before inserting the transaction key we’re seeking within this deletion tombstone expanse and the seek needs to iterate from the (non-existent) transaction key to the next non-deleted key. For example, let’s say we’re creating txn100 (txn’s have a UUID associated with them, I’m using sequential numbers here for conciseness) and we have the existing set of keys:

txn99@T(deleted) txn101@T(deleted) txn102@T(deleted) txn103@T(deleted) ... txn40000@T(deleted)

We can very quickly seek to find txn101@deleted but then we have to iterate until we find the next non-deleted key. It should be fairly clear that performance will quickly be dominated by that iteration.

So seeking into a memtable with a large contiguous region of deleted keys is a bad idea. What can we do to fix the problem? It turns out that RocksDB has a solution readily at hand: when you create an iterator you can specify an upper-bound past which the iterator will not iterate (see rocksdb::ReadOptions::iterate_upper_bound). CockroachDB’s MVCC layer was already indicating when it wanted to restrict iteration to a specific range of keys so it was straightforward to modify the code to allow specification of the iteration upper-bound.

Recall the performance we were seeing earlier:

1s: 1944.7/sec 2s: 1067.3/sec 3s: 788.8/sec 4s: 632.8/sec 5s: 551.5/sec ... 1m0s: 105.2/sec

After the fix, we see:

1s: 3028.0/sec 2s: 2929.2/sec 3s: 2742.2/sec 4s: 2921.9/sec 5s: 2859.4/sec … 1m0s: 2579.0/sec

Still a bit of a fall-off, but nothing close to the 20x decrease we were seeing before.

That concludes our performance debugging adventure. It’s interesting that not only did the fix improve stability of performance in this benchmark, but it also improved top end performance on the benchmark and performance on a number of other benchmarks that created similar swaths of deletion tombstones.