Lessons learned from 2+ years of nightly Jepsen tests

Since the pre-1.0 betas of CockroachDB, we've been using Jepsen to test the correctness of the database in the presence of failures. We have re-run these tests every night as a part of our nightly test suite. Last fall, these tests found their first post-release bug. This blog post is a more digestible walkthrough of that discovery (many of the links here point to specific comments in that issue's thread to highlight the most important moments).

Two years of Jepsen testing

Running Jepsen tests in an automated fashion has been somewhat challenging. The tests run in a complex environment with network dependencies, spawn multiple cloud VMs, and have many potential points of failure, so they turn out to be rather flaky. If you search our issue tracker for Jepsen failures you'll see a lot of issues, but before the bug we're discussing here they were all benign - failures of our automation setup, not an inconsistency or bug in CockroachDB itself. By now, though, we've worked out the kinks and have the tests running reliably.

Our Jepsen test suite includes 7 different test workloads, combined with a number of different nemesis processes to simulate failures. Every night we start up a 5-node cluster and run each test+nemesis combination for 6 minutes each.

The test that found the bug was the register test with the split nemesis. The register workload is perhaps the simplest possible one for consistency tests: it performs simple single-key reads and writes, and verifies that reads see the last value written. The split nemesis uses the CockroachDB SPLIT AT command to cause a series of splits in the underlying KV store.

Finding a real bug

On Sep 29 (two years to the day after the beta release that Aphyr tested), the nightly test run reported a failure in the jepsen/register/split test. The logs included Jepsen's signature table-flipping error message Analysis invalid! (ﾉಥ益ಥ）ﾉ ┻━┻, indicating that this is a real inconsistency, not an automation failure.

The inconsistency involved a single key, 167. At 11:54:47, we read a value of 1 for this key. At 11:54:48, we read a value of 4. In between, there were attempts to write a value of 4 for this key, but they all failed. The split nemesis also activated during this time, suggesting that the split process may be involved in the bug, but it's not conclusive.

Since we later read the value 4, one of those failed writes must have in fact succeeded. In CockroachDB (as in any distributed system), failures are sometimes ambiguous in the event of a server being killed - it is impossible to tell whether the operation succeeded or failed. Jepsen understands this, and when it encounters an ambiguous error it considers both possibilities. This gave us our first theory of the bug: an ambiguous failure was getting incorrectly marked as unambiguous.

Searching for the cause

Armed with this theory, we started looking for possible causes. What kinds of errors was this test encountering that might be incorrectly classified? Fortunately we save packet captures for failed test runs, so we were able to dig in to the network traces and see. Initial analysis with wireshark pointed to the recently-added transactional pipelining feature. This feature altered error handling (in ways that sometimes made it harder to tell what exactly was going on), but we couldn't see any ways in which it could turn ambiguous failures unambiguous.

We also tried to reproduce the problem to see if we could get more information about it. Attempts to write a targeted test based on the theory that the split nemesis was responsible were fruitless. On Oct 24, nearly a month after the first failure, the nightly tests failed again. This time it wasn't the split nemesis but the majority-ring+subcritical-skews nemesis. This tells us a couple of things: our suspicion of the split process was likely misplaced, and the bug is pretty rare - it could easily have been lying dormant for a month or more before it was first observed.

With our split theory invalidated, we went back to the Jepsen tests as our best shot at reproducing the problem. To get more than one reproduction a month, Andrei built a test harness that ran 50 copies of the tests in parallel. This made it feasible for us to add additional logging and eventually get more details out of the failing tests.

The breakthrough

After a couple weeks of this (now two months since the first report), Tobi smelled a side-effect leak. In one of the traces he noticed that a waiting transaction was allowed to proceed at the same time that the transaction it was waiting on was going through a second iteration of the reproposal loop. Pulling on this thread revealed that when a transaction failed to commit on its first proposal, it would still signal waiting transactions as if it had.

The bug involves pipelined writes, a new feature in CockroachDB 2.1 described in detail in this blog post. In short, pipelined writes are performed asynchronously. We start the process of writing, but instead of waiting for the write to finish we move on to the next step in the transaction, and then before the transaction commits we verify that the write completed successfully.

Now we can walk through what happens when the bug occurs:

Transaction A begins its (pipelined) write: UPDATE test SET val=4 WHERE id=167. This leaves a write intent on key 167.
Transaction B tries to read key 167. It sees the pending write intent and goes into the transaction wait queue.
Transaction A tries to commit. With pipelined writes, when transactions commit they must also verify that the intents they wrote earlier are actually present. The first commit attempt passes this check, but while it is in progress a leadership transfer occurs. The old leader aborts its work in progress and the commit will be automatically retried on the new leader.
Now we hit the bug: Transaction A erroneously reports to the transaction wait queue that it has committed.
Transaction B wakes up and springs into action. Since it has been informed that transaction A committed, it proceeds to resolve the write intent on key 167. This means that from the perspective of an outside observer, key 167 now has the value 4 even though transaction A hasn't completed yet.
Transaction A makes a second attempt to commit (on the new leader). This time the commit fails because the pipelined write intent is no longer there - it was resolved by transaction B.

Now both transactions are confused. Transaction A believes that it has failed and reports an (unambiguous) error to the client. Transaction B believes that transaction A succeeded, and resolved its intent to make this effectively true. If these had been multi-key transactions, the situation could have been even worse: transaction B could have committed some intents while transaction A aborted others. This would mean that transactions that should be atomic could be broken up, violating one of the pillars of ACID. The bug is more severe than it first appeared, when we thought it was just about misclassifying ambiguous errors (although we never saw an in-the-wild instance of this more severe failure mode).

The fix

After all this, the fix was almost embarrassingly simple: the addition of a single if statement (in a patch that also included some other changes for testing).

Unfortunately, the two months of this investigation overlapped with the Oct 30 release of CockroachDB 2.1, so the bug was present in that release and fixed a few weeks later in 2.1.1.

On further investigation, the bug turned out to be quite old - it had been present in the codebase since version 2.0 (and was fixed for that branch in version 2.0.7). Version 2.0 introduced the transaction wait queue but did not yet have write pipelining. That's enough to trigger some version of the bug in theory, although in practice write pipelining makes unexpected outcomes much more likely.

What we learned

This issue demonstrates both the power and limitations of Jepsen as a testing tool. It's one of the best tools around for finding inconsistencies, and it found a bug that nothing else did. However, it's only as good as the workloads in the test suite, and the workloads we had were not very sensitive to this particular error. It took a long time to hit the error, and even when we encountered it, it wasn't very clear what had gone wrong. Developing new test workloads that are more sensitive to this error (and more importantly, other bugs that may be introduced in the future) remain an open development project.

If building out and proving the correctness of a distributed SQL system is your jam then good news — we’re hiring! Check out our open positions here.