
Why CockroachDB doesn't use EvalPlanQual

Last edited on April 5, 2024


    Here’s a surprising behavior of PostgreSQL you might not know about: under READ COMMITTED isolation, PostgreSQL can sometimes miss rows when performing UPDATE, DELETE, SELECT FOR UPDATE, or SELECT FOR SHARE statements. This is due to the EvalPlanQual recheck PostgreSQL adds to these statements to prevent lost-update anomalies.

    For CockroachDB’s new implementation of READ COMMITTED isolation, we considered building our own version of EvalPlanQual, but decided instead to use a different technique, one that doesn’t miss rows. Because no rows are missed, CockroachDB eliminates the need for application-level retries.



    What is EvalPlanQual?

    Under READ COMMITTED isolation, PostgreSQL adds a special recheck step to UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE statements. This recheck is known as EvalPlanQual, and consists of a re-evaluation of part of the statement after all qualifying rows are locked. PostgreSQL uses EvalPlanQual to prevent lost updates under READ COMMITTED isolation.

    To make this concrete, let’s walk through an example.

    An example of conflicting updates in PostgreSQL and CockroachDB

    Suppose we’re using SQL to organize a 3-on-3 basketball league. One of the tables in our schema holds information about every player in the league, including (a) their ID, (b) their name, (c) their skill level, and (d) their team assignment if they have one.

    CREATE TABLE player (
      id integer NOT NULL,
      name text NOT NULL,
      level text NOT NULL,
      team text,
      PRIMARY KEY (id)
    );

    INSERT INTO player VALUES
      (1, 'Gray', 'A', 'Dolphins'),
      (2, 'Mohan', 'A', 'Dolphins'),
      (3, 'Stonebraker', 'A', 'Dolphins'),
      (4, 'Lamport', 'A', 'Gophers'),
      (5, 'Ullman', 'A', 'Gophers'),
      (6, 'Lynch', 'A', 'Gophers'),
      (7, 'Bernstein', 'AA', 'Elephants'),
      (8, 'Liskov', 'AA', 'Elephants'),
      (9, 'Codd', 'AA', 'Elephants');

    When modifying this table, we’ll try to maintain two invariants:

    1. teams must have exactly 3 players

    2. all players on a team must have the same skill level
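
    To make these invariants easy to check as we go, here’s one way to express them as a query (just a sketch; an empty result means both invariants hold):

    -- List any team with a player count other than 3, or with players at more
    -- than one skill level. Free agents (NULL team) are ignored.
    SELECT team,
           count(*)              AS players,
           count(DISTINCT level) AS levels
    FROM player
    WHERE team IS NOT NULL
    GROUP BY team
    HAVING count(*) <> 3 OR count(DISTINCT level) > 1;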

    After a wildly successful season, we decide to move all of the Gophers up to skill level AA. We use a single UPDATE statement in an implicit transaction, which should atomically modify all three players and thus maintain our invariants.

    UPDATE player SET level = 'AA' WHERE team = 'Gophers';

    (To make the race easier to demonstrate, let’s sleep for 5 seconds at the beginning of the update.)

    WITH sleep AS (SELECT pg_sleep(5)) UPDATE player SET level = 'AA' FROM sleep WHERE team = 'Gophers';

    While that update is running, the Gophers and Dolphins decide to exchange players 3 and 4 in a trade. Again, we use a single UPDATE statement in an implicit transaction. This should atomically swap the skill level and team of the two players and thus, again, maintain our invariants.

    UPDATE player p1
    SET (level, team) = (
      SELECT level, team FROM player p2
      WHERE p2.id IN (3, 4) AND p2.id != p1.id
    )
    WHERE p1.id IN (3, 4);

    The second update succeeds immediately. What happens to the first update? Let’s consider the result by both database and isolation level.

    PostgreSQL’s result

    In PostgreSQL, under SERIALIZABLE isolation, the first update fails with a “could not serialize” error. During execution, PostgreSQL detects that another transaction has concurrently modified players 3 and 4, and aborts.

    Under READ COMMITTED isolation, the first update succeeds, but the result is anomalous: player 3 is now on the Gophers but is still at level A. Only two of the Gophers were moved up to level AA.

    michae2=# SELECT * FROM player ORDER BY id;
     id |     name     | level |   team
    ----+--------------+-------+-----------
      1 | Gray         | A     | Dolphins
      2 | Mohan        | A     | Dolphins
      3 | Stonebraker  | A     | Gophers     <- still at level A
      4 | Lamport      | A     | Dolphins
      5 | Ullman       | AA    | Gophers
      6 | Lynch        | AA    | Gophers
      7 | Bernstein    | AA    | Elephants
      8 | Liskov       | AA    | Elephants
      9 | Codd         | AA    | Elephants
    (9 rows)

    This violates invariant 2, which is surprising, because when considered individually both updates should have maintained our invariants.
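
    Running the invariant-check query sketched earlier against this state makes the violation concrete:

    -- The Gophers are now split across two skill levels.
    SELECT team, count(*) AS players, count(DISTINCT level) AS levels
    FROM player
    WHERE team IS NOT NULL
    GROUP BY team
    HAVING count(*) <> 3 OR count(DISTINCT level) > 1;
    -- expected result: ('Gophers', 3, 2)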

    CockroachDB’s result

    In CockroachDB, under both SERIALIZABLE and READ COMMITTED isolation the result is the same: the first update succeeds, and player 3 is now on the Gophers and correctly at level AA.

    demo@127.0.0.1:26257/demoapp/defaultdb> SELECT * FROM player ORDER BY id;
      id |     name     | level |    team
    -----+--------------+-------+------------
       1 | Gray         | A     | Dolphins
       2 | Mohan        | A     | Dolphins
       3 | Stonebraker  | AA    | Gophers      <- correctly at level AA
       4 | Lamport      | A     | Dolphins
       5 | Ullman       | AA    | Gophers
       6 | Lynch        | AA    | Gophers
       7 | Bernstein    | AA    | Elephants
       8 | Liskov       | AA    | Elephants
       9 | Codd         | AA    | Elephants
    (9 rows)

    How does PostgreSQL arrive at its anomalous result, and how does CockroachDB avoid the anomaly? To answer these questions let’s examine execution of the first UPDATE statement under READ COMMITTED isolation in both databases.

    A closer look at PostgreSQL’s UPDATE under READ COMMITTED isolation

    We’ll start by looking at EXPLAIN in PostgreSQL. The plan for the first UPDATE statement seems quite simple: a scan of the table followed by a filter.

    michae2=# EXPLAIN UPDATE player SET level = 'AA' WHERE team = 'Gophers';
                              QUERY PLAN
    -------------------------------------------------------------
     Update on player  (cost=0.00..1.11 rows=0 width=0)
       ->  Seq Scan on player  (cost=0.00..1.11 rows=3 width=38)
             Filter: (team = 'Gophers'::text)

    But that Update on player operation hides a lot of complexity. We can think of PostgreSQL executing the update under READ COMMITTED isolation in roughly seven steps.

    1. Establish a read snapshot for the statement. This snapshot is before the second update begins.

    2. Scan player at the read snapshot. (Seq Scan in the plan.)

    3. Filter on team = 'Gophers'. (Filter in the plan.) Players 4, 5, and 6 qualify since the read snapshot is before the second update.

    4. Lock all qualifying rows. (Part of Update in the plan.)

    5. Re-read the latest committed version of all locked rows. (Seq Scan in the plan again.) This picks up the change to player 4.

    6. Re-run the filter on the latest committed versions. (Filter in the plan again.) Now only players 5 and 6 qualify.

    7. Write a new version of each qualifying row. (Update in the plan.)

    Steps 5 and 6 are PostgreSQL’s strategy for preventing lost updates under READ COMMITTED isolation. After locking all qualifying rows, PostgreSQL re-evaluates the query steps a second time on the latest version of each locked row. This is the EvalPlanQual recheck. By doing this recheck, PostgreSQL picks up any changes made to those rows between reading them and locking them, which in this case correctly prevents modifying the skill level of player 4. Unfortunately, because PostgreSQL only rechecks the locked rows, it misses the change to player 3. We were expecting at least one of those two players to move up to skill level AA, but neither did.

    This behavior is documented, but it can still cause surprises.
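
    The pg_sleep trick is only there to widen the race window. Here’s a sketch of the same interleaving using explicit transactions in two psql sessions (the session labels are ours), which makes the sequence of events explicit:

    -- Session B: start the trade but don't commit yet, so rows 3 and 4 stay locked.
    BEGIN ISOLATION LEVEL READ COMMITTED;
    UPDATE player p1
    SET (level, team) = (
      SELECT level, team FROM player p2
      WHERE p2.id IN (3, 4) AND p2.id != p1.id
    )
    WHERE p1.id IN (3, 4);

    -- Session A: promote the Gophers. The statement's snapshot predates the trade,
    -- so players 4, 5, and 6 qualify, and the statement blocks trying to lock row 4.
    UPDATE player SET level = 'AA' WHERE team = 'Gophers';

    -- Session B: commit the trade. Session A unblocks, the EvalPlanQual recheck
    -- sees that row 4 is no longer a Gopher and skips it, and only rows 5 and 6
    -- are updated; row 3 is never revisited.
    COMMIT;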

    A closer look at CockroachDB’s UPDATE under READ COMMITTED isolation

    Next, let’s look at EXPLAIN in CockroachDB. Again, the plan for the UPDATE statement seems quite simple: a scan of the table followed by a filter.

    demo@127.0.0.1:26257/demoapp/defaultdb> EXPLAIN UPDATE player SET level = 'AA' WHERE team = 'Gophers';
                                                 info
    -------------------------------------------------------------------------------------------
      distribution: local
      vectorized: true

      • update
      │ table: player
      │ set: level
      │ auto commit
      │
      └── • render
          │
          └── • filter
              │ estimated row count: 3
              │ filter: team = 'Gophers'
              │
              └── • scan
                    estimated row count: 9 (100% of the table; stats collected 8 seconds ago)
                    table: player@player_pkey
                    spans: FULL SCAN

    Again, that update operation hides a lot of complexity. We can think of CockroachDB executing the update under READ COMMITTED isolation in roughly six steps.

    1. Create a savepoint in the current transaction.

    2. Establish a read snapshot for the statement. This snapshot is before the second update begins.

    3. Scan player at the read snapshot. (scan in the plan.)

    4. Filter on team = 'Gophers'. (filter in the plan.) Players 4, 5, and 6 qualify since the read snapshot is before the second update.

    5. Write an intent for each qualifying row.1 (update in the plan.)

    6. While writing intents, if the latest committed version of a row is newer than our read snapshot, roll back to the savepoint and go back to step 1. This gives us a new read snapshot taken after the second update.

    Step 6 is CockroachDB’s strategy for preventing lost updates under READ COMMITTED isolation. CockroachDB retries the entire UPDATE statement with a new read snapshot if it encounters a newer version of a row while writing an intent.2 By executing the entire statement again with a new read snapshot, CockroachDB can pick up any changes made after the previous read snapshot, which in this case includes changes to both players 3 and 4. Crucially, CockroachDB holds onto locks and intents across retries, which helps it make progress even during periods of heavy contention.
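
    To run the same experiment yourself in CockroachDB, READ COMMITTED first needs to be switched on. The cluster setting and session variable below are what we’d expect to use as of CockroachDB 23.2 (check the docs for your version):

    -- Enable the READ COMMITTED preview (cluster setting name as of CockroachDB 23.2).
    SET CLUSTER SETTING sql.txn.read_committed_isolation.enabled = true;

    -- Use READ COMMITTED for this session's implicit transactions.
    SET default_transaction_isolation = 'read committed';

    -- Replaying the two-session interleaving from the PostgreSQL sketch above, the
    -- promoting UPDATE blocks on the trade's intent, retries with a fresh snapshot
    -- once the trade commits, and players 3, 5, and 6 all end up at level AA.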

    How to prevent lost updates: a tradeoff?

    At first glance these two techniques for preventing lost updates seem to embody a tradeoff: PostgreSQL avoids retries, while CockroachDB avoids anomalies. But if we consider things from the application’s point of view, this is less of a tradeoff than it first appears. What would an application need to do to be sure that all Gophers were updated to skill level AA?

    When using PostgreSQL, the application itself would need to retry the first UPDATE statement until all Gophers were at level AA. Application-level retries are more expensive than database-level retries, due to network latency, so this is strictly worse than retrying within the database. There’s no way to avoid retrying the first UPDATE statement in this scenario, but CockroachDB is able to hide the retry from the application.
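
    Concretely, with PostgreSQL the application would have to drive a loop along these lines (a sketch; a real client would issue these statements from application code and stop once the check comes back empty):

    -- Re-issue the UPDATE, then verify; repeat while any Gopher is still below AA.
    UPDATE player SET level = 'AA' WHERE team = 'Gophers';

    SELECT count(*) AS stragglers
    FROM player
    WHERE team = 'Gophers' AND level <> 'AA';
    -- if stragglers > 0, run the UPDATE again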

    Conclusion

    PostgreSQL adds EvalPlanQual rechecks to UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE statements under READ COMMITTED isolation to prevent lost-update anomalies. EvalPlanQual lets PostgreSQL avoid internal retries of these statements, but it can cause PostgreSQL to miss rows, which in turn requires application-level retries to handle.3

    In CockroachDB’s new implementation of READ COMMITTED isolation, instead of EvalPlanQual we built an internal statement-retry mechanism that does not miss rows and thus eliminates the need for application-level retries.

    1 Intents are new versions of a row that also act as exclusive locks.

    2 Under SERIALIZABLE isolation, CockroachDB retries entire transactions when a serializable history cannot be achieved through other means. In this case we’re using a single-statement implicit transaction, so CockroachDB appears to have the same behavior under both SERIALIZABLE and READ COMMITTED isolation. But note that if the scenario were slightly different, say a move of a player to the Gophers rather than a swap, the two isolation levels would show different behavior.

    3 This article focuses on PostgreSQL, but it’s worth noting that MySQL (using InnoDB) can also miss rows with some tweaks to the scenario. Of the databases we’ve studied so far, Oracle’s UPDATE behavior under READ COMMITTED isolation is the most similar to CockroachDB’s.
