Writing History: MVCC range tombstones


This is part 3 of a 3-part blog series about how we’ve improved the way CockroachDB stores and modifies data in bulk (here is part 1 and here is part 2). We went way down into the deepest layers of our storage system, then up to our SQL schema changes and their transaction timestamps - all without anybody noticing (or at least we hope!).


•••

Recall from part 1 that MVCC keeps a history of updates to a key (row) over time, such that it’s possible to read its past state. This history is used for a wide range of functionality: transaction isolation, historical AS OF SYSTEM TIME queries, incremental backups, changefeeds, cluster-to-cluster replication, and so on. MVCC deletes are therefore soft deletes: rather than removing a key directly, the key is marked as deleted by writing a new tombstone version. Only when the tombstone has aged past the MVCC garbage collection threshold is the key finally removed from the database.

However, these tombstones present a problem for bulk deletions, where we delete large amounts of data. For example, if we were to drop a table with a billion rows, we would now have to write a billion new tombstones. Not only would this take a very long time, but it would counterintuitively also take up a lot of additional disk space.

Because of this, some CockroachDB bulk deletions “cheat” by directly removing the data in the storage engine — primarily schema deletions (e.g. table drops) and import job cancellation. While efficient, this can break functionality that relies on the MVCC history, requiring careful coordination. In particular, it also prevented us from implementing cluster-to-cluster replication (as described in part 1), since we need to keep a record of the operations to replicate them, and also be able to undo some of them in the target cluster when switching it over to a live cluster. This cheating also leads to a confusing user experience: for example, a dropped table is simply hidden from view and scheduled for removal when it reaches the garbage collection threshold — in the meantime, the data is still considered live, e.g. in statistics. Furthermore, only non-transactional admin operations can use these bulk deletes — SQL DELETE queries must write individual point tombstones, since they have to satisfy transactional guarantees.

In order to efficiently delete large amounts of data while also retaining the MVCC history, we decided to implement MVCC range tombstones: rather than writing individual point tombstones for every key, these can mark an entire keyspan as deleted with a single (cheap) write operation, regardless of the number of keys within it. While conceptually simple, this was a fundamental change to the core MVCC data structures, and required extensive changes throughout the entire MVCC and storage layer. To limit the scope of the project, we decided not to support them in transactions initially, since that would require major changes to the transaction engine as well (specifically, support for ranged write intents). This still allowed us to get rid of the “cheaters” and always maintain an accurate MVCC history, but unfortunately we can’t yet use them for more efficient SQL DELETE queries.

Figure: data deleted by multiple point tombstones (left) versus by a single range tombstone (right).

We’ve actually been down this path before, when we implemented LSM range tombstones in our Pebble storage engine. In fact, those are what we use to “cheat”, allowing us to efficiently remove large amounts of data directly in the storage engine. However, they sit at a lower level of the stack: Pebble doesn’t know anything about MVCC, and LSM range tombstones are final — once written, the data is effectively gone, so there’s no way to read any MVCC history below them. Even so, they gave us valuable experience with (and a healthy respect for) the challenges involved with range tombstones, and the MVCC range tombstone implementation was heavily inspired by Pebble’s LSM range tombstones.

Rather than focusing solely on MVCC range tombstones however, we implemented a more general MVCC range key concept, which allows storing an arbitrary value for a keyspan at a specific timestamp. This was primarily with an eye towards adding ranged write intents, which will be needed to use MVCC range tombstones in transactions (and thus in SQL delete queries), and it also allowed us to associate various internal metadata with MVCC range tombstones. In turn, these MVCC range keys are built on an even more general non-MVCC range key concept that was implemented in the Pebble storage engine.

Pebble range keys

Our previous experiences with LSM range tombstones, both within Pebble and within RocksDB before that, taught us that ranged operations within an LSM are complicated and full of subtleties. We decided to implement a generalized form of LSM range keys to ensure that future ranged operations (e.g. ranged transaction intents) could leverage this project’s work without duplicating complicated and subtle logic.

Pebble’s new generalized range keys add new primitives to the LSM. They introduce three new range key write operations: set, unset and delete. All three operations take a keyspan that defines the range over which the operation takes effect. A set upserts an individual range key over a keyspan. An unset removes an individual range key over a keyspan. A delete removes all range keys within a keyspan.

A set operation takes a secondary key called a “suffix” (possibly zero-length) and a value. For example, RangeKeySet([a,z), color=yellow) creates a new range key setting color to the value yellow over the keyspan [a,z). Range keys are defined over the same keyspace as ordinary point keys but have a parallel existence, and they do not alter or overwrite ordinary point keys. A point key-value pair b=foo and the range key [a,z): color=yellow can coexist, and an iterator configured to surface both will expose them alongside each other.
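
To make these primitives concrete, here is a minimal sketch of the write side. It assumes a recent Pebble version; the exact signatures and options have evolved over time, and the keys and values are purely illustrative:

package main

import "github.com/cockroachdb/pebble"

func main() {
	// Open a throwaway Pebble store. Range keys require a sufficiently
	// recent format major version.
	db, err := pebble.Open("demo", &pebble.Options{
		FormatMajorVersion: pebble.FormatNewest,
	})
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// An ordinary point key...
	_ = db.Set([]byte("b"), []byte("foo"), pebble.Sync)

	// ...and a range key over [a,z) with an empty suffix and the value
	// "yellow". The two coexist: the range key does not overwrite b=foo.
	_ = db.RangeKeySet([]byte("a"), []byte("z"), nil, []byte("yellow"), pebble.Sync)

	// RangeKeyUnset removes that individual range key; RangeKeyDelete
	// removes all range keys within the given keyspan.
	_ = db.RangeKeyUnset([]byte("a"), []byte("z"), nil, pebble.Sync)
	_ = db.RangeKeyDelete([]byte("a"), []byte("z"), pebble.Sync)
}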

The Pebble iterator has been extended to support iteration over range keys, either in isolation or interleaved with point keys. Interleaved iteration allows the iterator user to observe a point key and all overlapping range keys at the same time. Interpretation of the semantics of user-defined range keys is up to the user. These generalized range keys allow the user to perform arbitrary bulk operations modeled as range keys, and the user reconciles the point and range keys at iteration time.

func (i *Iterator) HasPointAndRange() (hasPoint, hasRange bool)
func (i *Iterator) RangeKeyChanged() bool
func (i *Iterator) RangeBounds() (start, end []byte)
func (i *Iterator) RangeKeys() []RangeKeyData
type RangeKeyData struct {
	Suffix []byte
	Value  []byte
}
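
To make the interleaving concrete, here is a hedged sketch of consuming such an iterator, reusing the db opened in the earlier sketch and assuming the fmt import. dumpKeys is a hypothetical helper (not a Pebble or CockroachDB function), and newer Pebble versions also return an error from NewIter:

// dumpKeys prints every point key and every range key in the store,
// interleaved by a single iterator.
func dumpKeys(db *pebble.DB) {
	// Surface both point keys and range keys from the same iterator.
	it := db.NewIter(&pebble.IterOptions{
		KeyTypes: pebble.IterKeyTypePointsAndRanges,
	})
	defer it.Close()

	for it.First(); it.Valid(); it.Next() {
		hasPoint, hasRange := it.HasPointAndRange()
		if hasPoint {
			fmt.Printf("point %s = %s\n", it.Key(), it.Value())
		}
		if hasRange {
			// All range keys overlapping this position, surfaced with a
			// single set of bounds (see fragmentation below).
			start, end := it.RangeBounds()
			for _, rk := range it.RangeKeys() {
				fmt.Printf("range [%s, %s) suffix=%q value=%q\n",
					start, end, rk.Suffix, rk.Value)
			}
		}
	}
}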

Multiple range keys may exist over the same keyspan, but only at unique suffixes. If two overlapping range keys have equal suffixes, the newer one overwrites the older one. When two range keys overlap with nonidentical bounds (e.g. [a,c) and [b,d)), the observable bounds of the range keys are fragmented at the unique overlap points (e.g. [a,b), [b,c), and [c,d)).

Figure: fragmentation of two overlapping range keys at different timestamps, resulting in three fragmented range keys.

This allows the iterator to present only one set of bounds at a time, and all range keys defined over those bounds. This simplifies bounds handling in Pebble and in the client code. Additionally, it allows Pebble to internally fragment range keys as they’re compacted down the LSM. Range keys can be stored in the same sstables as the point keys they cover, preserving locality of access.

A range key’s secondary key, or “suffix”, is so named because of its relationship to MVCC timestamps. Pebble doesn’t dictate any particular key encoding, although it imposes some constraints when certain features are in use. A user provides a Comparer that encapsulates a set of functions that understand the user’s key encoding and are used to compare keys, generate keys, and split keys to separate out MVCC timestamps. A user that implements MVCC, like CockroachDB, may provide a Split function on their Comparer that splits a user key into two portions: a prefix and a suffix. The suffix is expected to encode an MVCC timestamp and be comparable to other suffixes through the ordinary user-supplied Comparer functions. This limited knowledge of MVCC is used by Pebble, for example, to build bloom filters that ignore MVCC timestamps.

In the context of range keys, the “suffix” field supplied to a set or unset operation should hold an MVCC timestamp that’s comparable to the suffixes on point keys. In Cockroach, we use this field to store the MVCC timestamp of an MVCC range key, and the range key’s value encodes the type of operation performed.
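
For illustration only, a Split function for a toy textual key encoding might look like the sketch below (assuming the bytes and pebble imports). CockroachDB’s real encoding is binary and considerably more involved, and a real MVCC comparer must also order timestamps within a prefix, so this is not the actual implementation:

// Toy encoding: "<prefix>@<timestamp>", e.g. "apple@5". Split returns the
// length of the prefix portion of a key; Pebble uses it, for example, to
// strip MVCC timestamps when building bloom filters.
func split(key []byte) int {
	if i := bytes.LastIndexByte(key, '@'); i >= 0 {
		return i
	}
	return len(key) // no suffix: the whole key is the prefix
}

// The Comparer bundles split with the comparison functions. Plain
// bytes.Compare is not a correct MVCC ordering; it's used here only to keep
// the sketch short, and the other required Comparer fields are elided.
var toyComparer = &pebble.Comparer{
	Compare: bytes.Compare,
	Split:   split,
	Name:    "toy",
}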

Masking

Cockroach’s motivating use case for range keys was the implementation of MVCC range tombstones. An MVCC range tombstone marks a broad keyspan as deleted at a particular timestamp. The deleted keys are kept around for servicing reads at historical timestamps, but are known to be deleted at the current timestamp. However, most queries don’t want a historical view of the database; they want the current view. A broad swath of potentially millions of deleted keys poses a performance obstacle to real-time queries. New queries may write new, live keys within the deleted span, resulting in a few live keys interspersed with a multitude of deleted keys. Real-time queries must skip over all these deleted keys to find the live keys.

In order to make the implementation of MVCC range tombstones efficient, Pebble exposes the ability to configure iterators with range key ‘masking’. Enabling masking configures an iterator to treat range keys as masks. Point keys that are contained within the bounds of a range key and that encode an MVCC timestamp below the range key’s encoded timestamp are masked: the Pebble iterator will transparently hide these point keys from iteration. Masking shifts the burden of hiding keys deleted by MVCC range tombstones into Pebble itself but, as described, still requires internally iterating over all the deleted keys. To improve on this, masking can be configured to make use of block-property filters, a Pebble feature introduced in CockroachDB v22.1.

Cockroach configures Pebble to use a feature called block property collectors to annotate data blocks, index blocks, and sstables with the range of MVCC timestamps of the contained keys. A data block containing keys a@9, a@2, b@7, and c@3 would be annotated with the interval [@2-@10). Cockroach uses these annotations to perform time-bound iteration, such that an iterator only visits data blocks, index blocks, or sstables containing keys within a given time span. Incremental backups make use of time-bound iteration to back up only keys written since the last backup.

Range key masking can additionally be configured to use a block property encoding MVCC timestamps, allowing the iterator to skip data blocks, index blocks, or whole sstables that exclusively contain keys older than a range key’s MVCC timestamp and wholly contained within the range key’s bounds.
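
As a hedged sketch of what enabling this might look like when constructing a Pebble iterator: suffixAtReadTimestamp is a placeholder for the read timestamp encoded in the same suffix format as point keys, and field names may differ between Pebble versions:

it := db.NewIter(&pebble.IterOptions{
	KeyTypes: pebble.IterKeyTypePointsAndRanges,
	RangeKeyMasking: pebble.RangeKeyMasking{
		// Range keys with suffixes at or above this one act as masks: point
		// keys older than a masking range key are hidden from iteration.
		// suffixAtReadTimestamp is a placeholder, not a real variable.
		Suffix: suffixAtReadTimestamp,
		// A block-property filter can additionally be supplied (via the
		// Filter field) so that data blocks, index blocks, or sstables that
		// contain only masked keys are skipped without ever being loaded.
	},
})
defer it.Close()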

Figure: range key masking in CockroachDB.

In the above diagram, the range key [a-c)@7 masks points apple@5 and blueberry@3. Point banana@9 is not masked, because its MVCC timestamp is above the range key’s, and point cherry@2 is not masked, because it falls outside the bounds of the range key. If the keys apple@5 and blueberry@3 are the only keys in a data block, the Pebble iterator is able to skip the entire data block without ever loading it. Range key masking with a block property filter can reduce the latency of scanning in a keyspace with a MVCC range tombstone by several orders of magnitude.

MVCC range tombstones

MVCC range keys share most of the properties of Pebble range keys, including the fragmentation structure and stacking. The basic range key representation, specifying a keyspan at a timestamp, is:

type MVCCRangeKey struct {
	StartKey  roachpb.Key
	EndKey    roachpb.Key
	Timestamp hlc.Timestamp
}

A range key maps to an arbitrary binary value, and range tombstones are signified by an empty value, just like regular point tombstones. They’re sometimes combined as a ranged key/value pair:

type MVCCRangeKeyValue struct {
	RangeKey MVCCRangeKey
	Value    []byte
}

But they’re most often represented as a fragment stack, equivalent to the underlying Pebble data structure that we saw in the previous section:

type MVCCRangeKeyStack struct {
	Bounds   roachpb.Span
	Versions []MVCCRangeKeyVersion
}

type MVCCRangeKeyVersion struct {
	Timestamp hlc.Timestamp
	Value     []byte
}
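
For example (with purely illustrative keys and timestamps), a fragment covering [a, c) that was deleted twice would be surfaced as a single stack:

// A fragment over [a, c) deleted at timestamps 7 and 3. Versions are ordered
// newest-first, and the empty values mark them as range tombstones.
stack := MVCCRangeKeyStack{
	Bounds: roachpb.Span{Key: roachpb.Key("a"), EndKey: roachpb.Key("c")},
	Versions: []MVCCRangeKeyVersion{
		{Timestamp: hlc.Timestamp{WallTime: 7}}, // empty Value: a tombstone
		{Timestamp: hlc.Timestamp{WallTime: 3}},
	},
}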

In addition to the fragmentation structure produced by Pebble, MVCC range keys are also fragmented along CockroachDB’s Raft range boundaries: a range key written across multiple Raft ranges will appear as separate (but adjoining) range keys, one per Raft range. They will also split and merge along with the Raft ranges. This fragmentation can be physical (if the Raft ranges are located on separate nodes) or logical (if the Raft ranges are colocated on the same node), but in either case it arises from the fact that storage engine reads and writes must be truncated to the Raft range bounds.

As in Pebble, MVCC range keys and point keys exist separately, but CockroachDB uses interleaved iteration to overlay range keys onto overlapping point keys. It is then up to the reader to interpret them relative to each other. As a simple example, let’s say we want to scan the current live (non-deleted) rows in a table, i.e. a SELECT * FROM table. Very simplified, the low-level storage code for this might look something like this:

iter := engine.NewMVCCIterator(IterOptions{
	KeyTypes:   PointsAndRanges,
	LowerBound: tableStart,
	UpperBound: tableEnd,
})

// Seek to the table start, then step across the latest version of each key
// in the table until the iterator is exhausted.
for iter.Seek(tableStart); iter.Valid(); iter.NextKey() {
	// If this position doesn't have a point key (only a range key), skip it.
	hasPointKey, hasRangeKey := iter.HasPointAndRange()
	if !hasPointKey {
		continue
	}
	// If this position has a range tombstone (empty value), and its timestamp
	// is higher than the point key's, then the point key is deleted. Skip it.
	if hasRangeKey {
		// Versions are ordered newest-first.
		latest := iter.RangeKeys().Versions[0]
		if len(latest.Value) == 0 && iter.Key().Timestamp.Less(latest.Timestamp) {
			continue
		}
	}
	// If the point key is live (not a point tombstone, which would have an
	// empty value), add it to the result.
	if value := iter.Value(); len(value) > 0 {
		rows = append(rows, value)
	}
}

In the actual implementation, however, we often make use of Pebble’s range key masking to omit point keys that are covered by range keys, making this kind of iteration significantly more efficient.

Most of the code in and surrounding the MVCC layer had to be extended with similar logic to take MVCC range tombstones into account. This acts as a significant complexity multiplier, and we didn’t want to burden the higher layers with this, so we’ve attempted to confine range tombstone handling to the core storage code. We mostly accomplished this via two measures: tombstones of any kind are rarely exposed across RPC boundaries (so the SQL engine in particular doesn’t have to deal with them), and range tombstones can often be represented as synthetic point tombstones where they cover a point key.

MVCC range keys share most properties of MVCC point keys, and provide mostly equivalent guarantees such as linearizability (they trigger uncertainty restarts when present in a reader’s uncertainty interval) and serializability (they can’t be written below committed keys, and they interact with write intents, latches, and the timestamp cache in the same way), except that they are not transactional and thus not atomic across Raft ranges. They also have mostly equivalent metrics, such as key/value counts and sizes (e.g. `rangekeycount`), and contribute similarly to e.g. `livebytes` and `totalbytes`.

They also integrate with other components in mostly equivalent ways:

  • SQL: MVCC tombstones of any kind are never exposed to SQL clients or the SQL engine, so they have no impact on them.
  • Backup: range tombstones are backed up and restored in the same way as point tombstones, both for full and incremental backups.
  • Rangefeeds and changefeeds: range tombstones are emitted across rangefeeds as a new DeleteRange event type, containing the keyspan and timestamp of the deletion. These will not yet be emitted across changefeeds, because we don’t currently write range tombstones to online tables, but they might once SQL starts taking advantage of them, at which point changefeed sinks must be extended to handle them.
  • Garbage collection: data deleted by range tombstones is garbage collected in the same way as that deleted by point tombstones, as are the range tombstones themselves. Additionally, entire Raft ranges that are deleted by range tombstones (e.g. following table drops) are garbage collected more efficiently, by writing Pebble LSM range tombstones across the entire Raft range and eagerly garbage collecting adjacent Raft ranges that were also completely deleted. In particular, this ensures that the subsequent Pebble compaction can remove as much data as possible in a single cycle, reducing write amplification.

MVCC range tombstones are available in CockroachDB 22.2, but disabled by default since this infrastructure is new and rather complex. Adventurous users can enable them via the cluster setting storage.mvcc.range_tombstones.enabled, but (as with all new code) they come with risks, and could in the worst case cause data loss. They are currently only used for two purposes: schema deletions (e.g. table drops) and cleanup of canceled import jobs.

As the implementation matures, we plan to gradually roll them out across our cloud infrastructure, before enabling them by default in CockroachDB 23.1.

For more information on MVCC range tombstones, see the Pebble range key RFC and MVCC range tombstone tech note.

If you’ve made it all the way through parts I, II, and III of this blog series there should be some kind of award. We really appreciate your time and your interest. Please reach out to us on our community slack to say hello and to offer your feedback.

About the authors

Erik Grinaker

Versatile software engineer with expertise in Rust and Go systems programming, database internals, distributed systems, operating systems, and networking. Currently interested in building distributed database systems with ACID guarantees.
