# Protected Timestamps: For a future with less garbage

CockroachDB relies heavily on multi-version concurrency control (MVCC) to offer consistency guarantees. MVCC, by design, requires the database to keep multiple revisions of values. A buildup of these revisions can impact query execution costs for certain workloads. In this article, we introduce the idea of Protected Timestamps: a novel way for CockroachDB to efficiently retain only relevant revisions, while safely deleting (garbage collecting) old ones.

MVCC is implemented by our storage layer (Pebble). An update to a key does not overwrite the older version of the value, but instead creates a newer version of the value at a higher timestamp. Unfortunately, disk space is not unlimited, and with MVCC comes the need to clean up obsolete versions of data. As noted above, the accumulation of expired MVCC values drives up query execution costs, so we want to delete them as eagerly as possible.

On the other hand, long-running operations such as backups or analytical queries *need* historical data to generate a consistent snapshot of the database as it changes over time. Asking customers to choose between a performant database and resilient backups is a bad trade-off, so we won’t.

In this post, we’re going to talk about an internal subsystem of CockroachDB, what we call protected timestamps, and how it is leveraged for finer-grained control over the garbage collection of keyspans to ensure the safety of operations that rely on historical data. We will touch on how CockroachDB’s transition to a multi-tenant architecture motivated a rewrite of a system that has existed since version 20.1, and how this redesign will enable users to run their database with a more aggressive garbage collection configuration than the current default of 25 hours.

## Introduction to garbage collection

CockroachDB stores all data in a monolithic, sorted map of keys and values. This monolithic keyspace is divided into contiguous chunks called ranges such that each key is found in one range. A range is replicated across the nodes in a cluster, and the copies of a range are known as replicas. A range (and all its replicas) belongs to a replication zone that controls various configurations including how long older revisions of data in that range are kept around before becoming eligible for garbage collection (GC). This duration is referred to as the range’s garbage collection time-to-live (GC TTL) and can be configured by altering the replication zone that applies to the range.

By default, all ranges in CockroachDB have a GC TTL of 25 hours. A single MVCC version of a key is considered expired if the version’s MVCC timestamp is older than the range’s view of now() less its GC TTL, and there is a newer version of the same key. This timestamp, below which all older versions of a key are considered expired, is called the garbage collection threshold (GC threshold). CockroachDB does not allow reads to be performed below the GC threshold of a replica.
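The expiry rule above can be sketched in a few lines of Go. This is an illustrative sketch only: the function names (`gcThreshold`, `isExpired`) and the use of plain integer timestamps are our own simplifications, not CockroachDB's internal API.

```go
package main

import "fmt"

// gcThreshold is the timestamp below which older versions of a key are
// considered expired: the range's view of now() less its GC TTL.
func gcThreshold(now, gcTTL int64) int64 {
	return now - gcTTL
}

// isExpired reports whether a single MVCC version is expired: its
// timestamp must fall below the GC threshold AND a newer version of the
// same key must exist. The latest version of a key is never garbage.
func isExpired(versionTS, now, gcTTL int64, hasNewerVersion bool) bool {
	return versionTS < gcThreshold(now, gcTTL) && hasNewerVersion
}

func main() {
	const hour = int64(3600)
	now := 100 * hour
	ttl := 25 * hour // the default GC TTL of 25 hours

	// A 30-hour-old version shadowed by a newer write is expired.
	fmt.Println(isExpired(now-30*hour, now, ttl, true)) // true
	// The same version without a newer write is the live value: kept.
	fmt.Println(isExpired(now-30*hour, now, ttl, false)) // false
}
```

Note that both conditions matter: a key written once 30 days ago and never updated is still fully readable, because its only version is the current one.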

We now have all the concepts required to build an intuition for how garbage collection in CockroachDB works. Every replica on a node is periodically processed by the MVCC GC queue running in the node’s key-value server. This queue is responsible for computing the GC threshold that applies to all keys in that replica, identifying the MVCC revisions that have expired relative to this threshold, and determining if there is adequate data to run garbage collection on these expired revisions. Details about the heuristics used by the GC queue are outside the scope of this blog post; for simplicity we will consider all expired revisions eligible for garbage collection.

## Protection from garbage collection

As the name suggests, the protected timestamp subsystem exposes primitives to protect MVCC revisions at or above certain timestamps from garbage collection. So how does one go about protecting a key’s revisions from GC? As is true for everything in a database, the answer is not so simple.

### Protected timestamp record

A protected timestamp record (PTS record) is the basic unit of the protected timestamp subsystem. A record is uniquely identifiable, and is associated with a timestamp and a target that the PTS record is responsible for protecting from GC. As we know by now, GC decisions are made at the replica level. Oddly enough, a PTS record does not target a replica but instead protects one of the schema objects in the CockroachDB catalog hierarchy, i.e. the cluster, a database, or a table. Eventually, a table is associated with replicas, but the hierarchical nature of schema objects allows protections to cascade to objects lower in the catalog hierarchy. In other words, a protected timestamp record that is protecting a database would also protect all existing tables in the database, as well as new tables created in the future.

*Take note of this design decision, as it will make sense when we discuss how backups use protected timestamp records!*

### Reconciliation of SQL state into KV state

In the previous section we glossed over the PTS record target and its association with replicas. Let’s take a quick detour to unpack this association as it is rather interesting in a multi-tenant architecture. CockroachDB can be broadly divided into a SQL layer and a key-value (KV) or storage layer. While schema objects such as databases and tables are constructs in the SQL process, replicas, the MVCC GC queue, and garbage collection operate in the KV process of the database. In a multi-tenant architecture the SQL and KV layers run in isolated processes on different machines, and so we rely on yet another subsystem to translate, transport and apply SQL state to the KV process.

In the context of protected timestamps we have a “watcher” that tracks the addition and deletion of PTS records in the SQL layer. On every update, the target of the PTS record is translated to the keyspans it applies to. For example, a database target will descend the catalog hierarchy and find all table keyspans that lie within that database. With the knowledge of what keyspans we want to protect, we send an RPC to the KV layer to persist information about the PTS record in the ranges containing the keyspans. This protection will be asynchronously replicated to all replicas of the range across nodes. It is in this manner that a PTS record in the SQL layer is associated with the appropriate replicas in the KV layer.
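The translation step above can be sketched as follows. The catalog types and span encoding here are simplified stand-ins invented for illustration; CockroachDB's real descriptor and span machinery is considerably richer.

```go
package main

import "fmt"

// Table and Database are toy stand-ins for catalog descriptors.
type Table struct{ Name string }

type Database struct {
	Name   string
	Tables []Table
}

// spansForDatabaseTarget descends the catalog hierarchy: a PTS record
// targeting a database resolves to one keyspan per table currently
// inside it. The resulting spans are what the watcher would ship to the
// KV layer in an RPC.
func spansForDatabaseTarget(db Database) []string {
	spans := make([]string, 0, len(db.Tables))
	for _, t := range db.Tables {
		spans = append(spans, "/"+db.Name+"/"+t.Name)
	}
	return spans
}

func main() {
	foo := Database{Name: "foo", Tables: []Table{{"users"}, {"orders"}}}
	fmt.Println(spansForDatabaseTarget(foo)) // [/foo/users /foo/orders]
}
```

Because the translation happens on every update, a table created after the record was written is picked up the next time the watcher resolves the target, which is what lets a database-level protection cover future tables.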

### You shall not pass

Now that the protection has made its way to the target replicas, we can talk about how it influences garbage collection decisions for the keys in a particular replica. As mentioned before, a PTS record is associated with a timestamp, above which the record is expected to protect data from garbage collection. Every time the replica is processed by the MVCC GC queue, the queue checks for the existence of a PTS record that is protecting the replica. Note that there can be multiple PTS records protecting different (or the same) timestamps. If the queue finds a valid PTS record (that is, a record protecting a timestamp that has not already fallen below the replica’s GC threshold), that timestamp becomes the effective GC threshold of the replica.

In other words, garbage collection is not allowed to progress on the keys in this replica beyond the protected timestamp. The PTS record holds up GC until the record is explicitly removed, following which GC will progress to the next effective GC threshold of the replica. In this manner, records can be strategically placed to protect only the data we need, while other keys are garbage collected according to the user-configured GC TTL.
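A minimal sketch of how an effective GC threshold might be computed, assuming integer timestamps and an invented `effectiveThreshold` helper (CockroachDB's actual GC queue logic is more involved):

```go
package main

import "fmt"

// effectiveThreshold computes how far GC may advance. It starts from the
// default threshold (now less the GC TTL) and is pulled down to the
// oldest valid protected timestamp. A PTS record is only valid if its
// timestamp has not already fallen below the replica's current GC
// threshold; stale records no longer hold up GC.
func effectiveThreshold(defaultThreshold, currentThreshold int64, protected []int64) int64 {
	eff := defaultThreshold
	for _, pts := range protected {
		if pts >= currentThreshold && pts < eff {
			eff = pts
		}
	}
	return eff
}

func main() {
	// The GC TTL would allow collection up to t=75, but records protect
	// t=40 and t=60, so the effective threshold is held back at t=40.
	fmt.Println(effectiveThreshold(75, 30, []int64{40, 60})) // 40
	// A record at t=20 already fell below the current threshold (t=30),
	// so it is invalid and GC proceeds to the default threshold.
	fmt.Println(effectiveThreshold(75, 30, []int64{20})) // 75
}
```

Once the record at t=40 is removed, a subsequent pass of the GC queue would compute an effective threshold of t=60, and so on, which is exactly the "hold up, then release" behavior described above.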

## Protection in production

Fine-grained control over the garbage collection of data allows long-running processes in the database to keep older revisions of values for as long as they need them. In this section we’ll talk about one of the primary use cases of protected timestamps.

### Backups

A backup is the primary disaster recovery tool offered by CockroachDB to take a consistent, logical snapshot of the database in a given window of time. CockroachDB offers two flavors of backup: full and incremental.

- Full backups export all the target data as of a given timestamp.
- Incremental backups only export changes in the target data since the last backup in the chain.

Prior to 22.2, the execution of incremental backups was tightly coupled with the GC TTL set on the target of the backup (cluster, database, or table). This is because in order for an incremental backup to succeed, it must be able to read all revisions of the target data written since the previous backup in the chain. If data in that time window has been garbage collected, the incremental backup would fail, as it would not be able to read all the changes since the previous backup.

Failing backups are unacceptable to customers. For this reason, they were strong-armed into setting a GC TTL high enough to cover their backup cadence, which results in the unnecessary accumulation of garbage. Historically, this has been the primary reason Cockroach clusters default to a 25-hour GC TTL, a duration most folks consider too high! Protected timestamps pave the way for a future with less garbage.

From 22.2 onwards, scheduled backups ensure that the data to be exported is protected from garbage collection until it has been successfully backed up. Let us take an example of a schedule that runs a full backup of database foo at t1, and an incremental backup every 5 minutes after the full backup at t5, t10, and so on.

1. The schedule writes a protected timestamp at t1 on the database foo, before running the full backup. This guarantees that the full backup will be able to read all revisions of keys in the database foo as far back as t1 since garbage collection on this database will not progress beyond t1.
2. Once the full backup completes, the schedule will continue to hold on to the protected timestamp record until the incremental backup at t5 is scheduled to run. This is because the incremental backup needs to read all revisions in (t1, t5] to produce a consistent snapshot of the data.
3. Once the incremental backup completes, the schedule will pull the protected timestamp up from t1 to t5, thereby allowing garbage collection on database foo to progress if need be. We can move the protection forward because we have successfully backed up all data in [t1, t5] and no longer require older revisions in that timespan. The next incremental backup only needs data from t5 onwards.
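The chaining in the steps above can be sketched as follows. The `Schedule` type and its method are invented for illustration; the real backup scheduler manages PTS records through the subsystem described earlier.

```go
package main

import "fmt"

// Schedule tracks the single live protected timestamp held by a backup
// schedule. Invariant: everything at or above protectedAt is safe from
// GC, and everything below it has already been backed up.
type Schedule struct {
	protectedAt int64
}

// runIncremental exports all revisions in (protectedAt, to], then pulls
// the protected timestamp up to `to`, releasing the older revisions for
// garbage collection.
func (s *Schedule) runIncremental(to int64) {
	from := s.protectedAt
	fmt.Printf("incremental backup of (%d, %d] complete\n", from, to)
	s.protectedAt = to
}

func main() {
	// Full backup at t=1: the schedule protects t=1 before exporting,
	// so the backup can read every revision back to t=1.
	s := &Schedule{protectedAt: 1}
	fmt.Println("full backup as of t=1; protection held at", s.protectedAt)

	// Incrementals at t=5 and t=10 each read since the previous backup,
	// then advance the protection.
	s.runIncremental(5)
	s.runIncremental(10)
	fmt.Println("protection now at", s.protectedAt) // 10
}
```

The key property is that the protection is only advanced *after* a backup succeeds, so a failed or delayed backup simply leaves the old protection in place rather than risking data loss.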

This active chaining of protected timestamps means that you can run scheduled backups at a cadence independent from the GC TTL configured for the data. It resolves the primary blocker to a default GC TTL that is much shorter than 25 hours, all while keeping backups resilient.

In addition to backups, several other long-running operations in CockroachDB, such as changefeeds and index backfills during schema changes, also use protected timestamp records in a similar manner as explained above. In the future, we may add support for long-running SQL queries to also lay down PTS records to make their execution more resilient. Evidently, protected timestamps offer a powerful primitive with many use cases.

Sidebar: Remember how we decided to associate PTS records with schema objects instead of key spans or replicas? The motivation for this was that when the scheduler protects a cluster or a database to be backed up, the protection would apply not only to existing tables, but also to new tables that might be created in the future before the next scheduled backup. Without this property, a table created in between two scheduled backups could get garbage collected before the next scheduled backup has had a chance to export its data.

Sidebar: In some situations, users want to exclude a table’s row data from a backup. For example, consider a table of high-churn data that you would like to garbage collect more aggressively than the incremental backup schedule for the database or cluster holding the table would allow. For this, users can set the exclude_data_from_backup = true storage parameter in a CREATE TABLE or ALTER TABLE statement to mark a table’s row data for exclusion from backups.

## Summary

In this blog we introduced several concepts core to CockroachDB’s garbage collection machinery. We then explained how protected timestamps provide fine-grained control over the garbage collection of data. Finally, we talked about a use case of protected timestamps in backups, and how they pave the way for a future free of garbage. If you found this work interesting then you should definitely take a look at our three-part blog series called “Writing History”:

###### Writing History: How we rebuilt bulk operations to preserve a history of changes

This is part 1 of a 3-part blog series about how we’ve improved the way CockroachDB stores and modifies data in …


###### Writing History: MVCC range tombstones

This is part 3 of a 3-part blog series about how we’ve improved the way CockroachDB stores and modifies data in bulk ( …
