I periodically need to make changes to tables… mostly adding columns. Very simple with the alter table command… But my tables have up to 40 million rows now and they are growing fast… So those alter table commands take several hours… Since I’m using amazon RDS [sic], I can’t have slave servers to play with and then promote to master. So my question is if there’s a way to do this with minimal downtime? I don’t mind an operation taking hours or even days if users can still use the db…”
— Stack Overflow serverfault [^1]
The above query was posted back in 2010, but the anxiety around downtime for schema changes remains.
When we were designing CockroachDB’s schema change engine, we wanted to build it better. We wanted to provide a simple way to update a table schema (just run ALTER TABLE) and not have an application suffer any negative consequences – downtime included. We wanted schema changes to be a built-in feature of CockroachDB, requiring no additional tools, resources, or ad hoc sequencing of operations; and all this, with no effect on application read/write latency.
In this post, I’ll explain our online schema change strategy. I’ll discuss how we manage changes to schema elements like columns and indexes without forcing a downtime.
A schema change involves updating the schema and adding or deleting the table data associated with the change. Two fundamental distributed database features make this particularly tricky:
Our solution for maintaining a consistent distributed schema cache and consistent table data embraces the concurrent use of multiple versions of the schema, allowing the rollout of a new schema while the older version is still in use. It backfills (or deletes) the underlying table data without holding locks. This solution is derived from the work done by the F1 team at Google.
A schema element (which could be an index or column, although in the rest of this post we’ll focus on the case of an index) has associated data that can be deleted, written, or read by the SQL DML commands (e.g., INSERT, UPDATE, DELETE, SELECT). The strategy CockroachDB uses to rollout a new index involves granting delete, write, and read capabilities to these DML commands, one after the other in a staged manner rather than all at once.
Adding a new index follows these discrete stages explained later:
A capability is granted along with all prior granted capabilities in a new version of the schema. For correctness, a new capability must be granted only after the entire cluster is using a schema containing all prior granted capabilities. The process therefore pauses before each stage, allowing prior granted capabilities to propagate to the entire cluster before the next one can be granted.
Dropping an index runs the process in reverse:
Similarly, a capability is revoked only after the previous revocation has propagated across the entire cluster.
This capability is granted by placing the index in the DELETE_ONLY state. Armed with this capability the SQL DML commands limit themselves:
A node granted the write capability for an index at the next stage can trust the entire cluster to be using the delete capability for the index. Thus, when it receives an INSERT command and inserts an index entry for a row, another node with only the delete capability on seeing a DELETE command for the same row will correctly delete the index entry for the row. The index will avoid getting poisoned with a dangling index entry.
When a schema change drops an index, the associated index data is purged only after the successful withdrawal of the write capability across the cluster; the entire cluster has only the delete capability allowing a safe purge.
This capability is granted by placing the index in the WRITEANDDELETE_ONLY state, granting both delete and write capabilities.
The index backfill runs only after the entire cluster is capable of writing. An INSERT command received on any node while a backfill is in progress creates a new row with a legitimate index entry, and doesn’t depend on the separate backfill process to create an index entry for the row. There will be no missing index entries after the backfill is complete.
This last capability is granted by making the index active, and turns on the full use of the index by all commands.
Between each stage of the schema change process, the entire cluster is allowed to converge to the newest version of a table schema. A simple schema caching mechanism might use a 5 minute TTL, but that forces the schema change process to wait for many minutes before trusting that the latest version is the sole one active and a capability fully granted/revoked. In CockroachDB, we developed leases on schema versions to speed up cluster convergence to the latest schema version, which in turn accelerates the schema change process.
Before using a table schema for a SQL DML command, the node running the command obtains a read lease for the schema valid for a significant period of time (on the order of minutes). When an updated table schema version becomes active, it is broadcast across the cluster, alerting nodes to the new version and causing them to actively release their leases on the old version. If some nodes are unhealthy and do not actively release their leases, the rollout will wait for the leases to expire and will simply get delayed.
The schema change strategy is kept simple by having the leases follow two rules:
For a more detailed discussion on table leases, check out this RFC on our Github repo.
The schema change is guided to completion by the node executing the schema change SQL command. Since schema change operations are potentially long-running, they need to be restartable when the node performing them dies. Every node runs a schema changer goroutine capable of running any incomplete schema change. Before executing a schema change, a schema changer goroutine first acquires an exclusive write lease on the table, giving it license to be the only one guiding the schema change.
In CockroachDB, online schema changes are are easy to use, and the schema change rollout process is safe, fast, and reliable. Change is inevitable, and it’s time that you not worry about it!
We’d like to thank the folks from the F1 team at Google for publishing a similar implementation for online schema changes from which we gained much of the inspiration.[^1]: Apptree. Stack Overflow. “Modifying columns of a very large mysql table with little or no downtime.” August 26, 2010.