Publication date: February 12, 2020.
After performing certain DDL statements, the cluster may reach a state of high memory usage and eventually a general lack of availability.
After performing DDL operations that involve dropping indexes, including dropping a child table of an interleaved parent, or when restoring a backup taken shortly after such DDL operations, some or all nodes in the cluster may start seeing very high memory utilization and severely degraded query performance.
An identifying symptom is frequent log messages of the format:
with ID <job-id> does not exist, where
<job-id> is a string of
digits. While a cluster is emitting these logs, it will use more and
more memory until it generally becomes unavailable.
This condition has been eliminated in CockroachDB versions v19.1.8 and v19.2.4. All deployments are strongly recommended to upgrade as preventative measure. A cluster already affected by the symptoms is also expected to recover automatically after upgrading to these patch revisions.
The public issue is tracked as #44299.
In case a cluster has not experienced the symptoms of the issue yet, the following steps can be taken to reduce the risk of the problem occurring:
- Avoid restarting nodes while there are pending schema changes that drop tables or indexes, or before the GC ttl duration has expired after such DDL, when the delayed drop is finally processed.
- Avoid taking a backup after dropping tables or indexes and before the GC ttl duration has expired, when the delayed drop is finally processed.
- Avoid restoring a backup when it is unclear whether the backup is clear of pending schema changes or delayed drops.
- Ensure that the jobs.retention_time cluster setting remains greater than the GC ttl duration of every table.
Additionally, operators should set up close monitoring of their log
files across the entire cluster to detect the moment when the problem
starts to occur. This can be done by setting up an alert using the
job with ID \d+ does not exist with a threshold
of appearing more than 50 times per minute. If or when this alert is
triggered, the cluster must be upgraded to the recommended patch
revision to recover.
If a cluster is affected by the symptoms and the symptoms do not dissipate after an upgrade, please reach out to our support team.
All deployments running CockroachDB v19.1 up to and including v19.1.7 and v19.2 up to and including v19.2.3 are affected. Deployments using v2.1 and prior are not affected.
Questions about any technical alert can be directed to our support team.