Publication date: February 12, 2020
After performing certain DDL statements, the cluster may reach a state of high memory usage and eventually a general lack of availability.
After performing DDL operations that involve dropping indexes, including dropping a child table of an interleaved parent, or when restoring a backup taken shortly after such DDL operations, some or all nodes in the cluster may start seeing very high memory utilization and severely degraded query performance.
An identifying symptom is frequent log messages of the format:
job with ID <job-id> does not exist, where
<job-id> is a string of digits. While a cluster is emitting these logs, it will use more and more memory until it generally becomes unavailable.
This condition has been eliminated in CockroachDB versions v19.1.8 and v19.2.4. All deployments are strongly recommended to upgrade as preventative measure. A cluster already affected by the symptoms is also expected to recover automatically after upgrading to these patch revisions.
The public issue is tracked as #44299.
In case a cluster has not experienced the symptoms of the issue yet, the following steps can be taken to reduce the risk of the problem occurring:
- Avoid restarting nodes while there are pending schema changes that drop tables or indexes, or before the GC ttl duration has expired after such DDL, when the delayed drop is finally processed.
- Avoid taking a backup after dropping tables or indexes and before the GC ttl duration has expired, when the delayed drop is finally processed.
- Avoid restoring a backup when it is unclear whether the backup is clear of pending schema changes or delayed drops.
- Ensure that the jobs.retention_time cluster setting remains greater than the GC ttl duration of every table.
Additionally, operators should set up close monitoring of their log files across the entire cluster to detect the moment when the problem starts to occur. This can be done by setting up an alert using the regular expression
job with ID \d+ does not exist with a threshold of appearing more than 50 times per minute. If or when this alert is triggered, the cluster must be upgraded to the recommended patch revision to recover.
If a cluster is affected by the symptoms and the symptoms do not dissipate after an upgrade, please reach out to our support team.
All deployments running CockroachDB v19.1 up to and including v19.1.7 and v19.2 up to and including v19.2.3 are affected. Deployments using v2.1 and prior are not affected.
Questions about any technical alert can be directed to our support team.