Technical Advisory a104309

On this page Carat arrow pointing down

Publication date: February 21, 2024

Description

A rangefeed bug may cause a checkpoint to be emitted prematurely in rare cases, before all writes below the checkpoint timestamp have been emitted. This could occur if nodes are overloaded or if long-running transactions (several seconds) are involved, such that intent resolution only replicates to a follower running a rangefeed more than 10 seconds after the transaction began. If a cluster using changefeeds experiences this bug, changefeeds will omit these write events, and the following error will be logged:

icon/buttons/copy
cdc ux violation: detected timestamp ... that is less or equal to the local frontier

Across a large sample of clusters using changefeeds over a month, 3.8% of clusters had experienced this bug, and a total 0.000064% of events were omitted (one in a million).

The bug is caused by a race condition that is typically triggered by transactions that run for several seconds or by follower replicas that lag significantly behind the leaseholder, such as when a node is overloaded or a network is disrupted. The rangefeed must be running on a follower replica that applies intent resolution long after the original transaction has committed and garbage-collected its intents and transaction record on the leaseholders (at least 10 seconds after the transaction begins).

The specific conditions to trigger the bug are:

  1. An explicit or cross-range transaction commits, then asynchronously resolves all of its intents and removes its transaction record on all relevant leaseholders.

  2. A follower replica running a rangefeed replicates the transaction's intent writes, but not yet the resolution of the intents. More than 10 seconds must elapse between when the transaction begins and when the intent resolution is replicated to the follower. (The sum of transaction duration + intent resolution duration + transaction record GC + replication to follower must be greater than 10 seconds to trigger the bug.)

  3. The follower replica's closed timestamp has advanced beyond the transaction's write timestamp. More time than the closed timestamp target duration (3 seconds by default) must elapse from the transaction's write timestamp until the intent resolution is replicated to the follower, and all other writes up to the write timestamp must have been replicated to the follower.

  4. If the above conditions are satisfied, the follower attempts to push the transaction to advance its resolved timestamp and emit a checkpoint, but the follower does not find a transaction record. The follower operates as though the transaction was aborted, and allows its resolved timestamp to advance above the transaction's write timestamp, emitting a checkpoint even though the intents have not yet been emitted.

  5. When the intent resolution is finally replicated to the follower, it emits the write events with timestamps below the previous checkpoint.

  6. The changefeed processor detects these events below the previous checkpoint and discards them instead of emitting them.

All CockroachDB versions prior to and including the following are impacted by this bug:

  • v2.1.11 - v22.2.17
  • v23.1.0 - v23.1.14
  • V23.2.0

Versions prior to 22.2 are no longer eligible for maintenance support, and this bug will not be fixed in those versions.

Statement

This is resolved in CockroachDB by PR #117612, which uses a barrier command to ensure that all historical and ongoing range writes have been applied to the local replica and emitted before the resolved timestamp is advanced and a checkpoint is emitted.

The fix has been applied to maintenance releases of CockroachDB:

However, the initial fix introduced a bug that could cause a rangefeed’s resolved timestamp to stop advancing. For details, diagnostic steps, and mitigations, refer to the note about the bug. That bug is fixed in the following versions:

Users are encouraged to upgrade to a version that contains both fixes.

This public issue is tracked by Issue #104309.

Mitigation

Users are encouraged to upgrade to v22.2.18, v23.1.15, v23.2.1, or a later version that includes the fix.

The log message cdc ux violation: detected timestamp ... that is less or equal to the local frontier indicates that a cluster has been affected by this bug. One message will be logged per omitted event. The message will include the job ID of the affected changefeed (as job=...), as well as the MVCC timestamp of the transaction whose writes were omitted (detected timestamp ...). If the data is still present in a table, the following query may allow you to retrieve it:

icon/buttons/copy
SELECT * FROM table WHERE crdb_internal_mvcc_timestamp = {timestamp}

Replace {timestamp} with the timestamp expressed in nanoseconds with the logical component after the decimal point. For example, to query for the logged timestamp 1707854079.38430848,2:

  1. Multiply 1707854079 by 1 × 109 (1 billion).
  2. Add 38430848 to the result.
  3. Append the portion of the timestamp (2 in this example) after the comma to the right of the the decimal point. In this example, the converted timestamp is 1707854079038430848.2.

If data is found, it can be re-emitted in two ways:

  • By issuing an UPDATE command with the same values.
  • By restarting the changefeed with initial_scan set to yes and without a cursor timestamp. This will perform a full export of all current data across the changefeed.
  • By restarting the changefeed and providing a cursor timestamp. If a cursor timestamp is provided, an initial scan will not be performed. Instead, updates above the cursor timestamp will be emitted. The cursor timestamp can be initialized below the timestamp of omitted writes (refer to the reference documentation for CREATE CHANGEFEED). However, specifying a cursor timestamp farther in the past than the gc.ttlseconds replication zone variable, which defaults to 14400 seconds (4 hours), results in an error.

Note:
This fix introduces a bug that could cause a rangefeed’s resolved timestamp to stop advancing. The corresponding changefeed will appear to be stalled in RUNNING state in certain conditions: If a rangefeed is running on a follower on a recently-merged range, and the rangefeed encounters an aborted transaction, then the resolved timestamp may stall. Events such as row updates will still be emitted as normal, but new checkpoints will not be emitted.

That bug is fixed in the following versions:

Users are encouraged to upgrade to a version that contains both fixes. If you are not yet able to upgrade to such a version, refer to the following information to help diagnose and work around rangefeed or changefeed stalls caused by the bug introduced by this advisory's initial fix.

After upgrading to a version with the fix for this advisory, a104309, your cluster may encounter a stalled rangefeed or changefeed even if it was not impacted by the original rangefeed bug disclosed in this advisory. Monitor your cluster using the following mechanisms. For more details, refer to Monitor and debug changefeeds.

  • Monitor cluster logs for messages like:

    icon/buttons/copy
    pushing old intents failed: range barrier failed, range split
    
  • Monitor the following cluster metrics:

    • changefeed.max_behind_nanos
    • changefeed.checkpoint_progress
  • Monitor the highwatermark column in the output of the SHOW CHANGEFEED JOBS SQL command. If it stops advancing, this indicates a stall.

If your cluster experiences a stalled rangefeed or changefeed after upgrading, you can work around the issue by pausing and resuming the stalled changefeed’s job. This may cause temporary disruption to the changefeed. To prevent recurrence, you can temporarily disable range merges by setting kv.range_merge.queue.enabled to false only until a fix is available.

As an alternative to avoid disruption to the changefeed, you can temporarily disable kv.rangefeed.push_txns.barrier.enabled to disable the fix to this advisory, a104309, until a fix to the stalled rangefeed bug is available.

This issue is tracked by Issue #119536.

Impact

Changefeeds may incorrectly omit events in rare cases and log an error.

Questions about any technical alert can be directed to our support team.


Yes No
On this page

Yes No