Transaction Retry Error Reference

On this page Carat arrow pointing down

When a transaction is unable to complete due to contention with another concurrent or recent transaction attempting to write to the same data, CockroachDB will automatically attempt to retry the failed transaction without involving the client (i.e., silently). If the automatic retry is not possible or fails, a transaction retry error is emitted to the client.

Transaction retry errors fall into two categories:

  • Serialization Errors indicate that a transaction failed because it could not be placed into a serializable ordering among all of the currently-executing transactions, as required under the default SERIALIZABLE isolation level. These errors are generally addressed with client-side intervention, where the client initiates a restart of the transaction, and adjusts application logic and tunes queries for greater performance.

    Note:
    Under READ COMMITTED isolation, serializable errors will usually not be returned to the client. If a READ COMMITTED transaction conflicts with another transaction such that a serialization error (RETRY_WRITE_TOO_OLD or ReadWithinUncertaintyInterval) is generated, the transaction attempts to resolve the error by internally retrying the latest conflicting statement at a new read timestamp.

  • Internal State Errors indicate that the cluster itself is experiencing an issue, such as being overloaded, which prevents the transaction from completing. These errors generally require both cluster-side and client-side intervention, where an operator addresses an issue with the cluster before the client then initiates a restart of the transaction.

All transaction retry errors use the SQLSTATE error code 40001, and emit error messages with the string restart transaction. Further, each error includes a specific error code to assist with targeted troubleshooting.

When experiencing transaction retry errors, you should follow the guidance under Actions to take, and then consult the reference for your specific transaction retry error for guidance specific to the error message encountered.

Overview

At the default SERIALIZABLE isolation level, CockroachDB always attempts to find a serializable ordering among all of the currently-executing transactions.

Whenever possible, CockroachDB will auto-retry a transaction internally without notifying the client. CockroachDB will only send a serialization error to the client when it cannot resolve the error automatically without client-side intervention.

The main reason why CockroachDB cannot auto-retry every serialization error without sending an error to the client is that the SQL language is "conversational" by design. The client can send arbitrary statements to the server during a transaction, receive some results, and then decide to issue other arbitrary statements inside the same transaction based on the server's response.

Actions to take

In most cases, the correct actions to take when encountering transaction retry errors are:

  1. Update your application to support client-side retry handling when transaction retry errors are encountered. Follow the guidance for the specific error type.

  2. Take steps to minimize transaction retry errors in the first place. This means reducing transaction contention overall, and increasing the likelihood that CockroachDB can automatically retry a failed transaction.

Client-side retry handling

Your application should include client-side retry handling when the statements are sent individually, such as:

> BEGIN;

> UPDATE products SET inventory = 0 WHERE sku = '8675309';

> INSERT INTO orders (customer, status) VALUES (1, 'new');

> COMMIT;

To indicate that a transaction must be retried, CockroachDB signals an error with the SQLSTATE error code 40001 (serialization error) and an error message that begins with the string "restart transaction".

To handle these types of errors, you have the following options:

Client-side retry handling example

For a conceptual example of application-defined retry logic, and testing that logic against your application's needs, see the client-side retry handling example.

Minimize transaction retry errors

In addition to the steps described in Client-side retry handling, which detail how to configure your application to restart a failed transaction, there are also a number of changes you can make to your application logic to reduce the number of transaction retry errors that reach the client application under SERIALIZABLE isolation.

Reduce failed transactions caused by timestamp pushes or read invalidation:

  • Limit the number of affected rows by following optimizing queries (e.g., avoiding full scans, creating secondary indexes, etc.). Not only will transactions run faster, lock fewer rows, and hold locks for a shorter duration, but the chances of read invalidation when the transaction's timestamp is pushed, due to a conflicting write, are decreased because of a smaller read set (i.e., a smaller number of rows read).

  • Break down larger transactions (e.g., bulk deletes) into smaller ones to have transactions hold locks for a shorter duration. For example, use common table expressions to group multiple clauses together in a single SQL statement. This will also decrease the likelihood of pushed timestamps. For instance, as the size of writes (number of rows written) decreases, the chances of the transaction's timestamp getting bumped by concurrent reads decreases.

  • Use SELECT FOR UPDATE to aggressively lock rows that will later be updated in the transaction. Updates must operate on the most recent version of a row, so a concurrent write to the row will cause a retry error (RETRY_WRITE_TOO_OLD). Locking early in the transaction forces concurrent writers to block until the transaction is finished, which prevents the retry error. Note that this locks the rows for the duration of the transaction; whether this is tenable will depend on your workload. For more information, see When and why to use SELECT FOR UPDATE in CockroachDB.

  • Use historical reads (SELECT ... AS OF SYSTEM TIME), preferably bounded staleness reads or exact staleness with follower reads when possible to reduce conflicts with other writes. This reduces the likelihood of RETRY_SERIALIZABLE errors as fewer writes will happen at the historical timestamp. More specifically, writes' timestamps are less likely to be pushed by historical reads as they would when the read has a higher priority level. Note that if the AS OF SYSTEM TIME value is below the closed timestamp, the read cannot be invalidated.

  • When replacing values in a row, use UPSERT and specify values for all columns in the inserted rows. This will usually have the best performance under contention, compared to combinations of SELECT, INSERT, and UPDATE.

  • If applicable to your workload, assign column families and separate columns that are frequently read and written into separate columns. Transactions will operate on disjoint column families and reduce the likelihood of conflicts.

  • As a last resort, consider adjusting the closed timestamp interval using the kv.closed_timestamp.target_duration cluster setting to reduce the likelihood of long-running write transactions having their timestamps pushed. This setting should be carefully adjusted if no other mitigations are available because there can be downstream implications (e.g., historical reads, change data capture feeds, statistics collection, handling zone configurations, etc.). For example, a transaction A is forced to refresh (i.e., change its timestamp) due to hitting the maximum closed timestamp interval (closed timestamps enable Follower Reads and Change Data Capture (CDC)). This can happen when transaction A is a long-running transaction, and there is a write by another transaction to data that A has already read.

Note:

If you increase the kv.closed_timestamp.target_duration setting, it means that you are increasing the amount of time by which the data available in Follower Reads and CDC changefeeds lags behind the current state of the cluster. In other words, there is a trade-off here: if you absolutely must execute long-running transactions that execute concurrently with other transactions that are writing to the same data, you may have to settle for longer delays on Follower Reads and/or CDC to avoid frequent serialization errors. The anomaly that would be exhibited if these transactions were not retried is called write skew.

Increase the chance that CockroachDB can automatically retry a failed transaction:

Transaction retry error reference

Note that your application's retry logic does not need to distinguish between the different types of serialization errors. They are listed here for reference during advanced troubleshooting.

Each transaction retry error listed includes an example error as it would appear from the context of the client, a description of the circumstances that cause that error, and specific guidance for addressing the error.

RETRY_WRITE_TOO_OLD

TransactionRetryWithProtoRefreshError: ... RETRY_WRITE_TOO_OLD ...

Error type: Serialization error

Description:

The RETRY_WRITE_TOO_OLD error occurs when a transaction A tries to write to a row R, but another transaction B that was supposed to be serialized after A (i.e., had been assigned a higher timestamp), has already written to that row R, and has already committed. This is a common error when you have too much contention in your workload.

Action:

  1. Retry transaction A as described in client-side retry handling.
  2. Adjust your application logic as described in minimize transaction retry errors. In particular, try to:

    1. Send all of the statements in your transaction in a single batch.
    2. Use SELECT FOR UPDATE to aggressively lock rows that will later be updated in the transaction.

See Minimize transaction retry errors for the full list of recommended remediations.

RETRY_SERIALIZABLE

TransactionRetryWithProtoRefreshError: ... RETRY_SERIALIZABLE ...

New in v23.2: The error message for RETRY_SERIALIZABLE contains additional information about the transaction conflict which led to the error, as shown below. This error message can also be viewed in the DB Console by navigating to Insights PageWorkload InsightsTransaction Executions and clicking on the transaction ID to see the Failed Execution insight.

restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE  - failed preemptive refresh due to conflicting locks on /Table/106/1/918951292305080321/0 [reason=wait_policy] - conflicting txn: meta={id=1b2bf263 key=/Table/106/1/918951292305080321/0 iso=Serializable pri=0.00065863 epo=0 ts=1700512205.521833000,2 min=1700512148.761403000,0 seq=1}): "sql txn" meta={id=07d42834 key=/Table/106/1/918951292305211393/0 iso=Serializable pri=0.01253025 epo=0 ts=1700512229.378453000,2 min=1700512130.342117000,0 seq=2} lock=true stat=PENDING rts=1700512130.342117000,0 wto=false gul=1700512130.842117000,0
SQLSTATE: 40001
HINT: See: https://www.cockroachlabs.com/docs/v23.2/transaction-retry-error-reference.html#retry_serializable

Error type: Serialization error

Description:

At a high level, the RETRY_SERIALIZABLE error occurs when a transaction's timestamp is moved forward, but the transaction performed reads at the old timestamp that are no longer valid at its new timestamp. More specifically, the RETRY_SERIALIZABLE error occurs in the following three cases:

  1. When a transaction A has its timestamp moved forward (also known as A being "pushed") as CockroachDB attempts to find a serializable transaction ordering. Specifically, transaction A tried to write a key that transaction B had already read, and B was supposed to be serialized after A (i.e., B had a higher timestamp than A). CockroachDB will try to serialize A after B by changing A's timestamp, but it cannot do that when another transaction has subsequently written to some of the keys that A has read and returned to the client. When that happens, the RETRY_SERIALIZATION error is signalled. For more information about how timestamp pushes work in our transaction model, see the architecture docs on the transaction layer's timestamp cache.

  2. When a high-priority transaction A does a read that runs into a write intent from another lower-priority transaction B, and some other transaction C writes to a key that B has already read. Transaction B will get this error when it tries to commit, because A has already read some of the data touched by B and returned results to the client, and C has written data previously read by B.

  3. When a transaction A is forced to refresh (i.e., change its timestamp) due to hitting the maximum closed timestamp interval (closed timestamps enable Follower Reads and Change Data Capture (CDC)). This can happen when transaction A is a long-running transaction, and there is a write by another transaction to data that A has already read. Unfortunately, there is no indication from this error code that a too-low closed timestamp setting is the issue. Therefore, you may need to rule out cases 1 and 2.

Failed preemptive refresh

In the three above cases, CockroachDB will try to validate whether the read-set of the transaction that had its timestamp (timestamp1) pushed is still valid at the new timestamp (timestamp3) at commit time. This mechanism is called "performing a read refresh". If the read-set is still valid, the transaction can commit. If it is not valid, the transaction will get a RETRY_SERIALIZABLE - failed preemptive refresh error. The refresh can fail for two reasons:

  1. There is a committed value on a key that was read by the transaction at timestamp2 (where timestamp2 occurs between timestamp1 and timestamp3). The error message will contain due to encountered recently written committed value. CockroachDB does not have any information about which conflicting transaction wrote to this key.
  2. There is an intent on a key that was read by the transaction at timestamp2 (where timestamp2 occurs between timestamp1 and timestamp3). The error message will contain due to conflicting locks. CockroachDB does have information about the conflicting transaction to which the intent belongs. The information about the conflicting transaction can be seen on the DB Console Insights page.

Action:

  1. Retry transaction A as described in client-side retry handling.
  2. Adjust your application logic as described in minimize transaction retry errors. In particular, try to:
    1. Send all of the statements in your transaction in a single batch.
    2. Use historical reads with SELECT ... AS OF SYSTEM TIME.
    3. Use SELECT FOR UPDATE to aggressively lock rows for the keys that were read and could not be refreshed.

See Minimize transaction retry errors for the full list of recommended remediations.

RETRY_ASYNC_WRITE_FAILURE

TransactionRetryWithProtoRefreshError: ... RETRY_ASYNC_WRITE_FAILURE ...

Error type: Internal state error

Description:

The RETRY_ASYNC_WRITE_FAILURE error occurs when some kind of problem with your cluster's operation occurs at the moment of a previous write in the transaction, causing CockroachDB to fail to replicate one of the transaction's writes. This can happen if a lease transfer occurs while the transaction is executing, or less commonly if you have a network partition that cuts off access to some nodes in your cluster.

Action:

  1. Retry the transaction as described in client-side retry handling. This is worth doing because the problem with the cluster is likely to be transient.
  2. Investigate the problems with your cluster. For cluster troubleshooting information, see Troubleshoot Cluster Setup.

See Minimize transaction retry errors for the full list of recommended remediations.

ReadWithinUncertaintyIntervalError

TransactionRetryWithProtoRefreshError: ReadWithinUncertaintyIntervalError:
        read at time 1591009232.376925064,0 encountered previous write with future timestamp 1591009232.493830170,0 within uncertainty interval `t <= 1591009232.587671686,0`;
        observed timestamps: [{1 1591009232.587671686,0} {5 1591009232.376925064,0}]

Error type: Serialization error

Description:

The ReadWithinUncertaintyIntervalError can occur when two transactions which start on different gateway nodes attempt to operate on the same data at close to the same time, and one of the operations is a write. The uncertainty comes from the fact that we cannot tell which one started first - the clocks on the two gateway nodes may not be perfectly in sync.

For example, if the clock on node A is ahead of the clock on node B, a transaction started on node A may be able to commit a write with a timestamp that is still in the "future" from the perspective of node B. A later transaction that starts on node B should be able to see the earlier write from node A, even if B's clock has not caught up to A. The "read within uncertainty interval" occurs if we discover this situation in the middle of a transaction, when it is too late for the database to handle it automatically. When node B's transaction retries, it will unambiguously occur after the transaction from node A.

Note:

This behavior is non-deterministic: it depends on which node is the leaseholder of the underlying data range. It’s generally a sign of contention. Uncertainty errors are always possible with near-realtime reads under contention.

Action:

The solution is to do one of the following:

  1. Be prepared to retry on uncertainty (and other) errors, as described in client-side retry handling.
  2. Adjust your application logic as described in minimize transaction retry errors. In particular, try to:
    1. Send all of the statements in your transaction in a single batch.
    2. Use historical reads with SELECT ... AS OF SYSTEM TIME.
  3. If you trust your clocks, you can try lowering the --max-offset option to cockroach start, which provides an upper limit on how long a transaction can continue to restart due to uncertainty.
Note:

Uncertainty errors are a sign of transaction conflict. For more information about transaction conflicts, see Transaction conflicts.

See Minimize transaction retry errors for the full list of recommended remediations.

RETRY_COMMIT_DEADLINE_EXCEEDED

TransactionRetryWithProtoRefreshError: TransactionRetryError: transaction deadline exceeded ...

Error type: Internal state error

Description:

The RETRY_COMMIT_DEADLINE_EXCEEDED error means that the transaction timed out due to being pushed by other concurrent transactions. This error is most likely to happen to long-running transactions. The conditions that trigger this error are very similar to the conditions that lead to a RETRY_SERIALIZABLE error, except that a transaction that hits this error got pushed for several minutes, but did not hit any of the conditions that trigger a RETRY_SERIALIZABLE error. In other words, the conditions that trigger this error are a subset of those that trigger RETRY_SERIALIZABLE, and that this transaction ran for too long (several minutes).

Note:

Read-only transactions do not get pushed, so they do not run into this error.

This error occurs in the cases described below.

  1. When a transaction A has its timestamp moved forward (also known as A being "pushed") as CockroachDB attempts to find a serializable transaction ordering. Specifically, transaction A tried to write a key that transaction B had already read. B was supposed to be serialized after A (i.e., B had a higher timestamp than A). CockroachDB will try to serialize A after B by changing A's timestamp.

  2. When a high-priority transaction A does a read that runs into a write intent from another lower-priority transaction B. Transaction B may get this error when it tries to commit, because A has already read some of the data touched by B and returned results to the client.

  3. When a transaction A is forced to refresh (change its timestamp) due to hitting the maximum closed timestamp interval (closed timestamps enable Follower Reads and Change Data Capture (CDC)). This can happen when transaction A is a long-running transaction, and there is a write by another transaction to data that A has already read.

Action:

  1. The RETRY_COMMIT_DEADLINE_EXCEEDED error is one case where the standard advice to add a retry loop to your application may not be advisable. A transaction that runs for long enough to get pushed beyond its deadline is quite likely to fail again on retry for the same reasons. Therefore, the best thing to do in this case is to shrink the running time of your transactions so they complete more quickly and do not hit the deadline.
  2. If you encounter case 3 above, you can increase the kv.closed_timestamp.target_duration setting to a higher value. Unfortunately, there is no indication from this error code that a too-low closed timestamp setting is the issue. Therefore, you may need to rule out cases 1 and 2 (or experiment with increasing the closed timestamp interval, if that is possible for your application - see the note below).
Note:

If you increase the kv.closed_timestamp.target_duration setting, it means that you are increasing the amount of time by which the data available in Follower Reads and CDC changefeeds lags behind the current state of the cluster. In other words, there is a trade-off here: if you absolutely must execute long-running transactions that execute concurrently with other transactions that are writing to the same data, you may have to settle for longer delays on Follower Reads and/or CDC to avoid frequent serialization errors. The anomaly that would be exhibited if these transactions were not retried is called write skew.

See Minimize transaction retry errors for the full list of recommended remediations.

ABORT_REASON_ABORTED_RECORD_FOUND

TransactionRetryWithProtoRefreshError:TransactionAbortedError(ABORT_REASON_ABORTED_RECORD_FOUND) ...

Error type: Internal state error

Description:

The ABORT_REASON_ABORTED_RECORD_FOUND error means that the client application is trying to use a transaction that has been aborted. This happens in one of the following cases:

  • Write-write conflict: Another high-priority transaction B encountered a write intent by our transaction A, and tried to push A's timestamp.
  • Cluster overload: B thinks that A's transaction coordinator node is dead, because the coordinator node hasn't heartbeated the transaction record for a few seconds.
  • Deadlock: Some transaction B is trying to acquire conflicting locks in reverse order from transaction A.

Action:

If you are encountering deadlocks:

  • Avoid producing deadlocks in your application by making sure that transactions acquire locks in the same order.

If you are using only default transaction priorities:

  • This error means your cluster has problems. You are likely overloading it. Investigate the source of the overload, and do something about it. For more information, see Node liveness issues.

If you are using high- or low-priority transactions:

  1. Retry the transaction as described in client-side retry handling
  2. Adjust your application logic as described in minimize transaction retry errors.

See Minimize transaction retry errors for the full list of recommended remediations.

ABORT_REASON_CLIENT_REJECT

TransactionRetryWithProtoRefreshError:TransactionAbortedError(ABORT_REASON_CLIENT_REJECT) ...

Error type: Internal state error

Description:

The ABORT_REASON_CLIENT_REJECT error is caused by the same conditions as the ABORT_REASON_ABORTED_RECORD_FOUND, and requires the same actions. The errors are fundamentally the same, except that they are discovered at different points in the process.

See Minimize transaction retry errors for the full list of recommended remediations.

ABORT_REASON_PUSHER_ABORTED

TransactionRetryWithProtoRefreshError:TransactionAbortedError(ABORT_REASON_PUSHER_ABORTED) ...

Error type: Internal state error

Description:

The ABORT_REASON_PUSHER_ABORTED error is caused by the same conditions as the ABORT_REASON_ABORTED_RECORD_FOUND, and requires the same actions. The errors are fundamentally the same, except that they are discovered at different points in the process.

See Minimize transaction retry errors for the full list of recommended remediations.

ABORT_REASON_ABORT_SPAN

TransactionRetryWithProtoRefreshError:TransactionAbortedError(ABORT_REASON_ABORT_SPAN) ...

Error type: Internal state error

Description:

The ABORT_REASON_ABORT_SPAN error is caused by the same conditions as the ABORT_REASON_ABORTED_RECORD_FOUND, and requires the same actions. The errors are fundamentally the same, except that they are discovered at different points in the process.

See Minimize transaction retry errors for the full list of recommended remediations.

ABORT_REASON_NEW_LEASE_PREVENTS_TXN

TransactionRetryWithProtoRefreshError:TransactionAbortedError(ABORT_REASON_NEW_LEASE_PREVENTS_TXN) ...

Error type: Internal state error

Description:

The ABORT_REASON_NEW_LEASE_PREVENTS_TXN error occurs because the timestamp cache will not allow transaction A to create a transaction record. A new lease wipes the timestamp cache, so this could mean the leaseholder was moved and the duration of transaction A was unlucky enough to happen across a lease acquisition. In other words, leaseholders got shuffled out from underneath transaction A (due to no fault of the client application or schema design), and now it has to be retried.

Action:

Retry transaction A as described in client-side retry handling.

ABORT_REASON_TIMESTAMP_CACHE_REJECTED

TransactionRetryWithProtoRefreshError:TransactionAbortedError(ABORT_REASON_TIMESTAMP_CACHE_REJECTED) ...

Error type: Internal state error

Description:

The ABORT_REASON_TIMESTAMP_CACHE_REJECTED error occurs when the timestamp cache will not allow transaction A to create a transaction record. This can happen due to a range merge happening in the background, or because the timestamp cache is an in-memory cache, and has outgrown its memory limit (about 64 MB).

Action:

Retry transaction A as described in client-side retry handling.

injected by inject_retry_errors_enabled session variable

 TransactionRetryWithProtoRefreshError: injected by `inject_retry_errors_enabled` session variable

Error type: Internal state error

Description:

When the inject_retry_errors_enabled session variable is set to true, any statement (with the exception of SET statements) executed in the session inside of an explicit transaction will return this error.

For more details, see Test transaction retry logic.

Action:

To turn off error injection, set the inject_retry_errors_enabled session variable to false.

See also


Yes No
On this page

Yes No