Common Issues to Monitor

On this page Carat arrow pointing down

This page summarizes how to configure and monitor your cluster to prevent issues commonly encountered with:

CPU

Note:

In our sizing and production guidance, 1 vCPU is considered equivalent to 1 core in the underlying hardware platform.

Issues with CPU most commonly arise when there is insufficient CPU to support the scale of the workload.

CPU planning

Provision enough CPU to support your operational and workload concurrency requirements:

Category Recommendations
CPU
  • Each node should have at least 4 vCPUs.
  • Use larger VMs to handle temporary workload spikes and processing hot spots.
  • Use connection pooling to manage workload concurrency. The total number of workload connections across all connection pools should not exceed 4 times the number of vCPUs in the cluster by a large amount. For more details, see Sizing connection pools.
  • See additional CPU recommendations in the Production Checklist.

CPU monitoring

Monitor possible signs of CPU starvation:

Parameter Description
Service latency The time between when the cluster receives a query and finishes executing the query.
CPU usage The CPU consumption by the CockroachDB node process.
Workload concurrency The number of SQL statements being executed on the cluster at the same time.
LSM health The health of the persistent stores.
Node health The operational status of the nodes.

Service latency

Degradation in SQL response time is the most common symptom of CPU starvation. It can also be a symptom of insufficient disk I/O.

If latencies are consistently high, check for:

CPU usage

Compaction on the storage layer uses CPU to run concurrent worker threads.

  • The CPU Percent graph on the Hardware and Overload dashboards shows the CPU consumption by the CockroachDB process, and excludes other processes on the node.

    Expected values for a healthy cluster: CPU utilized by CockroachDB should not persistently exceed 80%. Because this metric does not reflect system CPU usage, values above 80% suggest that actual CPU utilization is nearing 100%.

If CPU usage is high, check whether workload concurrency is exceeding CPU resources.

Workload concurrency

The number of concurrent active SQL statements should be proportionate to your provisioned CPU.

  • The SQL Statements graph on the Overview and SQL dashboards shows the 10-second average of SELECT, UPDATE, INSERT, and DELETE statements being executed per second on the cluster or node. The latest QPS value for the cluster is also displayed with the Queries per second counter on the Metrics page.

    Expected values for a healthy cluster: At any time, the total number of actively executing SQL statements should not exceed 4 times the number of vCPUs in the cluster. For more details, see Sizing connection pools.

If workload concurrency exceeds CPU resources, you will observe:

Tip:

To prevent issues with workload concurrency, provision sufficient CPU and use connection pooling for the workload.

LSM health

Issues at the storage layer, including an inverted LSM and high read amplification, can be observed when compaction falls behind due to insufficient CPU or excessively high recovery and rebalance rates.

  • The LSM L0 Health graph on the Overload dashboard shows the health of the persistent stores, which are implemented as log-structured merge (LSM) trees. Level 0 is the highest level of the LSM tree and consists of files containing the latest data written to the Pebble storage engine. For more information about LSM levels and how LSMs work, see Log-structured Merge-trees.

    Expected values for a healthy cluster: The number of L0 files should not be in the high thousands. High values indicate heavy write load that is causing accumulation of files in level 0. These files are not being compacted quickly enough to lower levels, resulting in a misshapen LSM.

    Note:

    An unhealthy LSM can be caused by other factors, including under-provisioned storage. To correlate this symptom with CPU starvation, check for high CPU usage and excessive workload concurrency.

  • The Read Amplification graph on the Storage Dashboard shows the average number of disk reads per logical SQL statement, also known as the read amplification factor.

    Expected values for a healthy cluster: Read amplification factor should be in the single digits. A value exceeding 50 for 1 hour strongly suggests that the LSM tree has an unhealthy shape.

  • The STORAGE logging channel indicates an unhealthy LSM with the following:

    • Frequent compaction status messages.
    • High-read-amplification warnings, e.g., sstables (read amplification = 54).
Tip:

If compaction has fallen behind and caused an inverted LSM, throttle your workload concurrency to allow compaction to catch up and restore a healthy LSM shape. The total number of workload connections across all connection pools should not exceed 4 times the number of vCPUs in the cluster by a large amount. If a node is severely impacted, you can start a new node and then decommission the problematic node.

Node health

If issues at the storage layer remain unresolved, affected nodes will miss their liveness heartbeats, causing the cluster to lose nodes and eventually become unresponsive.

  • The Node status on the Cluster Overview page indicates whether nodes are online (LIVE) or have crashed (SUSPECT or DEAD).

  • The /health endpoint of the Cluster API returns a 500 error when a node is unhealthy.

  • A Prometheus alert can notify when a node has been down for 15 minutes or more.

If nodes have shut down, this can also be caused by insufficient storage capacity.

Note:

If the cluster becomes unavailable, the DB Console and Cluster API will also become unavailable. You can continue to monitor the cluster via the Prometheus endpoint and logs.

Memory

CockroachDB is resilient to node crashes. However, frequent node restarts caused by out-of-memory (OOM) crashes can impact cluster stability and performance.

Memory planning

Provision enough memory and allocate an appropriate portion for data caching:

Category Recommendations
Memory
  • Provision at least 4 GiB of RAM per vCPU.
  • For production deployments, set --cache to 25% or higher. Avoid setting --cache and --max-sql-memory to a combined value of more than 75% of a machine's total RAM. Doing so increases the risk of memory-related failures. For more details, see the Production Checklist.
  • Disable Linux memory swapping. Over-allocating memory on production machines can lead to unexpected performance issues when pages have to be read back into memory.
  • See additional memory recommendations in the Production Checklist.

Memory monitoring

Monitor memory usage and node behavior for OOM errors:

Metric or event Description
Node process restarts Nodes restarting after crashes.
SQL memory usage The memory allocated to the SQL layer.
Database memory usage The memory in use by CockroachDB processes.

Node process restarts

CockroachDB attempts to restart nodes after they crash. Nodes that frequently restart following an abrupt process exit may point to an underlying memory issue.

  • The Node status on the Cluster Overview page indicates whether nodes are online (LIVE) or have crashed (SUSPECT or DEAD).

  • When deploying on Kubernetes, the kubectl get pods output contains a RESTARTS column that tracks the number of restarts for each CockroachDB pod.

  • The OPS logging channel will record a node_restart event whenever a node rejoins the cluster after being offline.

  • A Prometheus alert can notify when a node has restarted more than once in the last 10 minutes.

Verify OOM errors

If you observe nodes frequently restarting, confirm that the crashes are caused by OOM errors:

  • Monitor dmesg to determine if a node crashed because it ran out of memory:

    icon/buttons/copy
    $ sudo dmesg | grep -iC 3 "cockroach"
    

    The following output indicates the node crashed due to insufficient memory:

    $ host kernel: Out of Memory: Killed process <process_id> (cockroach).
    
  • When deploying on Kubernetes, run kubectl logs {pod-name} and look for OOM errors in the log messages.

Tip:

To prevent OOM crashes, provision sufficient memory. If all CockroachDB machines are provisioned and configured correctly, either run the CockroachDB process on another node with sufficient memory, or reduce the memory allocated to CockroachDB.

If you confirm that nodes are crashing due to OOM errors, also check whether SQL queries may be responsible.

SQL memory usage

An untuned SQL query can consume significant resources and impact the performance of other workloads.

  • The SQL Memory graph on the SQL dashboard shows the current amount of memory in KiB allocated to the SQL layer.

    Expected values for a healthy cluster: This value should not exceed the --max-sql-memory size. A healthy threshold is 75% of allocated --max-sql-memory.

  • The "active query dump", enabled by default with the diagnostics.active_query_dumps.enabled cluster setting, is a record of anonymized active queries that is written to disk when a node is detected to be under memory pressure.

    You can use the active query dump to correlate specific queries to OOM crashes. Active query dumps have the filename activequeryprof.{date-and-time}.csv and are found in the heap_profiler directory in the configured logging directory. They are also included when running cockroach debug zip.

  • A SHOW STATEMENTS statement can identify long-running queries on the cluster that may be consuming excessive memory.

  • A memory budget exceeded error in the logs indicates that --max-sql-memory, the memory allocated to the SQL layer, was exceeded by the operation referenced in the error. For guidance on resolving this issue, see Common Errors.

Tip:

If you find queries that are consuming too much memory, cancel the queries to free up memory usage. For information on optimizing query performance, see SQL Performance Best Practices.

Database memory usage

CockroachDB memory usage includes both accounted memory, such as the amount allocated to --cache and --max-sql-memory; and unaccounted memory, such as uncollected Go garbage and process overhead.

  • The Memory Usage graph on the Runtime dashboard shows the total memory in use by CockroachDB processes. The RSS (resident set size) metric represents actual CockroachDB memory usage from the OS/Linux/pod point of view. The Go and CGo metrics represent memory allocation and total usage from a CockroachDB point of view.

    Expected values for a healthy cluster: RSS minus Go Total and CGo Total should not exceed 100 MiB. Go Allocated should not exceed a few hundred MiB. CGo Allocated should not exceed the --cache size.

For more context on acceptable memory usage, see Suspected memory leak.

Storage and disk I/O

The cluster will underperform if storage is not provisioned or configured correctly. This can lead to further issues such as disk stalls and node shutdown.

Storage and disk planning

Provision enough storage capacity for CockroachDB data, and configure your volumes to maximize disk I/O:

Category Recommendations
Storage
  • Provision volumes with 320 GiB per vCPU .
  • Use dedicated volumes for the CockroachDB store. Do not share the store volume with any other I/O activity.
  • Store CockroachDB log files in a separate volume from the main data store so that logging is not impacted by I/O throttling.
  • See additional storage recommendations in the Production Checklist.
Disk I/O
  • Disks must be able to achieve 500 IOPS and 30 MB/s per vCPU .
  • Do not use LVM in the I/O path. Dynamically resizing CockroachDB store volumes can result in significant performance degradation. Using LVM snapshots in lieu of CockroachDB backup and restore is also not supported.
  • See additional disk I/O recommendations in the Production Checklist.
|

Storage and disk monitoring

Monitor storage capacity and disk performance:

Metric or event Description
Storage capacity The available and used disk capacity in the CockroachDB store.
Disk IOPS The I/O requests per second.
Node heartbeat latency The time between node liveness heartbeats.
Command commit latency The speed at which Raft commands are being committed by nodes in the cluster.

Storage capacity

CockroachDB requires disk space in order to accept writes and report node liveness. When a node runs out of disk space, it shuts down and cannot be restarted until space is freed up.

  • The Capacity graph on the Overview and Storage dashboards shows the available and used disk capacity in the CockroachDB store.

    Expected values for a healthy cluster: Used capacity should not persistently exceed 80% of the total capacity.

  • A Prometheus alert can notify when a node has less than 15% of free space remaining.

Tip:

Ensure that you provision sufficient storage. If storage is correctly provisioned and is running low, CockroachDB automatically creates an emergency ballast file that can free up space. For details, see Disks filling up.

Disk IOPS

Insufficient disk I/O can cause poor SQL performance and potentially disk stalls.

  • The Disk Ops In Progress graph on the Hardware dashboard shows the number of disk reads and writes in queue.

    Expected values for a healthy cluster: This value should be 0 or single-digit milliseconds for short periods of time. If the values persist in double digits, you may have an I/O bottleneck.

  • The Linux tool iostat (part of sysstat) can be used to monitor IOPS. In the device status output, avgqu-sz corresponds to the Disk Ops In Progress metric. If service times persist in double digits on any node, this means that your storage device is saturated and is likely under-provisioned or misconfigured.

Tip:

Ensure that you properly configure storage to prevent I/O bottlenecks. Afterward, if service times consistently exceed 1-5 ms, you can add more devices or expand the cluster to reduce the disk latency.

With insufficient disk I/O, you may also see:

Node heartbeat latency

Because each node needs to update a liveness record on disk, maxing out disk bandwidth can cause liveness heartbeats to be missed.

Command commit latency

  • The Command Commit Latency: 50th percentile and Command Commit Latency: 99th percentile graphs on the Storage dashboard show how quickly Raft commands are being committed by nodes in the cluster. This is a good signal of I/O load.

    Expected values for a healthy cluster: On SSDs (strongly recommended), this should be between 1 and 100 milliseconds. On HDDs, this should be no more than 1 second.

See also


Yes No
On this page

Yes No