This feature is in preview and subject to change. To share feedback and/or issues, contact Support.
Active Session History (ASH) is a time-series sampling-based observability feature that helps you troubleshoot workload performance issues by capturing what work was actively executing on your cluster at specific points in time. Unlike traditional statement statistics that aggregate data over time, ASH provides point-in-time snapshots of active execution, making it easier to diagnose transient performance problems and understand resource usage patterns.
ASH is accessible via CockroachDB SQL and is disabled by default. To enable ASH, refer to Enable Active Session History.
How ASH sampling works
ASH captures point-in-time snapshots of each node's active work by sampling cluster activity at regular intervals (determined by the obs.ash.sample_interval cluster setting). At each sample point, ASH examines all goroutines that are actively executing or waiting, and records what each one is doing. Each sample captures details like the workload that the goroutine belongs to (statement fingerprint, job ID, or system task) and the activity the goroutine is occupied with. These samples fill an in-memory ring buffer, whose size is determined by the obs.ash.buffer_size cluster setting. When the buffer fills, the oldest samples are overwritten. Samples are not persisted to disk and are lost on node restart.
A user can query the ASH data for the specific node to which their SQL shell is connected (the gateway node) or across the whole cluster.
Because ASH is sampling-based rather than event-based, the sample count for a particular activity is proportional to how much time was spent on that activity. For example, if a query appears in 45 out of 60 sample points over one minute, it was actively consuming resources for approximately 45 seconds of that minute. This approach provides an accurate picture of resource usage patterns over time, but it means that short-lived operations completed between sampling points may not be captured. To troubleshoot very brief operations, you may need to reduce the obs.ash.sample_interval cluster setting, or use statement diagnostics. However, use caution when reducing the sampling interval, as this will cause the buffer to fill up quickly.
ASH does not provide exact resource accounting per query or job, nor does it provide the exact timing of individual executions. Its sampling-based approach instead provides a statistically reliable view of the system at a given time, which can help with troubleshooting.
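Because sample counts are proportional to time spent, you can translate them into approximate active time. A minimal sketch, assuming the default obs.ash.sample_interval of 1s (adjust the multiplier if you change the interval):

```sql
-- Approximate seconds of active time per statement fingerprint over the
-- past 5 minutes. With a 1-second sample interval, one sample corresponds
-- to roughly one second of activity.
SELECT workload_id,
       count(*) AS sample_count,
       count(*) * 1 AS approx_active_seconds  -- multiply by the interval in seconds
FROM information_schema.crdb_node_active_session_history
WHERE workload_type = 'STATEMENT'
  AND sample_time > now() - INTERVAL '5 minutes'
GROUP BY workload_id
ORDER BY sample_count DESC;
```

Remember this is a statistical estimate, not exact accounting: short-lived executions between sample points are missed entirely.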
These point-in-time samples can be used to:
- Root-cause slow queries: Understand exactly what a query was doing at specific points in time (e.g., waiting for locks, performing I/O, consuming CPU).
- Identify bottlenecks: Determine which resources (CPU, locks, I/O, network, admission control) are constraining workload performance.
- Troubleshoot transient issues: Diagnose performance problems that don't show up in aggregated statistics because they're intermittent or short-lived.
- Analyze resource usage patterns: Understand how different workloads (user queries, background jobs, system operations) consume cluster resources.
- Compare performance across time: Analyze how workload behavior changes during different time periods (e.g., peak vs. off-peak hours).
Use ASH alongside other monitoring tools
ASH complements CockroachDB's existing observability tools. Each tool serves a different purpose:
| Tool | What it's best for |
|---|---|
| Prometheus metrics | Real-time alerts on resource usage, latency, or errors. Monitoring overall cluster health and capacity trends over time. ASH is retrospective and sampling-based, not designed for alerting, and its in-memory storage means samples are lost on restart. |
| Statements Page (DB Console, Cloud Console) | Aggregated statistics about query performance over hours, days, or weeks. Identifying slow queries based on latency percentiles, tracking performance changes after schema updates, and understanding total resource consumption across all executions. ASH shows what was actively running at specific moments, but not the aggregated trends needed for performance baselines. |
| Insights Page (DB Console, Cloud Console) | Automatically flagging queries with performance problems such as high contention, failed executions, or suboptimal plans. Provides optimization recommendations without requiring diagnostic queries. ASH helps investigate why a problem occurred at a specific time, while Insights tells you which queries are problematic. |
| Statement diagnostics and execution logs | Complete traces of single query executions, including all operators, data flow, and exact execution plans with timings. ASH's sampling may miss short-lived operations and doesn't provide operator-level detail. |
ASH can often be used with these other tools to help troubleshoot issues. For example, Prometheus metrics might alert you to a problem (such as a CPU spike at 2:15 PM). ASH shows which queries and jobs were actively running during that spike and which took a disproportionate amount of time to run. The Statements Page then provides aggregated performance data for those queries over a longer period of time, and statement diagnostics give detailed execution plans for deeper analysis.
Configuration
The following cluster settings configure ASH behavior:
| Cluster Setting | Type | Default | Description |
|---|---|---|---|
| obs.ash.enabled | bool | false | Enables ASH sampling on the cluster. |
| obs.ash.sample_interval | duration | 1s | Time interval between samples. |
| obs.ash.buffer_size | int | 1,000,000 | Maximum samples retained in memory. At ~200 bytes per sample, the default uses ~200 MB per node. Changes take effect immediately. |
| obs.ash.log_interval | duration | 10m | How often a top-N workload summary is emitted to the OPS log channel. Also used as the lookback window for ASH reports written by the environment sampler profiler. |
| obs.ash.log_top_n | int | 10 | Maximum entries in each periodic log summary. |
| obs.ash.report.total_dump_size_limit | byte size | 32MiB | Garbage collection limit for ASH report files on disk. |
| obs.ash.response_limit | int | 10,000 | Maximum number of samples each node returns to the gateway node for a cluster-wide query. This protects the gateway node from running out of memory. |
ASH table reference
ASH data is accessible through two views in the information_schema system catalog:
- information_schema.crdb_node_active_session_history: Includes samples from the gateway node.
- information_schema.crdb_cluster_active_session_history: Includes samples from all nodes in the cluster. Querying the cluster-wide view may be more resource-intensive for large clusters.
The data for each sample is placed into a row with the following columns:
| Column | Type | Description |
|---|---|---|
| sample_time | TIMESTAMPTZ NOT NULL | When the sample was taken |
| node_id | INT NOT NULL | Node where the sample was captured |
| tenant_id | INT NOT NULL | Tenant that owns the work |
| workload_id | STRING | Identifies the workload (refer to workload columns) |
| workload_type | STRING NOT NULL | Kind of workload (refer to workload columns) |
| app_name | STRING | Application name; only set for SQL statement workloads |
| work_event_type | STRING NOT NULL | Resource category (refer to work_event columns) |
| work_event | STRING NOT NULL | Specific activity label (refer to work_event columns) |
| goroutine_id | INT NOT NULL | Go runtime goroutine ID |
workload columns
Each sample is attributed to a workload via the workload_type and workload_id columns. The encoding of workload_id depends on the workload_type:
| workload_type | What workload_id contains |
|---|---|
| STATEMENT | Hex-encoded statement fingerprint ID |
| JOB | Decimal job ID |
| SYSTEM | One of the following system task names: LDR, RAFT, STORELIVENESS, RPC_HEARTBEAT, NODE_LIVENESS, SQL_LIVENESS, TIMESERIES, RAFT_LOG_TRUNCATION, TXN_HEARTBEAT, INTENT_RESOLUTION, LEASE_ACQUISITION, MERGE_QUEUE, CIRCUIT_BREAKER_PROBE, GC, RANGEFEED, REPLICATE_QUEUE, SPLIT_QUEUE, DESCRIPTOR_LEASE |
| UNKNOWN | Unidentified. If you're seeing many unattributed samples for your workload, you may want to file an issue. |
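To see how activity splits across these workload types before drilling into a specific one, you can group samples by workload_type. A minimal sketch:

```sql
-- Distribution of recent samples across workload types on the gateway node.
SELECT workload_type, count(*) AS sample_count
FROM information_schema.crdb_node_active_session_history
WHERE sample_time > now() - INTERVAL '10 minutes'
GROUP BY workload_type
ORDER BY sample_count DESC;
```

A large UNKNOWN share suggests unattributed work worth reporting, while a large SYSTEM share points at background tasks rather than user queries.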
work_event columns
The work_event_type categorizes the resource being consumed or waited on. Types include CPU, IO, LOCK, NETWORK, ADMISSION, and OTHER. The work_event gives the specific activity.
CPU
work_events whose work_event_type is CPU represent active computation:
| work_event | Location | Description |
|---|---|---|
| Optimize | SQL optimizer | Cost-based query optimization |
| ReplicaSend | KV server | Replica-level batch evaluation |
| DistSenderLocal | KV client | DistSender processing a local batch |
| BatchFlowCoordinator | DistSQL (columnar) | Columnar flow coordination |
| ColExecSync | DistSQL (columnar) | Synchronous columnar execution |
| (processor name) | DistSQL processors | Dynamic: each DistSQL processor registers with its own name (e.g., hashJoiner, tablereader) |
IO
work_events whose work_event_type is IO represent storage I/O:
| work_event | Location | Description |
|---|---|---|
| KVEval | KV server | Batch evaluation in the storage layer |
LOCK
work_events whose work_event_type is LOCK represent lock and latch contention:
| work_event | Location | Description |
|---|---|---|
| LockWait | Concurrency manager | Waiting to acquire a key-level lock |
| LatchWait | Span latch manager | Waiting to acquire a span latch |
| TxnPushWait | Concurrency manager | Waiting for a conflicting transaction to be pushed |
| TxnQueryWait | Concurrency manager | Waiting for the status of a conflicting transaction |
NETWORK
work_events whose work_event_type is NETWORK represent remote RPCs:
| work_event | Location | Description |
|---|---|---|
| DistSenderRemote | KV client | DistSender waiting on a remote node RPC |
| InboxRecv | DistSQL | Receiving data from a remote DistSQL flow |
| OutboxSend | DistSQL | Sending data to a remote DistSQL flow |
ADMISSION
work_events whose work_event_type is ADMISSION represent admission control queues:
| work_event | Location | Description |
|---|---|---|
| kv-regular-cpu-queue | Admission control | KV regular work waiting for CPU admission |
| kv-elastic-store-queue | Admission control | KV elastic work waiting for store admission |
| kv-regular-store-queue | Admission control | KV regular work waiting for store admission |
| sql-kv-response | Admission control | SQL layer waiting for KV response admission |
| sql-sql-response | Admission control | SQL layer waiting for DistSQL response admission |
| ReplicationFlowControl | Replication admission | Waiting for replication flow control token |
OTHER
work_events whose work_event_type is OTHER represent miscellaneous wait points:
| work_event | Location | Description |
|---|---|---|
| CommitWaitSleep | KV server | Transaction commit-wait for linearizability |
| RaftProposalWait | KV server | Waiting for a Raft proposal to be applied |
| Backpressure | KV server | Range backpressure from splits/merges |
| LeaseAcquisition | KV server | Waiting to acquire a range lease |
| TenantRateLimit | KV server | Tenant rate limiter throttling |
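As a first pass before drilling into individual events, it can help to see what share of samples each resource category accounts for. A sketch (assumes window functions over aggregates, which CockroachDB supports):

```sql
-- Share of samples by resource category over the past 5 minutes.
SELECT work_event_type,
       count(*) AS sample_count,
       round(100.0 * count(*) / sum(count(*)) OVER (), 1) AS pct
FROM information_schema.crdb_node_active_session_history
WHERE sample_time > now() - INTERVAL '5 minutes'
GROUP BY work_event_type
ORDER BY sample_count DESC;
```

A high LOCK or ADMISSION percentage points to contention or throttling, while a high CPU or IO percentage suggests the node is simply busy doing work.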
ASH Metrics
The following metrics monitor ASH behavior:
| Metric | Type | Description |
|---|---|---|
| ash.work_states.active | Gauge | Number of goroutines with an active work state |
| ash.sampler.take_sample.latency | Histogram | Latency of each sample collection tick |
| ash.samples.collected | Counter | Total ASH samples collected |
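These metrics are exported through the standard Prometheus endpoint. One way to spot-check them from a SQL shell is the crdb_internal.node_metrics virtual table (an internal, unsupported interface whose schema may change):

```sql
-- List ASH-related metrics for the gateway node.
SELECT name, value
FROM crdb_internal.node_metrics
WHERE name LIKE 'ash.%';
```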
Debug zip integration
When the environment sampler triggers goroutine dumps or CPU profiles, ASH writes aggregated report files (.txt and .json) alongside them. These reports are included in cockroach debug zip output. The naming pattern for these files is as follows:
ash_report.{TIMESTAMP}.{TRIGGER}.{FORMAT}
- TIMESTAMP: When the report was made (formatted as 2006-01-02T15_04_05.000)
- TRIGGER: What event triggered the report (goroutine_dump or cpu_profile)
- FORMAT: .txt (human-readable) or .json (structured)
For example: ash_report.2026-03-05T12_00_00.000.goroutine_dump.txt
The lookback window for these reports is controlled by the obs.ash.log_interval cluster setting.
Common use cases and examples
ASH is accessed through the built-in CockroachDB SQL shell. Run cockroach sql to open the shell. CockroachDB Cloud deployments can also use the SQL Shell page on the Console.
Enable Active Session History
To enable ASH on your cluster:
SET CLUSTER SETTING obs.ash.enabled = true;
Enabling ASH begins collecting samples immediately. The in-memory buffer will fill up over time based on workload activity and the configured ASH cluster settings.
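To confirm the setting took effect, you can read it back:

```sql
SHOW CLUSTER SETTING obs.ash.enabled;
```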
View a node's work event data from the past minute
Scenario: A node is experiencing high resource utilization, but it's unclear which subsystem (CPU, I/O, locks, network) is consuming resources and what specific operations are involved.
You can query the node-level ASH view to see what resources the node has been consuming:
SELECT work_event_type, work_event, count(*) AS sample_count
FROM information_schema.crdb_node_active_session_history
WHERE sample_time > now() - INTERVAL '1 minute'
GROUP BY work_event_type, work_event
ORDER BY sample_count DESC;
The query returns the count of samples for each work event type and specific event:
work_event_type | work_event | sample_count
------------------+------------------------+--------------
NETWORK | DistSenderRemote | 42
OTHER | RaftProposalWait | 38
CPU | upsert | 19
ADMISSION | ReplicationFlowControl | 12
LOCK | LockWait | 8
IO | KVEval | 5
CPU | ReplicaSend | 3
NETWORK | InboxRecv | 2
(8 rows)
The results show this node's activity is dominated by network waits (DistSenderRemote) and Raft consensus waits (RaftProposalWait), which is typical for write-heavy workloads that must replicate data across nodes. The upsert CPU samples show time spent executing upsert statements, while ReplicationFlowControl admission samples indicate that the system is throttling writes due to replication backpressure. The LockWait samples indicate some transaction-level lock contention on hot keys. To identify which specific workloads are causing these waits, add workload_type, workload_id, and app_name to the query and group by them.
View cluster-wide workload data from the past 10 minutes
Scenario: Overall cluster resource consumption is high, but it's unclear which workloads (user queries, background jobs, or system tasks) are responsible for the activity.
You can query the cluster-wide ASH view to identify the top workloads consuming resources across all nodes:
SELECT workload_type, workload_id, app_name, count(*) AS sample_count
FROM information_schema.crdb_cluster_active_session_history
WHERE sample_time > now() - INTERVAL '10 minutes'
GROUP BY workload_type, workload_id, app_name
ORDER BY sample_count DESC
LIMIT 10;
The query returns the top 10 workloads by sample count:
workload_type | workload_id | app_name | sample_count
----------------+----------------------+-------------------------+--------------
STATEMENT | 9bef06d795045524 | kv | 2603
SYSTEM | INTENT_RESOLUTION | | 1383
JOB | 1028061800150958081 | | 925
SYSTEM | TXN_HEARTBEAT | | 166
SYSTEM | GC | | 91
STATEMENT | 61a0effae6169169 | $ internal-create-stats | 23
SYSTEM | LEASE_ACQUISITION | | 17
SYSTEM | NODE_LIVENESS | | 12
STATEMENT | ad360b79112da7e3 | myapp | 9
SYSTEM | TIMESERIES | | 7
(10 rows)
The results show that a single SQL statement fingerprint (9bef06d795045524) from the kv application is the largest consumer of cluster resources. System tasks like INTENT_RESOLUTION (async cleanup of transaction intents) and TXN_HEARTBEAT are also significant. To investigate the statement, use the workload_id to find the statement on the Statements page. To investigate the job, use its workload_id on the Jobs page. CockroachDB Cloud users can use the Statements page and Jobs page in the Cloud Console.
Find recent lock contention hotspots
Scenario: Elevated p99 latency and increased transaction retries indicate contention, but it's unclear which specific workloads are experiencing lock waits and what type of contention is occurring.
You can filter ASH samples to show only lock-related wait events:
SELECT workload_id, work_event, app_name, count(*) AS sample_count
FROM information_schema.crdb_node_active_session_history
WHERE work_event_type = 'LOCK'
AND sample_time > now() - INTERVAL '5 minutes'
GROUP BY workload_id, work_event, app_name
ORDER BY sample_count DESC;
The query shows which workloads experienced lock contention and what type of lock events they encountered:
workload_id | work_event | app_name | sample_count
-------------------+------------+---------------------------------------------+---------------
3be3baac97a1c623 | LatchWait | $ internal-write-job-progress-history-prune | 1
(1 row)
The results identify the statement fingerprint experiencing latch waits. Use the workload_id to locate the query on the Statements page and examine its execution plan and contention time. Review the Insights page for contention insights on this statement. CockroachDB Cloud users can use the Statements page and Insights page in the Cloud Console. If multiple workloads show LockWait events, investigate whether they're accessing the same tables or rows by examining their query patterns. For detailed contention analysis, see Monitor and Analyze Transaction Contention.
Get details about what a specific job is spending time on
Scenario: A background job (such as a backup, schema change, or import) is running longer than expected, but it's unclear whether the job is consuming CPU, waiting on I/O, or blocked by other resources.
You can filter by workload type and job ID to understand where the job is spending its time:
SELECT work_event_type, work_event, count(*) AS sample_count
FROM information_schema.crdb_node_active_session_history
WHERE workload_type = 'JOB'
AND workload_id = '12345'
AND sample_time > now() - INTERVAL '30 minutes'
GROUP BY work_event_type, work_event
ORDER BY sample_count DESC;
The query breaks down the job's resource consumption by work event type and specific activity:
work_event_type | work_event | sample_count
-----------------+---------------------+--------------
CPU | backupDataProcessor | 10
CPU | ReplicaSend | 10
(2 rows)
The results show the job spent most of its time on active computation. This breakdown helps identify that the backup job is primarily CPU-bound rather than I/O or lock-constrained. To find the job ID for a running job, query the Jobs page or use SHOW JOBS: SELECT job_id, description, status FROM [SHOW JOBS]. CockroachDB Cloud users can use the Jobs page in the Cloud Console.
Known limitations
- ASH is not recommended for nodes with 64 or more vCPUs, due to degraded performance on those nodes. #168289
- On Basic and Standard CockroachDB Cloud clusters, ASH samples only cover work running on the SQL pod. KV-level work (storage I/O, lock waits, replication, etc.) is not visible in ASH samples.
- KV work triggered during COMMIT (for example, intent resolution, Raft proposals deferred from earlier statements in an explicit transaction) is attributed to the last statement's fingerprint, not the statement that originally caused the work. #165864