Backup and Restore Monitoring

CockroachDB includes metrics to monitor backup, restore, and scheduled backup jobs. You can use monitoring integrations to alert on anomalies, such as backups that have failed or restore jobs that encounter a retryable error. You can access the Prometheus endpoint to track and alert on backup and restore metrics. We recommend setting up monitoring that alerts when anomalies occur. You can then use SQL statements such as `SHOW SCHEDULES`, `SHOW JOBS`, and `SHOW BACKUP` to inspect details relating to schedules, jobs, and backups.

Metrics are reported per node, so you must retrieve metrics from every node in the cluster. For example, to monitor whether a scheduled backup fails, track `schedules.BACKUP.failed` on each node.
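
Because each node reports its own values, an alert has to look at all of them. The following is a minimal sketch, assuming you have already collected one `schedules.BACKUP.failed` sample per node; the node names and values are hypothetical.

```python
from typing import Dict, List


def nodes_reporting_failures(per_node: Dict[str, float]) -> List[str]:
    """Return the nodes whose failed-backup value is above zero."""
    return sorted(node for node, value in per_node.items() if value > 0)


def cluster_total(per_node: Dict[str, float]) -> float:
    """Sum the per-node values into a single cluster-wide figure."""
    return sum(per_node.values())


if __name__ == "__main__":
    # Hypothetical samples of schedules.BACKUP.failed, one per node.
    samples = {"node1": 0.0, "node2": 2.0, "node3": 0.0}
    print("nodes with failures:", nodes_reporting_failures(samples))
    print("cluster total:", cluster_total(samples))
```

Checking a single node would miss the failures recorded on `node2` in this example, which is why the per-node view matters.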

Prometheus endpoint

You can access the Prometheus endpoint (http://<host>:<http-port>/_status/vars) for backup and restore metrics. Refer to the Prometheus monitoring tutorial for guidance on installing and setting up Prometheus and Alertmanager to track metrics.
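
For a quick look at what the endpoint returns without a full Prometheus setup, the response can be read and filtered directly. This is a rough sketch under two assumptions: that the endpoint serves the Prometheus text exposition format, and that metric names appear there with dots replaced by underscores (for example `schedules_BACKUP_failed`). The helper names `parse_metrics` and `fetch_metrics` are illustrative only, and a secure cluster would likely need `https` and credentials that this sketch does not handle.

```python
import urllib.request
from typing import Dict, Tuple

# Assumed Prometheus-style prefixes for the backup and restore metrics.
PREFIXES: Tuple[str, ...] = ("schedules_BACKUP", "schedules_round", "jobs_backup", "jobs_restore")


def parse_metrics(text: str, prefixes: Tuple[str, ...] = PREFIXES) -> Dict[str, float]:
    """Extract matching samples from Prometheus text exposition output."""
    metrics: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line and "}" in line:
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:].split()
        else:
            name, *rest = line.split()
        if not rest or not name.startswith(prefixes):
            continue
        try:
            # If a name appears with several label sets, the last one wins here.
            metrics[name] = float(rest[0])
        except ValueError:
            continue
    return metrics


def fetch_metrics(host: str, http_port: int) -> Dict[str, float]:
    """Read one node's /_status/vars endpoint and return the filtered samples."""
    url = f"http://{host}:{http_port}/_status/vars"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

Calling `fetch_metrics` once per node gives the per-node values needed for alerting.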

Available metrics

We recommend the following guidelines:

Use the schedules.BACKUP.last_completed_time metric to monitor the specific backup job or jobs you would use to recover from a disaster.
Configure alerting on the schedules.BACKUP.last_completed_time metric to watch for cases where the timestamp has not moved forward as expected.
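
The second guideline amounts to a staleness check: compare the reported timestamp with the current time and alert when the gap exceeds what the backup schedule should allow. A minimal sketch follows. The threshold is a value you choose, and treating a value of 0 or less as "never completed" is an assumption of this example rather than documented behavior.

```python
import time
from typing import Optional


def backup_is_stale(last_completed_unix: float,
                    max_age_seconds: float,
                    now: Optional[float] = None) -> bool:
    """Return True when the last completed backup is older than allowed."""
    if now is None:
        now = time.time()
    if last_completed_unix <= 0:
        # Assumption: a non-positive value means no backup has completed yet.
        return True
    return (now - last_completed_unix) > max_age_seconds


if __name__ == "__main__":
    # Hypothetical case: a daily schedule with two hours of slack.
    threshold = 26 * 60 * 60
    last_completed = time.time() - 30 * 60 * 60  # completed 30 hours ago
    print("stale:", backup_is_stale(last_completed, threshold))
```

Setting the threshold somewhat above the schedule interval leaves room for normal variation in backup duration.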

Metric	Description
`schedules.BACKUP.failed`	The number of scheduled backup jobs that have failed. Note: A stuck scheduled job will not increment this metric.
`schedules.BACKUP.last_completed_time`	The Unix timestamp of the most recently completed backup by a schedule specified as maintaining this metric. Note: This metric only updates if the schedule was created with the option that maintains it.
New in v23.1: `schedules.BACKUP.protected_age_sec`	The age of the oldest protected timestamp record held by backup schedules.
New in v23.1: `schedules.BACKUP.protected_record_count`	The number of protected timestamp records held by backup schedules.
`schedules.BACKUP.started`	The number of scheduled backup jobs that have started.
`schedules.BACKUP.succeeded`	The number of scheduled backup jobs that have succeeded.
`schedules.round.reschedule_skip`	The number of schedules that were skipped due to a currently running job. A value greater than 0 indicates that a previous backup was still running when a new scheduled backup was supposed to start. This corresponds to the schedule option that skips a run while the previous job is still running.
`schedules.round.reschedule_wait`	The number of schedules that were rescheduled due to a currently running job. A value greater than 0 indicates that a previous backup was still running when a new scheduled backup was supposed to start. This corresponds to the schedule option that waits for the previous job to finish.
New in v23.1: `jobs.backup.currently_paused`	The number of backup jobs currently considered paused.
`jobs.backup.currently_running`	The number of backup jobs currently running in `Resume` or `OnFailOrCancel` state.
`jobs.backup.fail_or_cancel_retry_error`	The number of backup jobs that failed with a retryable error on their failure or cancelation process.
`jobs.backup.fail_or_cancel_completed`	The number of backup jobs that successfully completed their failure or cancelation process.
`jobs.backup.fail_or_cancel_failed`	The number of backup jobs that failed with a non-retryable error on their failure or cancelation process.
New in v23.1: `jobs.backup.protected_age_sec`	The age of the oldest protected timestamp record held by backup jobs.
New in v23.1: `jobs.backup.protected_record_count`	The number of protected timestamp records held by backup jobs.
`jobs.backup.resume_failed`	The number of backup jobs that failed with a non-retryable error.
`jobs.backup.resume_retry_error`	The number of backup jobs that failed with a retryable error.
New in v23.1: `jobs.restore.currently_paused`	The number of restore jobs currently considered paused.
`jobs.restore.currently_running`	The number of restore jobs currently running in `Resume` or `OnFailOrCancel` state.
`jobs.restore.fail_or_cancel_failed`	The number of restore jobs that failed with a non-retryable error on their failure or cancelation process.
`jobs.restore.fail_or_cancel_retry_error`	The number of restore jobs that failed with a retryable error on their failure or cancelation process.
New in v23.1: `jobs.restore.protected_age_sec`	The age of the oldest protected timestamp record held by restore jobs.
New in v23.1: `jobs.restore.protected_record_count`	The number of protected timestamp records held by restore jobs.
`jobs.restore.resume_completed`	The number of restore jobs that successfully resumed to completion.
`jobs.restore.resume_failed`	The number of restore jobs that failed with a non-retryable error.
`jobs.restore.resume_retry_error`	The number of restore jobs that failed with a retryable error.
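
Several of the metrics above count events, so an alert is usually more useful on the change between two readings than on the absolute value. The sketch below compares two scrapes from the same node. It assumes the failure metrics behave as cumulative counters that can reset when a node restarts; the function name and the chosen metric list are illustrative only.

```python
from typing import Dict, Tuple

# Metrics from the table above that this sketch treats as failure counters.
FAILURE_METRICS: Tuple[str, ...] = (
    "schedules.BACKUP.failed",
    "jobs.backup.resume_failed",
    "jobs.restore.resume_failed",
)


def new_failures(previous: Dict[str, float],
                 current: Dict[str, float],
                 names: Tuple[str, ...] = FAILURE_METRICS) -> Dict[str, float]:
    """Return the metrics whose value increased between two scrapes."""
    increases: Dict[str, float] = {}
    for name in names:
        before = previous.get(name, 0.0)
        after = current.get(name, 0.0)
        # Assumption: a lower value than before means the counter was reset,
        # so the whole current value is counted as new.
        delta = after - before if after >= before else after
        if delta > 0:
            increases[name] = delta
    return increases


if __name__ == "__main__":
    # Hypothetical readings taken a few minutes apart on one node.
    earlier = {"jobs.backup.resume_failed": 1.0, "schedules.BACKUP.failed": 0.0}
    later = {"jobs.backup.resume_failed": 3.0, "schedules.BACKUP.failed": 0.0}
    print(new_failures(earlier, later))
```

The same comparison could be applied to the retryable-error metrics to spot jobs that keep retrying without ever failing outright.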

Datadog integration

To use the Datadog integration with your CockroachDB self-hosted cluster, you can set up the Datadog platform to collect and alert on the available backup metrics. Refer to the Datadog integration tutorial for instructions.

Available metrics in Datadog

Metric	Description
`schedules.BACKUP.succeeded`	The number of scheduled backup jobs that have succeeded.
`schedules.BACKUP.started`	The number of scheduled backup jobs that have started.
`schedules.BACKUP.last_completed_time`	The Unix timestamp of the most recently completed backup by a schedule specified as maintaining this metric.
`schedules.BACKUP.failed`	The number of scheduled backup jobs that have failed.
