Physical Cluster Replication Monitoring

On this page Carat arrow pointing down
Note:

This feature is in preview. This feature is subject to change. To share feedback and/or issues, contact Support.

New in v23.2: You can monitor a physical cluster replication stream using:

When you complete a cutover, there will be a gap in the primary cluster's metrics whether you are monitoring via the DB Console or Prometheus.

The standby cluster will also require separate monitoring to ensure observability during the cutover period. You can use the DB console to track the relevant metrics, or you can use a tool like Grafana to create two separate dashboards, one for each cluster, or a single dashboard with data from both clusters.

SQL Shell

In the standby cluster's SQL shell, you can query SHOW VIRTUAL CLUSTER ... WITH REPLICATION STATUS for detail on status and timestamps for planning cutover:

icon/buttons/copy
SHOW VIRTUAL CLUSTER application WITH REPLICATION STATUS;

Refer to Responses for a description of each field.

icon/buttons/copy
id |        name        |     data_state     | service_mode | source_tenant_name |                                                     source_cluster_uri                                               | replication_job_id |        replicated_time        |         retained_time         | cutover_time
---+--------------------+--------------------+--------------+--------------------+----------------------------------------------------------------------------------------------------------------------+--------------------+-------------------------------+-------------------------------+---------------
3  | application        | replicating        | none         | application        | postgresql://{user}:{password}@{hostname}:26257?options=-ccluster%3Dsystem&sslmode=verify-full&sslrootcert=redacted | 899090689449132033 | 2023-09-11 22:29:35.085548+00 | 2023-09-11 16:51:43.612846+00 |     NULL
(1 row)

Responses

Field Response
id The ID of a virtual cluster.
name The name of the standby (destination) virtual cluster.
data_state The state of the data on a virtual cluster. This can show one of the following: initializing replication, ready, replicating, replication paused, replication pending cutover, replication cutting over, replication error. Refer to Data state for more detail on each response.
service_mode The service mode shows whether a virtual cluster is ready to accept SQL requests. This can show one none or shared. When shared, a virtual cluster's SQL connections will be served by the same nodes that are serving the system virtual cluster.
source_tenant_name The name of the primary (source) virtual cluster.
source_cluster_uri The URI of the primary (source) cluster. The standby cluster connects to the primary cluster using this URI when starting a replication stream.
replication_job_id The ID of the replication job.
replicated_time The latest timestamp at which the standby cluser has consistent data — that is, the latest time you can cut over to. This time advances automatically as long as the replication proceeds without error. replicated_time is updated periodically (every 30s).
retained_time The earliest timestamp at which the standby cluster has consistent data — that is, the earliest time you can cut over to.
cutover_time The time at which the cutover will begin. This can be in the past or the future. Refer to Cut over to a point in time.
capability_name The capability name.
capability_value Whether the capability is enabled for a virtual cluster.

Data state

State Description
initializing replication The replication job is completing the initial scan of data from the primary cluster before it starts replicating data in real time.
ready A virtual cluster's data is ready for use.
replicating The replication job has started and is replicating data.
replication paused The replication job is paused due to an error or a manual request with ALTER VIRTUAL CLUSTER ... PAUSE REPLICATION.
replication pending cutover The replication job is running and the cutover time has been set. Once the the replication reaches the cutover time, the cutover will begin automatically.
replication cutting over The job has started cutting over. The cutover time can no longer be changed. Once cutover is complete, A virtual cluster will be available for use with ALTER VIRTUAL CLUSTER ... START SHARED SERVICE.
replication error An error has occurred. You can find more detail in the error message and the logs. Note: A physical cluster replication job will retry for 3 minutes before failing.

DB Console

You can access the DB Console for your standby cluster at https://{your IP or hostname}:8080/. Select the Metrics page from the left-hand navigation bar, and then select Physical Cluster Replication from the Dashboard dropdown. The user that accesses the DB Console must have admin privileges to view this dashboard.

Use the Graph menu to display metrics for your entire cluster or for a specific node.

To the right of the Graph and Dashboard menus, a time interval selector allows you to filter the view for a predefined or custom time interval. Use the navigation buttons to move to the previous, next, or current time interval. When you select a time interval, the same interval is selected in the SQL Activity pages. However, if you select 10 or 30 minutes, the interval defaults to 1 hour in SQL Activity pages.

When viewing graphs, two perpendicular lines will appear at your mouse cursor providing further insight into the data. The metric values are displayed in the legend under the graph. Click anywhere within the graph to pin the values in place, decoupling the values from your mouse movements. Click anywhere within the graph to cause the values to change with your mouse movements once more.

Hovering your mouse cursor over the graph title will display a tooltip with a description and the metrics used to create the graph.

Note:

The Physical Cluster Replication dashboard tracks metrics related to physical cluster replication jobs. This is distinct from the Replication dashboard, which tracks metrics related to how data is replicated across the cluster, e.g., range status, replicas per store, and replica quiescence.

The Physical Cluster Replication dashboard contains graphs for monitoring:

Logical bytes

DB Console Logical Bytes graph showing results over the past hour

The Logical Bytes graph shows you the throughput of the replicated bytes.

Hovering over the graph displays:

  • The date and time.
  • The number of logical bytes replicated in MiB.
Note:

When you start a replication stream, the Logical Bytes graph will record a spike of throughput as the initial scan completes.

SST bytes

DB Console SST bytes graph showing results over the past hour

The SST Bytes graph shows you the rate at which all SST bytes are sent to the KV layer by physical cluster replication jobs.

Hovering over the graph displays:

  • The date and time.
  • The number of SST bytes replicated in MiB.

Prometheus

You can use Prometheus and Alertmanager to track and alert on physical cluster replication metrics. Refer to the Monitor CockroachDB with Prometheus tutorial for steps to set up Prometheus.

We recommend tracking the following metrics:

  • physical_replication.logical_bytes: The logical bytes (the sum of all keys and values) ingested by all physical cluster replication jobs.
  • physical_replication.sst_bytes: The SST bytes (compressed) sent to the KV layer by all physical cluster replication jobs.
  • physical_replication.replicated_time_seconds: The replicated time of the physical replication stream in seconds since the Unix epoch.

Data verification

Note:

This feature is in preview. It is in active development and subject to change.

The SHOW EXPERIMENTAL_FINGERPRINTS statement verifies that the data transmission and ingestion is working as expected while a replication stream is running. Any checksum mismatch likely represents corruption or a bug in CockroachDB. Should you encounter such a mismatch, contact Support.

To verify that the data at a certain point in time is correct on the standby cluster, you can use the current replicated time from the replication job information to run a point-in-time fingerprint on both the primary and standby clusters. This will verify that the transmission and ingestion of the data on the standby cluster, at that point in time, is correct.

  1. Retrieve the current replicated time of the replication job on the standby cluster with SHOW VIRTUAL CLUSTER:

    icon/buttons/copy
    SELECT replicated_time FROM [SHOW VIRTUAL CLUSTER standbyapplication WITH REPLICATION STATUS];
    
         replicated_time
    ----------------------------
    2024-01-09 16:15:45.291575+00
    (1 row)
    

    For detail on connecting to the standby cluster, refer to Set Up Physical Cluster Replication.

  2. From the primary cluster's system virtual cluster, specify a timestamp at or earlier than the current replicated_time to retrieve the fingerprint. This example uses the current replicated_time:

    icon/buttons/copy
    SELECT * FROM [SHOW EXPERIMENTAL_FINGERPRINTS FROM VIRTUAL CLUSTER application] AS OF SYSTEM TIME '2024-01-09 16:15:45.291575+00';
    
    tenant_name |             end_ts             |     fingerprint
    ------------+--------------------------------+----------------------
    application | 1704816945291575000.0000000000 | 2646132238164576487
    (1 row)
    

    For detail on connecting to the primary cluster, refer to Set Up Physical Cluster Replication.

  3. From the standby cluster's system virtual cluster, specify the same timestamp used on the primary cluster to retrieve the standby cluster's fingerprint:

    icon/buttons/copy
    SELECT * FROM [SHOW EXPERIMENTAL_FINGERPRINTS FROM VIRTUAL CLUSTER standbyapplication] AS OF SYSTEM TIME '2024-01-09 16:15:45.291575+00';
    
        tenant_name     |             end_ts             |     fingerprint
    --------------------+--------------------------------+----------------------
    standbyapplication  | 1704816945291575000.0000000000 | 2646132238164576487
    (1 row)
    
  4. Compare the fingerprints of the primary and standby clusters to verify the data. The same value for the fingerprints indicates the data is correct.

See also


Yes No
On this page

Yes No