SQL Prober: Black-box monitoring in Managed CockroachDB

This blog post reflects my work as an intern this summer with the SRE team at Cockroach Labs. Earlier this year, we started building a service offering called Managed CockroachDB. While working with a group of beta customers, we found an unconventional solution to a huge problem: how do we create black-box monitoring in a distributed system?

Improving our monitoring system: black-box monitoring!

Four months into our beta release, we had an incident: creating new databases and tables was broken globally on one of our internal beta clusters. The Core team tracked the issue down to a bug that involved Raft, and it was presumed to be caused by a network partition. The temporary solution was easy: restarting the Cockroach node with the broken range fixed the issue.

While it was an easy fix, it alerted us to a much bigger issue: we needed something in place that would reduce our Mean Time to Detect (MTTD). All of our alerts during that time were built on top of metrics collected through CockroachDB. These metrics, also known as white-box metrics, are data collected based on the internals of the system.

Furthermore, when the issue was resolved, there wasn’t a way to measure the impact of the incident. We know for sure that the customer was not able to run queries for a period of time, but how long was it exactly? We didn’t know.

Why we built black-box monitoring

There were a few ways to address this problem. We could either rely on the existing white-box metrics that we had and add more alerting rules, or build a new monitoring tool that uses black-box metrics. Since we wanted something that would allow us to test the behavior of clusters as our customers would see them, the latter approach (also known as black-box monitoring) seemed more appropriate. By doing this, we could also avoid the situation in which our monitoring system goes down if the system that is being monitored goes down. In the end, we built SQL Prober, an internal black-box monitoring system^[1]. Our initial goal was to reduce MTTD and measure the uptime of our customers’ clusters.

The framework for SQL Prober is simple. It functions like a cron (a time-based job scheduler) and executes a set of probe tasks periodically at fixed times. On top of that, it contains a service discovery mechanism built using CockroachDB’s node liveness data^[2]. Node liveness data provide us with authoritative information about which nodes are live at any given point in time, and if that breaks, most of the core features in CockroachDB will fail to function.

A unique architecture requires a unique solution

How can we collect black-box metrics through SQL Prober? Well, we could add a task that issues SELECT 1 queries to the database and collect appropriate metrics. But this will not exercise all the layers that CockroachDB has, and in particular, the key-value storage. We need queries that will cover the key-value layer because that is where all the data is coming from.

There are a couple of constraints that we needed to satisfy. We wanted:

Queries that will hit the key-value layer.
Queries that will not affect our customer’s data.
Queries that can detect when a node will fail to serve customer’s data.

We could easily satisfy (1) and (3) by issuing read and write queries to all of our customer’s schemas. However, this does not satisfy (2) and there is a possibility that there will be a lot of schemas and data to work with. Satisfying (2) with this approach might be possible by modifying CockroachDB itself, but that involves quite a bit of work, and we still need to solve the latter problem.

On the other hand, we could satisfy (1) and (2) by issuing read and write queries to our own custom schema, but how do we ensure we could satisfy (3)? In CockroachDB, data get scattered all over the cluster in the form of ranges. There could be a possibility in which nodes that store customer’s data do not overlap with nodes that store our own data, so issuing queries to our own schema will not help with (3).

Luckily, we have a solution that could satisfy all three constraints: replication zones. We could use replication zones to ensure that every node in the cluster contains at least one range of our data and the leaseholders for those ranges are scattered evenly across the cluster. That way, sending a single SELECT * statement will reach all the nodes in the cluster. The assumption that we are relying on here is that if a node fails to serve a single range, it might fail to serve other ranges as well. That is how we satisfy (3).

Using geo-partitioning in black-box monitoring

As we have defined previously, geo-partitioning grants developers row-level replication control. We created a nodes table that contains just a single node_id column (also the primary key), and attempted to store data for each row onto the relevant nodes based on the node’s ID. The table stores the node IDs of all the nodes in the cluster. Here’s the CREATE TABLE statement for the nodes table:

CREATE TABLE IF NOT EXISTS `nodes` ( node_id INT NOT NULL, PRIMARY KEY (node_id) )

The ALTER TABLE ... PARTITION BY statement is then used to create a partition for each row. (Each node can only correspond to a single row since node_id is the primary key of the nodes table.) For example, the ALTER TABLE ... PARTITION BY statement for a three-node healthy CockroachDB cluster is shown below:

ALTER TABLE nodes PARTITION BY LIST (node_id) ( PARTITION node_1 VALUES IN (1), PARTITION node_2 VALUES IN (2), PARTITION node_3 VALUES IN (3), PARTITION "default" VALUES IN (DEFAULT) )

Now that we have defined partitions on the nodes table, we will need to configure location-specific replication zones for these partitions. For this to work, each CockroachDB node is configured at startup with hierarchical information about its locality. These localities are then used as targets of replication zone configurations. To set these zone configurations, the ALTER PARTITION ... CONFIGURE ZONE statement is used.

Since we want each partition to link to its corresponding node, we will need to create a replication zone that is unique to each node using both constraints and lease_preferences. Node IDs seem like the perfect key to use for these constraints. Unfortunately, locality flags can only be specified during startup at the moment, and there isn’t a way to obtain those node IDs during that time. We resorted to using the hostnames of nodes to identify each node.

By default, each non-system range has three replicas. We used a per-replica constraint to pick the exact location of just one replica of each range, and locate the lease of each range to be the same location as the replica that we just picked. By doing that, we know that if that specific node is healthy, the lease must be located on that node, and when its corresponding row is queried, the data will be obtained from that node.

Here’s an example. Assuming that node 1 has a hostname of localhost:26257, the statement below ensures that at least one replica for partition node_1 is stored on node 1, and the lease for ranges of that partition is preferred to be located on node 1.

ALTER PARTITION node_1 OF TABLE nodes CONFIGURE ZONE USING constraints='{"+dns=localhost:26257": 1}', lease_preferences = '[[+dns=localhost:26257]]'

The only caveat to using this approach is that if the hostname of a specific node changes (e.g. caused by a node restart), we will need to update the zone configuration corresponding to that specific node’s partition.

The Probe Task

Now that we have geo-partitioning set up, the only thing left is to query from the nodes table and ensure that the request is served by our desired nodes by checking the leaseholders of those ranges. Note that depending on when entries are inserted into the table, there is a possibility that a lease transfer needs to happen. To solve that, we will wait approximately one minute before executing the probe task for the constraints to be satisfied and leases to be transferred. We cannot guarantee that all leases will be transferred after a minute, but that period should be sufficient for lease transfers to occur.

If you are familiar with replication zones, you might be wondering: why not reduce the replica count to 1 and avoid checking leaseholders? That is something that we could potentially work on in the future. At the moment, it seems like querying the internal crdb_internal.ranges schema will hang when any range has a loss of quorum. Some investigation needs to be made, and we might need to make changes to the database itself. To move forward with the SQL Prober, we resorted to checking the leaseholders as a first step. (See #29638 for more information.)

For every probe task, we will pick a random node to be the gateway node from the list of healthy nodes. (Note that this gateway node is SQL Prober specific and is unrelated to the gateway node used in CockroachDB as described in the Life of a Distributed Transaction article.) This list is kept updated by the service discovery mechanism that we have described earlier. In order to verify that we could actually read data from all the healthy nodes, we will need to:

Verify leaseholders for ranges of the nodes table.
Read data from the nodes table.

Earlier, we applied the lease_preferences constraints to partitions. This will attempt to place the leaseholders for ranges of the nodes table in specific locations, and if that fails, it will fallback to locations of other replicas which are available. Step 1 is crucial because there is a possibility that the leaseholder for a specific range is not held by our desired node. If verification of leaseholder fails, it is meaningless to run Step 2 since that range request will be served by a different node.

Step 1: Verify leaseholders for ranges of the nodes table

We have created a custom nodes_partitions view for leaseholder verification. nodes_partitions is a view that returns the ranges of the nodes table, specifying which partition the ranges belong to and the current leaseholders for those ranges. With that view, it is trivial to verify leaseholders for ranges of the nodes table. (Details about the view will be described in the next section.)

Here’s an example output of the nodes_partitions view:

root@:26257/monitoring> select * from nodes_partitions; node_id | lease_holder | range_id | replicas +---------+--------------+----------+----------+ 1 | 1 | 82 | {1,2,3} 2 | 2 | 83 | {1,2,3} 3 | 3 | 84 | {1,2,3} (3 rows)

We could issue a query to verify if the values in both the node_id and lease_holder columns match. A mismatch signifies that either a node is down or the lease has not been transferred yet.

Step 2: Read data from the nodes table

Now that we have verified leases for ranges that we care about, we will execute a simple SELECT statement to read from the nodes table for nodes that we care about. Queries should return within a timeout period, and if they do, we are done.

A deep-dive into the nodes_partitions view

Based on the schema of the nodes table and how partitions were created, we can be sure that a partition can only have one range. We will now attempt to understand how ranges of the nodes table are mapped to CockroachDB nodes based on the structure of those ranges.

Ranges are contiguous chunks of CockroachDB’s key-space, and every range contains starting and ending markers. These markers define what keys are stored in a particular range. The keys are represented using the following structure:

/<table Id>/<index id>/<indexed column values>

The table itself is stored with an index_id of 1 for its PRIMARY KEY column(s). These keys are actually used in our underlying key-value storage to map table data to key-value storage. (If you want to learn more about this, feel free to check out the article about the Distribution Layer in CockroachDB and this blog post on mapping table data to key-value storage.)

By default, a newly created empty table will have one range corresponding to it. Partitioning a table by list without applying any zone configurations will not result in range splits unless the size of the default range exceeds range_max_bytes for that specific zone (which is unlikely), a manual range split is triggered, or whenever ranges are eligible for load-based splitting.

We will illustrate range splits due to applying zone configurations using the example below. For the purpose of this example, we will assume a three-node CockroachDB cluster and that the ID of the created nodes schema is 677. Here’s the complete schema of the nodes table with partitions defined:

CREATE TABLE IF NOT EXISTS nodes ( node_id INT NOT NULL, PRIMARY KEY (node_id) ) PARTITION BY LIST (node_id) ( PARTITION node_1 VALUES IN (1), PARTITION node_2 VALUES IN (2), PARTITION node_3 VALUES IN (3), PARTITION "default" VALUES IN (DEFAULT) )

By default, we have the following range for the nodes table:

root@:26257/monitoring> SELECT range_id, start_pretty, end_pretty FROM crdb_internal.ranges WHERE table_name = 'nodes'; range_id | start_pretty | end_pretty +----------+--------------+------------+ 802 | /Table/677 | /Max (1 row)

Usually, keys of a specific range will span from /Table/<table id> to /Table/<next table id>. If the next table ID does not exist, the key will be /Max.

Applying a zone configuration on partition node_3 will result in range splits:

Observe that two splits occurred - one at /Table/677/1/3, and another one /Table/677/1/4. Since the index used here is the primary key of the table, which is an integer, we would expect that the next indexed column value is 4. We can now infer that range 773 corresponds to partition node_3, and in fact, if we were to continue to apply zone configurations to the remaining partitions, more range splits will occur and we will notice that the partition that the range belongs to can be determined by just looking at the starting key (with a small caveat that will be described below). We will leverage this idea to map ranges to CockroachDB nodes.

Now that we know how to map ranges to CockroachDB nodes, we can generate our nodes_partitions view for leaseholder verification. There are two edge cases that we will need to consider when generating the view:

Range splits that are made on non-consecutive node IDs. For example, node_id = 1 and 4, resulting in the ranges below: range_id | start_pretty | end_pretty +----------+----------------+----------------+ 800 | /Table/677 | /Table/677/1/1 801 | /Table/677/1/1 | /Table/677/1/2 --> Node 1 802 | /Table/677/1/2 | /Table/677/1/4 803 | /Table/677/1/4 | /Table/677/1/5 --> Node 4 804 | /Table/677/1/5 | /Max
To avoid this situation, we will need to verify that the size of the range (end_key - start_key) is 1.
The view could possibly show ranges for partitions that we do not care about. For example, if we apply zone configurations to partitions for nodes 1 and 3, the nodes_partitions view generated using the approach described above will show ranges belonging to partitions for nodes 1, 2, and 3. This can be easily avoided by applying a filter to only project the nodes that we care about when querying the view.

And finally, here’s the CREATE VIEW statement for our nodes_partition schema:

CREATE VIEW nodes_partitions AS SELECT start_key AS node_id, lease_holder, range_id, replicas FROM [SELECT SUBSTR(start_key, 2)::int AS start_key, SUBSTR(end_key, 2)::int AS end_key, lease_holder, range_id, replicas FROM [SELECT crdb_internal.pretty_key(start_key, 2) AS start_key, crdb_internal.pretty_key(end_key, 2) AS end_key, crdb_internal.lease_holder(start_key) AS lease_holder, range_id, replicas FROM crdb_internal.ranges_no_leases WHERE database_name = 'monitoring' AND table_name = 'nodes'] WHERE LENGTH(start_key) > 0 AND LENGTH(end_key) > 0] WHERE end_key = start_key + 1

In the statement above, we have used a few built-in functions:

crdb_internal.pretty_key(raw_key: bytes, skip_fields: int): To convert the keys in bytes to strings. In our case, we would want skip_fields = 2 to skip <table id> and <index id>.
crdb_internal.leaseholder(raw_key: bytes): To fetch the leaseholder corresponding to raw_key.
length(val: string): Calling pretty_key(...) on /Table/<table id>or /Max will yield an empty string, and hence the length(...)call to remove all boundary ranges that are irrelevant.
substring(input: string, substr_pos: int) : Used to trim the / prefix from the output of pretty_key(...).

Future Work for SQL Prober

Now that we have the basic framework for SQL Prober, we could easily add more tasks to it. One important future work is to make sense of the metrics data that we have collected. For example, building a dashboard and linking metrics to our existing alerting system.

There are also a couple of ideas that we might be working on next:

There’s quite a bit of bookkeeping that needs to be made for leaseholder verification. One thing that we could potentially explore is to rely on just the constraints when applying zone configurations and drop the leaseholder preferences. For that to work, we will need to change the replica count to 1. However, if we were to do that, the unavailability of a single node might cause some queries to timeout when we have a loss of quorum. See #29638 for more information. I do not know if there are any side effects to this (i.e. if anything is relying on the crdb_internal.ranges schema), so some investigation needs to be made.
Only reads are being tested at the moment. We could extend the nodes table to test writes as well.
Database-level replication zones can be used so that Data Definition Language (DDL) statements such as CREATE and ALTER can be tested as well.

TL;DR

Here’s a quick summary of lessons I learned at my internship: with SQL Prober, we can not only measure the impact of incidents when they occur, but also reduce our MTTD since we have an improved monitoring system. Black-box monitoring helps to catch issues that white-box monitoring couldn’t. Building a monitoring system needs to be carefully planned. We should collect metrics for everything and only alert on the important ones. This is not something that can be done in a short period of time as both metrics and alerts need to be refined over time for a good monitoring and alerting system.

Geo-partitioning is usually used for multi-region workloads to improve latency. I find it to be an amazing feature, and it is also a key differentiator for CockroachDB. In this blog post, we have explored one unconventional way of using geo-partitioning, which is for black-box monitoring. What else can you do with geo-partitioning?

If you have not tried out geo-partitioning yet, I would suggest you to check out the following articles:

Are you interested in monitoring distributed systems or helping build a Database as a Service (DBaaS) on Kubernetes? We’re hiring for SREs and other positions! Check out our careers page to learn more.

Footnotes

[1] For more details on white-box and black-box monitoring, check out this article which is based on Google's SRE book: Monitoring Distributed Systems.

[2] See CockroachDB’s internal virtual database, crdb_internal, specifically the gossip_liveness and gossip_nodes schemas, for more information.