Automated alert and aggregation rule generation for CockroachDB metrics

In any software system, metrics are crucial for understanding the system's inner workings and for getting a pulse on how it is functioning. Any monitoring and debugging framework is incomplete without them.

To use metrics effectively, however, it is important to understand two things: which aspect of the system a particular metric measures, and how it should be used to interpret the health of the system. Additionally, building effective monitoring dashboards and alerts requires identifying correlations between multiple metrics.

Here at Cockroach Labs, we have frequently seen metrics underused or misused because of a lack of documentation and guidance on how to aggregate them and use them as indicators of system health and performance. We aim to solve these problems!

This blog post discusses our plan for a solution: building an automated alert and aggregation rule generation framework for CockroachDB metrics.

Current State

CockroachDB emits many metrics that indicate the health of the system. These metrics are viewable through the built-in DB Console UI, and they can also be scraped and viewed through Prometheus using the _status/vars HTTP endpoint. Each metric is annotated with a help field, which describes exactly what the metric measures.

We have, however, been lacking a framework to provide additional information about how these metrics should be consumed and used for monitoring and debugging. There is also no way to provide help on how individual metrics correlate with one another, nor how they can be aggregated together to provide a more holistic view of the DB system’s health.

Let’s look at an example. Consider the very simple metric ‘capacity’. The help field indicates this metric outputs the ‘Total storage capacity’. While this information is helpful to get a basic idea of what this metric outputs, there is very little information on how this metric can be interpreted to get storage information about CockroachDB. For example:

How can I use this metric to get the total node capacity?
How do I use it to get the total cluster capacity?
How can I use it to determine the current available capacity?

These are not easy questions to answer unless you are aware of how this metric has been implemented.

Sure, you can integrate these metrics into Prometheus, dig around with the various labels and try to reverse engineer how to use this metric to get to these answers, but this creates needless friction for our end users’ ability to use metrics for diagnostic and debugging purposes. And all this for a fairly basic metric! Answering nuanced questions around metrics usage gets even more complicated, of course, with more intricate metrics.

Understanding the various dimensions for a metric is crucial to using it in the right way.
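To make the point about dimensions concrete, here is a minimal sketch of PromQL expressions that could answer the three capacity questions above. It assumes that capacity is reported once per store, that each series carries store and instance labels, and that a companion capacity_available metric exists; the namedQuery type and capacityQueries function are hypothetical helpers used only for illustration.

```go
package main

import "fmt"

// namedQuery pairs a question about storage with a PromQL expression that
// might answer it. Hypothetical helper type, not part of CockroachDB.
type namedQuery struct {
	Question string
	Expr     string
}

// capacityQueries returns candidate expressions. The label names (store,
// instance) are assumptions about how the metric is exported.
func capacityQueries() []namedQuery {
	return []namedQuery{
		// Collapse the per-store dimension to get one series per node.
		{"Total capacity per node", "sum without(store) (capacity)"},
		// Additionally collapse the per-node dimension for a cluster total.
		{"Total capacity of the cluster", "sum without(instance) (sum without(store) (capacity))"},
		// Same per-node aggregation, applied to the available-space metric.
		{"Available capacity per node", "sum without(store) (capacity_available)"},
	}
}

func main() {
	for _, q := range capacityQueries() {
		fmt.Printf("%-32s %s\n", q.Question+":", q.Expr)
	}
}
```

None of these expressions is obvious from the help text "Total storage capacity" alone, because nothing in that text says the metric is reported per store.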

While engineers working on CockroachDB internals know how the metrics they implement should be used, we need a framework that lets them communicate this knowledge in a form that can be consumed by our site reliability engineers (SREs), technical support engineers (TSEs) and even our customers. Even our database engineers may not have comprehensive knowledge of every metric output by the database and how to use it.

Providing a mechanism for engineers implementing the metric to share more information on how to use the metric can be very valuable for all consumers of the metric. It is this knowledge gap that the automated alert and aggregation rule generation framework attempts to bridge.

Framework Design

The main design consideration here was to keep the new framework easy to understand and use — for both our end users who would consume this new information on metrics usage as well as database engineers implementing the metrics themselves.

The new framework will enable engineers to specify alerting and aggregation rules for metrics using PromQL syntax. We will use a registry internally to track all defined rules. These rules are exported in a YAML format through an HTTP endpoint and are consistent with Prometheus alert and aggregation rule syntax. By using PromQL syntax for defining alerting and aggregation expressions, we make these rules easy to use directly for monitoring DB health within Prometheus/Grafana.

While the HTTP endpoint cannot be consumed programmatically, the YAML format allows for the rules to be easily imported for usage in alerting/monitoring configurations.

Interface and API

This section will cover:

The interfaces and struct design for specifying alert and aggregation rules.
The API to expose these alert and aggregation rules in a YAML format.

Interface:

AlertingRule and AggregationRule will be used to construct alerts and aggregations respectively. Each rule will contain an expr string, which will capture the metric(s) involved in the rule and how they should be aggregated. The expr will use PromQL syntax and will be a complete, valid Prometheus expression. End users can use these expressions as guidelines when constructing the actual alert and recording rule YAML files for monitoring a cluster.

AlertingRule also contains a field called recommendedHoldDuration, which optionally specifies a recommended hold duration for the alert. End users can treat it as a suggestion when setting the hold duration in their own alert definitions.

Some examples of alerting and aggregation rules definitions:
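A minimal sketch of what such definitions might look like in Go. Only the names AlertingRule, AggregationRule, expr and recommendedHoldDuration come from the text above; the other fields, the two constructor functions, the rule names and the label names in the expressions are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// AlertingRule is a hypothetical shape for an alert definition.
type AlertingRule struct {
	name string
	expr string // complete PromQL expression
	help string
	// recommendedHoldDuration is optional; zero means no recommendation.
	recommendedHoldDuration time.Duration
}

// AggregationRule is a hypothetical shape for an aggregation rule.
type AggregationRule struct {
	name string
	expr string // complete PromQL expression
	help string
}

// unavailableRangesAlert suggests alerting once any range has been
// unavailable for ten minutes. Label names are assumptions.
func unavailableRangesAlert() AlertingRule {
	return AlertingRule{
		name:                    "UnavailableRanges",
		expr:                    "(sum by(instance, cluster) (ranges_unavailable)) > 0",
		help:                    "One or more ranges are unavailable.",
		recommendedHoldDuration: 10 * time.Minute,
	}
}

// nodeCapacityAggregation sums per-store capacity into a per-node total.
func nodeCapacityAggregation() AggregationRule {
	return AggregationRule{
		name: "node:capacity",
		expr: "sum without(store) (capacity)",
		help: "Total storage capacity of each node across its stores.",
	}
}

func main() {
	alert := unavailableRangesAlert()
	agg := nodeCapacityAggregation()
	fmt.Printf("alert  %s: %s (hold %s)\n", alert.name, alert.expr, alert.recommendedHoldDuration)
	fmt.Printf("record %s: %s\n", agg.name, agg.expr)
}
```

Keeping expr a plain string means the rule author writes exactly what an operator would later paste into Prometheus, with no translation step in between.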

Once defined, these alerts and aggregation rules will be tracked using the RuleRegistry. The RuleRegistry is intended to be a singleton which will be initialized once during CockroachDB server startup and will be used to keep track of all defined rules. The rules can be added to the RuleRegistry using the AddRule/AddRules API.
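The registry described above can be sketched roughly as follows. The names RuleRegistry, AddRule and AddRules come from the text; the Rule interface, the simpleRule type, the NewRuleRegistry constructor, the Rules accessor and the mutex are assumptions added so the example is complete and safe to call from multiple goroutines.

```go
package main

import (
	"fmt"
	"sync"
)

// Rule is a hypothetical minimal interface shared by all rule kinds.
type Rule interface {
	Name() string
}

// simpleRule is a stand-in rule type used only in this sketch.
type simpleRule struct {
	name string
	expr string
}

func (r simpleRule) Name() string { return r.name }

// RuleRegistry tracks every rule declared in the process.
type RuleRegistry struct {
	mu    sync.Mutex
	rules []Rule
}

// NewRuleRegistry would be called once during server startup.
func NewRuleRegistry() *RuleRegistry {
	return &RuleRegistry{}
}

// AddRule registers a single rule.
func (r *RuleRegistry) AddRule(rule Rule) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.rules = append(r.rules, rule)
}

// AddRules registers several rules at once.
func (r *RuleRegistry) AddRules(rules []Rule) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.rules = append(r.rules, rules...)
}

// Rules returns a copy of the registered rules so callers cannot mutate
// the registry's internal slice. Hypothetical accessor.
func (r *RuleRegistry) Rules() []Rule {
	r.mu.Lock()
	defer r.mu.Unlock()
	return append([]Rule(nil), r.rules...)
}

func main() {
	reg := NewRuleRegistry()
	reg.AddRule(simpleRule{name: "UnavailableRanges", expr: "(sum by(instance, cluster) (ranges_unavailable)) > 0"})
	reg.AddRules([]Rule{
		simpleRule{name: "node:capacity", expr: "sum without(store) (capacity)"},
	})
	for _, rule := range reg.Rules() {
		fmt.Println(rule.Name())
	}
}
```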

Finally, to expose this information, we will add an introspection API that will publish all declared rules for the metrics via an HTTP endpoint (api/v2/rules/). This endpoint will export the declared rules in a YAML format. These rules can be used as guidelines by end users such as SREs, TSEs, DB operators and customers for defining their monitoring and alerting configs.

To marshal the rules in YAML format, the PrometheusRuleExporter will encapsulate all logic required to scrape all declared rules from the RuleRegistry and export them in the YAML format.

The PrometheusRuleExporter struct will contain the following details:

In addition, it will expose two methods, ScrapeRegistry and PrintAsYAML, which will scrape the RuleRegistry for all declared rules and marshal them as YAML, respectively.

The HTTP endpoint will expose this marshalled YAML at api/v2/rules/. Example view of how the data exposed through the API will look:

Future Work

We plan to extend this framework to include support for generating automated Grafana dashboards for CockroachDB metrics.

We want to hear from you! Please don’t hesitate to reach out if you have any feedback on features or capabilities you’d like to see with this new framework or on metrics in general. You can post it in the #product-feedback channels of the CockroachDB Community Slack, or call out to us @CockroachDB on Twitter.