Distributed tracing and performance monitoring in CockroachDB

When you’re working with distributed systems, data storage and retrieval aren’t as straightforward as they are in legacy monolithic databases. This brings advantages like resilience and high availability, but it also makes it challenging to monitor the performance of a given transaction. Query execution is often an extremely complex web of interactions, so tracking a performance bottleneck down to its root cause can be difficult and sometimes frustrating. For this reason, we added distributed tracing to our UI, and the tutorial below walks through how to use it.

What is Distributed Tracing?

Distributed tracing is a method of monitoring application performance: it follows the path of a query through the system to identify where problems that affect performance arise.

In Kubernetes environments, distributed tracing is a common way to address this challenge. Companies like Lightstep are leading the way in simplifying observability analysis for these complex environments. At Cockroach Labs, we added distributed tracing for transactions as an option in CockroachDB 20.1.

Distributed Tracing in CockroachDB vs Lightstep

It’s important to distinguish between ‘distributed tracing’ as it’s understood in the field of Observability and the distributed tracing that you have access to in CockroachDB.

CockroachDB implements the OpenTracing standard, so you can use tools like Jaeger, Zipkin, and Lightstep to trace transactions within CockroachDB. Companies in the Observability field instrument multiple layers of their stack to track a request as it’s shuttled through various services. This allows them to follow a request from the activity outside the database through to the activity inside it.

Here’s a simple example that paints the picture of how distributed tracing works in CockroachDB:

Let’s say people are waiting too long after clicking a button in your application. In CockroachDB you can use a trace to determine what happens when that button is clicked: What is happening in the background? Where is time being spent? What happens in parallel, and what is blocked waiting on other things to finish? First you’ll see that the database request is taking too long to return. Then you’ll isolate which query is the slow one, after which you’ll run a trace in CockroachDB for that specific query or transaction to see what made the request slow. Now you’re ready to fix the issue and improve your application’s performance.
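To make those three questions concrete, here is a minimal Python sketch of how a trace can be read once you have one. It is an illustration only: the Span record, the helper functions, and the sample timings are all hypothetical and are not CockroachDB’s trace format or API. It assumes a trace is a list of named spans, each with a start time, an end time, and a parent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """Hypothetical span record: one timed step inside a traced request."""
    name: str
    start_ms: float
    end_ms: float
    parent: Optional[str] = None


def ran_in_parallel(a: Span, b: Span) -> bool:
    """Two spans ran in parallel if their time intervals overlap."""
    return a.start_ms < b.end_ms and b.start_ms < a.end_ms


def self_time_ms(span: Span, spans: List[Span]) -> float:
    """Time spent in a span that is not covered by any of its children."""
    children = sorted(
        (s for s in spans if s.parent == span.name),
        key=lambda s: s.start_ms,
    )
    covered = 0.0
    cur_start: Optional[float] = None
    cur_end = 0.0
    # Merge overlapping child intervals so parallel work is not counted twice.
    for child in children:
        if cur_start is None:
            cur_start, cur_end = child.start_ms, child.end_ms
        elif child.start_ms <= cur_end:
            cur_end = max(cur_end, child.end_ms)
        else:
            covered += cur_end - cur_start
            cur_start, cur_end = child.start_ms, child.end_ms
    if cur_start is not None:
        covered += cur_end - cur_start
    return (span.end_ms - span.start_ms) - covered


def bottleneck(spans: List[Span]) -> Span:
    """The span with the largest self time is where the time is really spent."""
    return max(spans, key=lambda s: self_time_ms(s, spans))


# Made-up timings for one slow button click (not real measurements).
trace = [
    Span("transaction", 0, 100),
    Span("scan orders", 5, 25, parent="transaction"),
    Span("scan customers", 10, 30, parent="transaction"),
    Span("wait on lock", 30, 95, parent="transaction"),
]

slowest = bottleneck(trace)
print(f"slowest step: {slowest.name} ({self_time_ms(slowest, trace)} ms)")
print("scans overlap:", ran_in_parallel(trace[1], trace[2]))
```

Measuring self time rather than total duration keeps the enclosing transaction from always looking like the culprit, and merging overlapping child intervals avoids counting parallel work twice.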

Demo of Distributed Tracing

In this demo video, Senior Product Manager Piyush Singh points a TPC-C workload at a CockroachDB cluster and walks through how to troubleshoot queries directly in the Admin UI. You’ll see how he identifies the performance issue in the Admin UI and then opens a diagnostic report, which tells the system to trace the next query whose fingerprint matches the slow query. Along with the trace, you’ll receive other diagnostic information that you may need to troubleshoot query performance.

To learn more about distributed tracing and how to troubleshoot your queries, check out our docs on diagnostics reporting in CockroachDB. You may also be interested in this documentation on different ways to make your queries faster (one of which is to diagnose with a trace).