Enriching Log Messages Using Go Contexts

Building a complex system requires writing and debugging many tests. Developer tests that can be run without any infrastructure or configuration are crucial for allowing fast-paced development while avoiding regressions. Even for a distributed system like CockroachDB, tests that run on a single anode (e.g. as part of go test) can cover many aspects of the system: one technique we use is that of creating a virtual test cluster with multiple CockroachDB nodes, all running within the same process.

We typically use log messages to understand what is happening as part of a test and to identify problems. This blog post goes over how we ensure that the log messages contain enough information to be useful without encumbering the development of new code.

Log tags

Let’s take an example of a log message:

log.Infof(ctx, "streamed snapshot: kv pairs: %d, log entries: %d", n, len(logEntries))

This message indicates that we transferred a streaming snapshot. By itself, the information in this message is unlikely to be useful during debugging: we can have many ranges in multiple stores, so we need to identify which range and store this message pertains to.

The immediate solution is to simply add information about the node, store, and range to the message itself. However, this solution has several problems:

  • It is tedious and repetitive: we have to include extra information in every message.
  • It leads to inconsistent formatting for the same information – as an example, at some point all of the following were in use among our log messages: range=5, r5, RangeID=5, RangeID:5.
  • Every layer needs to have the information for all of the above layers (for example, the code issuing this message needs to know the store and the node).
  • The determination of which layers to show information for is fixed. If we decide to add another layer – for example, imagine we want to run a test with multiple clusters and we want each log message to indicate which test cluster it pertains to – we need to go through all layers and manually add the extra information.

To address these shortcomings, we decided to pass the information that we want to associate with log messages through a context.Context. Note that plumbing Contexts throughout our code is necessary regardless in order to be able to correlate operations spanning multiple nodes (distributed tracing). So the solution is to embed a chain of logging tags into Contexts (via an API that appends a log tag to a context) and to always pass a context when logging a message; the logging library inspects the tags embedded in the context and includes them in the message. Example:

ctx := context.Background()
log.Infof(ctx, “test”)    // Message: “test”
ctx = log.WithLogTag(ctx, “n”, 1)
log.Infof(ctx, “test”)    // Message: “[n1] test”
ctx = log.WithLogTag(ctx, “s”, 2)
log.Infof(ctx, “test”)    // Message: “[n1,s2] test”
ctx = log.WithLogTag(ctx, “bootstrap”, nil)
log.Infof(ctx, “test”)    // Message: “[n1,s2,bootstrap] test”

In some cases we want the value of a tag to be dynamic. For example, the node ID might not be known during early initialization time; or, the start and end keys that define a range might change over time. These cases are supported by passing an object as the tag value; the object is stringified every time we issue a message, allowing publishing of the up-to-date value via a String() function.

Here is how our example log shows up with the proper log tags:

I161119 08:37:36.515739 3313 storage/store.go:3134  [n1,s1,r8/1:/Table/51/1/{700-800}] streamed snapshot: kv pairs: 112, log entries: 9

We know that this event happened on node 1, store 1, range 8, replica 1 and the range holds keys between /Table/51/1/700 and /Table/51/1/800. Note that the format of this information is intentionally concise because it appears in almost every log message.

Initial solution: background contexts

We want each layer to inherit the logging tags of the parent layer, and possibly expand them. Our initial solution was to equip each layer with a “background context”. The Node has a background context which contains the node ID log tag (in the example above, n1). When creating a Store (which conceptually is a “child” of a node), the node passes its context and the store adds a store log tag (s1) to create its background context. Similarly, the Replica background context is derived from the store context, with the addition of the range and replica log tag (r8/1:/Table/51/1/{700-800}).

These contexts can be used directly for periodic background operations associated with each layer. But most operations – for example those that are part of a SQL statement – have a context.Context of their own. We want to use the operation context wherever possible (to support cancellation and tracing), but we want to also have the log tags of the proper layer. To do this, we have a WithLogTagsFromCtx function which can copy the log tags from a background context to an operation context (skipping any duplicate tags). Most layers settled on the pattern of providing a logContext wrapper and using it in functions receiving an outside context:

// logContext adds the node, store and replica log tags to a context. Used to
// personalize an operation context with this Replica's identity.
func (r *Replica) logContext(ctx context.Context) context.Context {
    // Copy the log tags from the base context. This allows us to opaquely
    // set the log tags that were passed by the upper layers.
    return log.WithLogTagsFromCtx(ctx, r.ctx)
}

func (s *Store) Send(ctx context.Context, ...) ... {
    // Attach any log tags from the store to the context (which normally
    // comes from gRPC).
    ctx = s.logContext(ctx)
    ...

A cleaner solution: the AmbientContext

The solution outlined above works, but it has some disadvantages that became nagging in time. A lot of code was switched to directly use long lived background contexts (like r.ctx); code which was lacking correct plumbing of operation contexts used to be implicitly marked by the use of context.TODO() but now looked like it has a “legitimate” context.

When reading code, it became hard to track whether the context passed to a function was coming from an operation context or from a background context. The proliferation of background contexts was also problematic for tracing: logging using a background context does not generate any tracing events since the context is not associated with any particular operation.

This culminated in a proposal for a new solution. The idea was to standardize an interface similar to logContext and rely solely on that interface, avoiding storing and using long-lived background contexts directly.

The implementation involves a new structure called AmbientContext which stores log tags and can be used to add its tags to a Context. Instead of storing a background context.Context in each layer, we now store an AmbientContext. Similar to before, each layer is created using the parent layer’s AmbientContext onto which new log tags are added.

When annotating an operation context, AmbientContext.AnnotateCtx() can be used just like logContext above (note that AmbientContext is embedded into the Store structure):

func (s *Store) Send(ctx context.Context, ...) ... {
    // Attach any log tags from the store to the context (which normally
    // comes from gRPC).
    ctx = s.AnnotateCtx(ctx)
    ...

When a context is used in a function for which an operation context is not plumbed, we can use AnnotateCtx(context.TODO()). This is effectively equivalent to the previous solution where we used the background context directly; however in the code we now have an explicit indication of the fact that we are not using a proper operation context.

AmbientContext also has a variant which creates a tracing span, which is a container for a set of events related to an operation and allows correlation of events happening on multiple nodes (as part of that same operation). This is appropriate for periodic background operations:

ctx, span := p.AnnotateCtxWithSpan(context.Background(), "ts-poll")
defer span.Finish()
if err := p.db.StoreData(ctx, p.r, data); err != nil {
    ...

This code creates a tracing span with operation name “ts-poll” and an associated context. Events associated with the StoreData operation are recorded within this trace; in particular all log messages (like log.Infof in our example) used with ctx will be recorded as events in this trace. In most cases of background operations, creating the span is the right thing to do, as it enables visualisation of events in a tool like LightStep. This is the biggest advantage of AmbientContext: whereas in the old code we were content with directly using something like p.ctx, now we have to make an explicit decision to create a context and possibly a tracing span. Indeed, while refactoring code to use the new model, many internal operations were corrected to use tracing properly.

Using enriched log messages

We found that using enriched log messages has helped our debugging efforts, although the only metrics I have are in measures of sanity. If you ever struggled with log messages that lacked context, give this model a try in your project!

Share your thoughts on the HackerNews thread.

How does CockroachDB compare to Cassandra or MongoDB?

Read the FAQ