Log and error redaction in CockroachDB v20.2

CockroachDB users trust us with their most sensitive data (see: healthcare, finance). And the best way for us to maintain that trust is for Cockroach Labs to never see this data at all.

In CockroachDB v20.2, our tooling can automatically redact users' sensitive data from log files, so that Cockroach Labs never even receives it. Crash report telemetry is always redacted in this way.

Data sharing in CockroachDB deployments

In the default configuration, CockroachDB automatically sends anonymized telemetry data at periodic intervals to Cockroach Labs (see here to turn off diagnostics reporting). This telemetry data is documented online and devoid of details about the data stored in a user’s cluster. Even the SQL metadata, the schema of databases and tables, is anonymized to only keep information about the structural relationships between columns, indexes and other database objects. Moreover, users can completely opt out of this telemetry reporting if they so choose. This has been true ever since telemetry was first introduced, in 2016.

Additionally, when a CockroachDB node crashes or encounters an unexpected error, details about the situation are reported automatically to Cockroach Labs. We currently use Sentry.io as a collector for crash and error events. As with telemetry data, error data has been heavily redacted and fully anonymized ever since this reporting was introduced, in 2017, and users can also opt out of automated reporting.

Finally, CockroachDB continuously prints out details about its behavior into log files. These log files are stored alongside a cluster’s data, by default in the store directory. Log messages are extremely descriptive and spell out the lifecycle of a cluster over time. Therefore, they are invaluable when troubleshooting issues. Naturally, log files are not automatically collected and users must choose to send them to us when asking for help.

When some data is too little data

The text of error messages in CockroachDB can contain bits and pieces from a user’s application: for example, a failed STRING-to-INT conversion can reveal sensitive data in the STRING value.

The assembly of an error message from variables and other dynamic state inside CockroachDB involves multiple parts of the code base. An error or crash payload can even contain data from multiple components or layers in the architecture, as it traverses a distributed cluster to emerge at a SQL client connection boundary. 

There is no single engineer or team responsible for reviewing and editing the composition of all errors. Nor would we want to build such a team, as it would likely become a bottleneck and create friction against future development and our ability to quickly improve and evolve CockroachDB.

Moreover, all the pieces of data that compose an error are glued together as simple character strings. The use of Go’s string type inside error objects is a pervasive feature of Go’s ecosystem, and most of the software dependencies that CockroachDB is built upon rely on this basic abstraction. The problem with simple strings is that they have no internal structure: we have no way to know, at the system boundary, whether a particular word inside a string comes from CockroachDB’s own source code or from data entered by an application or stored in a SQL table.

Therefore, the redaction code for crash and error reporting up to and including v20.1 was extremely conservative, eliminating many useful pieces of information from payloads to avoid the risk of exposing user data. In practice, we often found ourselves unable to investigate crash or error reports because of the lack of context.

This is why we wanted more data included in error and crash reports.

When some data is too much data

Log files are populated with log entries from all components throughout the CockroachDB code base. Naturally, they can contain details about pretty much anything—including configuration data, client details, SQL schema and values.

As with error payloads, log messages are composed as strings. The payload of a single log event can be assembled from components across multiple areas of the source code. As with errors, it would be undesirable to channel the authoring of every log message through a strict review and editing process: the velocity of the team would be seriously impaired. And, as with errors, the use of simple character strings to represent log messages internally is deeply ingrained throughout the Go ecosystem, with many of our code dependencies using a Printf-like abstraction to write to CockroachDB’s own logs.

However, for log files we cannot afford to strip any of this data out before they are stored on disk. When problems arise and a situation needs troubleshooting, a user absolutely needs to know “what happened” so they can successfully recover and move forward. Therefore, log files contain all the details collected by the CockroachDB code; no redaction takes place.

This puts us in a bind when a user finds themselves unable to troubleshoot a situation on their own and approaches us for help. Just like them, we need details about their cluster’s lifecycle. We need the details that are present in their log files. But our users would rightfully prefer that we not get our hands on the sensitive bits of their business: the data they have stored “inside” the SQL tables, or the IP addresses of their private servers, is often inconsequential when troubleshooting hard problems, yet it is included in log files.

This is why our users wanted less data included in the log files they send us for analysis.

A need for semantic boundaries in amorphous strings

The problem we set out to solve was to improve the sharing of data between CockroachDB users and Cockroach Labs, to maximize the amount of non-sensitive data that could be shared, while ensuring that no sensitive data would be shared.

It’s interesting here to consider the asymmetry of the problem statement.

Cockroach Labs can certainly work with incomplete details about clusters deployed by users. It is often possible to troubleshoot problems without a full picture of a situation. Cockroach Labs thus wanted more data, but certainly not all of it, and we were not picky about “how much” — just some more would already have been quite good.

However, many users care deeply that none of their data ever, by any means, leaves their infrastructure. Just a single word from a SQL table leaking out through logs or telemetry could constitute a severe breach of confidentiality and a catastrophic legal liability (imagine the name of a celebrity found in telemetry data for a company that manufactures cancer drugs). For these users, there is no question of tolerating “some” amount of data leak for the greater good of troubleshootability. If asked to choose, they would choose “none”; and when pressed, they would likely drop CockroachDB altogether.

Meanwhile, the structure of data flow inside CockroachDB’s source code is rather promiscuous: variable values flow from one package to another, and there are few boundaries between variables that contain a copy of user data and those that do not. Therefore, inside CockroachDB we are generally not yet able to confidently point to a variable and say “this one is sensitive and needs to be handled with care.”

In fact, the Go language that CockroachDB is built with has an extremely poor type system, which makes it practically difficult to customize data types to separate “sensitive” from “non-sensitive” bits. The most used data type is string, which represents unstructured sequences of characters. Strings can represent anything and are used for pretty much everything: they are largely amorphous and ubiquitous. The code that builds strings, especially errors and logging events, mixes and matches sensitive data with non-sensitive data. To be conservative, a redaction algorithm that is handed mere strings can thus only treat all of the characters inside as sensitive, lest it run the risk of leaking some forbidden words. This, by the way, is what our error redaction code had to do.

This is, incidentally, also why we have always categorically refused to use regular expressions or other means of pattern matching to extract safe information from log files or error reports. There is not enough structure inside CockroachDB strings to make pattern matching work: any pattern that a hopeful operator would design could accidentally include a string built by an application. Since string compositions inside CockroachDB evolve rapidly, as engineers change code when adding features or refactoring code, a pattern developed one day could stop working the next, or suddenly become more inclusive. Just a single accident where a pattern match would report a false positive could yield an unacceptable, trust-breaching data leak.

What was needed instead was to introduce more structure inside those pesky amorphous strings inside the CockroachDB source code. We needed to create a semantic boundary between data that was definitely cool to report to Cockroach Labs, and data that maybe wasn’t.

Finality of strings

Technically, our problem was somewhat non-trivial: Go programmers rely on the ubiquity and promiscuity of strings; it lets them work fast. The Go compiler does not strongly enforce abstract type boundaries between string-like data types, so it remains very easy to convert back and forth. In particular, all the string formatting and composition libraries out there remove any type distinction when composing strings together, so that the final result is a simple, amorphous Go string.

Go programmers consider this a feature, not a bug: we could not readily change it, as going against a programming language’s idioms is a sure way to reduce programmer productivity and the ease of hiring and training new staff.

What we did instead was to recognize a unique property of error objects and log messages in the CockroachDB project: they are not just write-once (all Go strings are); they are final. “Final” here means that they are never used as input to compose more complex strings. 

This is easy to recognize with log messages: the log messages are immediately printed to the output log file, but they are not “read back” by CockroachDB to do other things.

For error objects, the situation is a little subtler. A Go error object may store strings and other data. An error object can be chained with another error object to build a more complex error object. Then, at the end, an error object can be transformed into a string via its Error() method. The key observation here is that the composition of error objects does not cause the strings and other values inside them to be transformed. The chaining of errors with each other does not imply that there is composition of different strings into larger amorphous strings. The conversion of a Go error into its representation as a string only happens if the code that handles the error chooses to convert it into a string, after which it is not an error object any more. In its “object” form with the error type, the strings inside errors are final too.
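
To make this concrete, here is a minimal sketch using only the Go standard library (not CockroachDB’s own errors package): wrapping preserves the inner error as an object, and the flat string only appears when some caller decides to render the error.

```go
package main

import (
	"errors"
	"fmt"
)

func main() {
	// The inner error carries its message as a string, but that string
	// is never read back by the program to build new state.
	inner := errors.New("could not parse value")

	// Wrapping with %w keeps a reference to the inner error object;
	// the chain remains structured data that can be traversed.
	outer := fmt.Errorf("executing statement: %w", inner)
	fmt.Println(errors.Is(outer, inner)) // true: the object is still reachable

	// The flat, amorphous string only appears when a caller decides to
	// render the error, typically at the very end of its life.
	fmt.Println(outer.Error()) // executing statement: could not parse value
}
```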

With this understanding, we could then choose to change the data type and encoding for “final strings”, to introduce a clear separation between sensitive and non-sensitive data.

It did not matter that we did something unusual in Go to achieve that, because there was no other Go code inside CockroachDB that would “consume” this unusual data. Crudely said, nobody in the CockroachDB teams really cares about how such final data is represented or stored—they never get to see it or manipulate it in their day-to-day job, unlike the other strings in their code.

Redactable strings at error and log boundaries

We thus introduced a new data type in our code base: final strings in log messages and error payloads are now represented by a type called “redactable string”, or RedactableString in Go.

(We have implemented this data type and the various facilities around it in a public standalone library, for free reuse by the Go community: see https://github.com/cockroachdb/redact )

A redactable string largely behaves like a Go string, except that it delineates sensitive data with the redaction markers ‹ … ›. These are the Unicode single angle quotation marks, code points U+2039 and U+203A. We chose them because they are visually discreet and extremely unlikely to appear in data that is useful for troubleshooting.

Then, redactable strings provide a method called Redact, which automatically strips all marked sensitive data and replaces it with “‹×›”. We have added calls to this feature in various parts of CockroachDB, for example in the cockroach debug zip command, which automatically collects log files.
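
As a rough sketch of what this looks like with the standalone redact library (assuming its Sprintf, Redact and StripMarkers helpers behave as outlined above; the outputs shown in the comments are illustrative):

```go
package main

import (
	"fmt"

	"github.com/cockroachdb/redact"
)

func main() {
	// The format string is a literal constant and is considered safe;
	// the argument stands in for data coming from a user's application
	// and is treated as sensitive by default.
	appValue := "123 Main St."
	s := redact.Sprintf("conversion failed for value: %s", appValue)

	// The redactable string keeps the sensitive part between markers,
	// e.g.: conversion failed for value: ‹123 Main St.›
	fmt.Println(s)

	// Redact() erases everything between markers:
	// conversion failed for value: ‹×›
	fmt.Println(s.Redact())

	// StripMarkers() removes the markers but keeps all the data; this
	// is the form that ends up in local log files on disk.
	fmt.Println(s.StripMarkers())
}
```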

To make this abstraction robust, when composing a redactable string from regular (or other kinds of) strings, the composition automatically strips any occurrences of the redaction markers in the input data and replaces them with a generic question mark “?”. This ensures that data containing the markers does not accidentally break the string boundaries during composition. We do not mind the fact that this may cause partial information loss, because again these strings are “final”: geared towards external output for communication to humans during troubleshooting, not communication between systems or data preservation.

Moreover, we chose to ensure that the default and easiest way to build redactable strings from other Go strings is to consider the input string as sensitive. This means that the easiest way to create a redactable string conservatively places the entirety of the resulting composition in-between redaction markers. This way, when we introduced redactable strings in the CockroachDB project, we did not need to revisit all the existing logging and error code: we could simply let it “do its thing” with the comfort and confidence that all the resulting strings would still be considered sensitive and not risk being leaked.
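
A sketch of that conservative default, again assuming the library’s Sprint helper and with illustrative outputs in the comments:

```go
package main

import (
	"fmt"

	"github.com/cockroachdb/redact"
)

func main() {
	// A plain Go string built by pre-existing code, about which we know
	// nothing; it could contain user data.
	unknown := "some message assembled by legacy code"

	// The easiest conversions treat the whole input as sensitive, so the
	// entire text ends up between redaction markers:
	//   ‹some message assembled by legacy code›
	s := redact.Sprint(unknown)
	fmt.Println(s)

	// After redaction nothing of it remains: ‹×›
	fmt.Println(s.Redact())
}
```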

This conservative approach may at first look like it did not buy us much, since most strings would still be redacted out. That was done on purpose: we deliberately assumed that all this error and logging data was sensitive until we had built definite confidence that it was not.

We then gradually opted certain strings out of this “sensitive” status.

The main and most powerful mechanism we implemented was to consider every literal constant string inside CockroachDB’s source code as “safe”, i.e. non-sensitive. Literal constants are those things spelled out “as-is” inside the source code. Of course, the Go compiler does not help libraries distinguish literal from non-literal strings, so we achieved that in a roundabout way. We defined certain parts of our API by fiat to only take literal strings as arguments. Then, we added linter programs that enforce this rule via source code analysis during CI runs. This is the current idiomatic way to extend Go’s type system—by enforcing it “externally”. (We could perhaps choose to tweak the Go compiler instead, but for now we want to preserve the ability of our community to build their own CockroachDB binaries using the standard toolchain.)

Another wide-impact choice we made was to mark certain Go data types as “always safe”. For example, we automatically consider simple integers and durations as “safe” during the composition of redactable strings. This ensures, for example, that variables that represent range IDs and timeouts can always be included in reported data as “safe”. This does not include, however, SQL values with e.g. INTEGER or INTERVAL types, as these are represented inside CockroachDB using different (non-simple) data types.
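
As an illustration of what such a type-level opt-in can look like with the standalone library: rangeID below is a hypothetical stand-in type, RegisterSafeType is assumed to behave as in the library’s documentation, and the outputs in the comments are illustrative.

```go
package main

import (
	"fmt"
	"reflect"

	"github.com/cockroachdb/redact"
)

// rangeID is a hypothetical identifier type whose values we decide are
// always safe to report.
type rangeID int64

func main() {
	// Registering the type tells the library to format its values
	// without redaction markers.
	redact.RegisterSafeType(reflect.TypeOf(rangeID(0)))

	userValue := "alice@example.com" // stand-in for application data
	s := redact.Sprintf("r%d: rejected write for %s", rangeID(42), userValue)

	// Illustrative output:
	//   r42: rejected write for ‹alice@example.com›
	fmt.Println(s)
	//   r42: rejected write for ‹×›
	fmt.Println(s.Redact())
}
```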

After that step, we also explained throughout the team how redactable strings work, so that certain parts of CockroachDB could carefully and manually opt into the special rules around them. We have also established some automation to ensure that any unusual use of redactable strings pops out clearly during code reviews.

The mental overhead to think about this remains low, however, since all this work is occurring only for “final” strings in our logging and errors infrastructure. In fact, most of the associated complexity is fully abstracted behind CockroachDB’s util/log and errors packages.

More details about the design, as well as a review of the alternatives we considered, can be found here: https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20200427_log_file_redaction.md 

Look and feel in CockroachDB v20.2

The most visible impact of this work is the appearance of redaction markers inside CockroachDB log files, starting in v20.2. Users will notice that any piece of data in log files that comes from their configuration or their cluster’s stored values is enclosed between redaction markers. They will also likely notice that redaction markers enclose things that most definitely do not look sensitive. This is a by-product of our choice to be conservative: those data items were not literal constants in the CockroachDB source code, and we still need to review them to determine which are definitely not sensitive. Until we do this work, we do not assume they are safe, and so they get redaction markers, just like users' data.

Separately, we have introduced a new command-line flag, --redact-logs, in the commands cockroach debug zip and cockroach debug merge-logs. These are the commands we document as instruments of data collection when submitting support cases. The new flag ensures that any data inside redaction markers, including all the users' sensitive data, is erased before it is sent to Cockroach Labs.

(Note that cockroach debug zip at this time only knows how to redact sensitive bits out of log files. Sensitive bits in other non-log files are not edited out. Proceed with care. This is tracked as github issue #52470.)

Finally, we have also rewritten our Sentry.io reporting code to use the redactable strings from error payloads. As expected, the error redaction removes anything between redaction markers before it is sent to Sentry. This is not directly visible when operating CockroachDB normally, but a user can verify it by proxying and inspecting the traffic between CockroachDB and errors.cockroachdb.com.

Next steps for log and error redaction

The work on log and error redaction that led to v20.2 was primarily concerned with the relationship between Cockroach Labs and CockroachDB users. In that relationship, any data that can identify a CockroachDB user, like their IP addresses or hostnames, is just as sensitive as the data stored inside their clusters. From that perspective, we only needed a binary distinction between “non-sensitive data”, which is data confidently known to be safe to be seen by Cockroach Labs, and “sensitive data”, which would be everything else.

Some of our customers have already approached us to explain that this distinction is too simplistic. Since we built this feature, our community has taught us that users care about another distinction: the one between “operational” data and “application” data.

This distinction is similar to the one we made already, but needs to occur entirely on the user’s side: it establishes a data boundary between the application developers and the DBAs and system administrators that operate a CockroachDB cluster. 

In that relationship, it is customary to prevent the DBA or support engineer at the user’s organization from seeing sensitive data produced by applications. For example, a bank’s DBA may need to see operational data that pertains to networking problems, disk access errors, and so on, but must be prevented from seeing the names of account holders and their statements.

Today, CockroachDB’s redaction markers capture both operational and application data under a single label: “sensitive” data. Our users find this classification too coarse, and instead wish for an additional distinction between “operational” and “application” data in log files and error payloads. As we want to enable more CockroachDB deployments, in particular in larger Enterprise organizations with multiple departments and different data access policies, we will need to extend our mechanisms in this direction.
