Be flexible and consistent: JSON comes to CockroachDB

We are excited to announce support for JSON in our 2.0 release (coming in April) and available now via our most recent 2.0 Beta release. Now you can use both structured and semi-structured data within the same database. No longer will you need to sacrifice ACID guarantees, accuracy, or the ability to scale in order to use multiple data models within the same database. This post will explain how we implemented JSON and give you a few examples of how JSON can be used to model your data.

How can I use JSON in CockroachDB?

Imagine that you run a service that allows users to keep track of their insect collection. Some pieces of structured data might be obvious to you as a developer, but many may not. Using a JSONB column gives you the flexibility to store whatever data you want in a way that's still queryable by the database. Consider the case of modeling insect collections. First, you might create a table that allows for storing JSON data.

CREATE TABLE insects ( id UUID PRIMARY KEY, user_id UUID REFERENCES users, species STRING REFERENCES insect_species, acquisition_date TIMESTAMP, name STRING, metadata JSONB, INDEX (species) );

As you can see from this data, only Craig has a nickname (e.g., “Craiggy”). As a developer (and not your application's end user), you might not have considered the need for nicknames as not every customer gives a nickname to their insects.

For a full demo of JSONB, please consult our documentation.

Now that we’ve covered one example in which you might want to use JSONB within a relational database, let’s dive deeper into the implementation details.

Postgres design

After reviewing the above example, you may notice that we implemented JSONB within CockroachDB based on the Postgres design. We chose to do this so that our users are not forced to learn another new syntax to support JSON. This means that you can use JSON in CockroachDB via the vibrant third-party Postgres ecosystem. Further, applications developed initially for Postgres will largely work out-of-the-box with JSONB in CockroachDB.

What is JSONB?

CockroachDB implemented JSONB rather than JSON. Postgres introduced JSONB as an optimized version of JSON strings. This means that it is in a binary format that doesn’t require parsing the entire document. We chose to support JSONB over JSON because our research showed that JSONB was far more popular. Further, storing the data in an optimized binary format (JSONB) rather than text (JSON) allows for quicker operations.

To illustrate the difference, suppose someone hands you the JSON metadata describing a insect:

{“favorite_food”: “crumbs”, “life_history”: [...], “best_friend”: “Cynthia”}

A common query run on this could be: who is this cockroach's best friend? If the document was stored as plain JSON, you’d have to read through it field by field until you get to the “bestfriend” field—a slow endeavor when the field you’re interested in is stored behind pages of life history. If the metadata is instead stored as JSONB, a header at the beginning of the document tells you how many bytes to skip so that you can jump directly to the bestfriend field. The difference is negligible for one document, but substantial when scanning through potentially millions of documents.

On-disk encoding

We evaluated encoding based on the speed at which we could perform operations on the data and the number of bytes a given datum took to store.

As opposed to storing semi-structured data as an opaque sequence of bytes, JSONB can be manipulated from within the database. Because we expect our users to perform queries upon JSONB, we designed commonly used operations to avoid costly speed delays (e.g., decoding the entire document).

For example, JSONB allows this query to efficiently extract a nickname field (metadata>’nickname’) from a JSON document for a particular known event:

SELECT metadata->’nickname’ FROM insects WHERE species = ‘cockroach’

The query above can extract the “nickname” field from the metadata without decoding the entire JSON document thanks to the efficient JSONB encoding.

Unsurprisingly, we encoded JSONB to take up as few bytes as possible. Because certain byte patterns can be more or less amenable to compression, any solution for JSONB encoding also needed to make good use of the compression algorithms already present in CockroachDB. We used an approach adopted from Postgres in order to improve the compressibility of our encoding, which you can read more about in our RFC.

Unpredictably shaped data

Suppose you overheard your friend talking about their favorite insect, Craiggy. You know that your friend previously mentioned Craiggy’s species, but you can’t remember it now. Recalling the different fields present in JSONB, you decided to construct the following query to jog your memory:

SELECT species FROM insects WHERE metadata->’nickname’ = ‘Craiggy’

This query looks similar to the one above, but notice that this query now filters by JSON column. When filtering by species, the database can perform a quick lookup by using the index on species, but filtering by metadata->’nickname’ requires a full table scan because there is no index on that field. Further complicating matters, the ‘nickname’ field is not guaranteed to exist in the metadata JSON at all. Many developers would prefer to index on the important fields at the time of development, however, this can be challenging to predict because the JSON structure in a given column is often heterogeneous. What if you later need to filter by metadata->’best_friend’?

We expect customers to consistently need solutions for unpredictable data. This is why we decided to implement inverted indices (known as a GIN index in Postgres) alongside JSONB. Because inverted indices do not require you to know, or specify ahead of time, indexed fields, users don’t need to pre-specify the shape of their data. This allows you the flexibility to update your application as your business needs to change, so that you can always efficiently query your data.

CREATE INVERTED INDEX ON insects (metadata)

This index allows us to efficiently query any field in the metadata JSON, no matter what information our customers want to store. You could work around this lack of knowledge about your customers desired data storage by indexing a cross section of all fields with comparatively cheap storage. However, at this point you’d have a large amount of data indexed that does not provide additional value. Ideally, you would wish to index JSON columns on precisely the fields they each contain. Luckily, this is exactly how inverted indexing with JSONB works in CockroachDB.

Implementation of Inverted Indices

You can view a JSONB document as a set of paths from the root to the leaves.

For example, the following JSON document has four paths from the root to a leaf:

{ "a": {"b": 1} "c": {"d": 2, "e": 3}, "f": 4 } a/b/1 c/d/2 c/e/3 f/4

An inverted index creates one index entry for each of these, allowing efficient querying for any kind of JSONB document. For example, a user might ask for all documents x for which x.a.b equals 7. These queries are done using the JSONB “containment” operator:

SELECT x FROM some_table WHERE x @> ‘{“a”: {“b”: 7}}

Before we introduced JSONB and inverted indices, CockroachDB required you to know which fields need indexing in advance. Now, a single inverted index can support the containment queries you know about today as well as lay the foundation to support any future containment queries you might need.

If you’re curious to peek behind the curtain a bit more, we explored encoding, inverted indices, and more within our public RFC process.

When should I use structured data instead of JSONB?

JSONB performs best for frequently read and infrequently written data. Due to CockroachDB’s use of MVCC, a modification to a piece of data requires copying and retaining the old version of that piece of data. Further, since we replicate data three times to increase survivability, rewriting large pieces of data many times can hurt performance due to write amplification.

We know that many developers find JSON to be an intuitive and useful tool. However, using JSON will be less effective when a traditional relational schema will suffice. For example, every JSON document inefficiently stores all of its keys when compared to relational data (though this is alleviated somewhat due to our compression). Therefore, you should use JSON to efficiently manage rows with different types of values and hierarchical data. You can also use JSON to conduct rapid prototyping.

How do we work with our open-source community?

CockroachDB loves open-source contributions! Since JSON was our most requested feature on GitHub (coming in with more than 80 thumbs up, smileys, party hats, and hearts), we knew we’d like to get the community involved in its implementation. But how could we get you involved?

We opened a GitHub issue soliciting help from our community to implement the long tail of functions that operate on JSONB columns. While many people contributed to our final product, we’d like to offer a special thank you to Yihong He, who submitted PRs that covered multiple built-in functions.

As always, we welcome the community’s involvement in CockroachDB and look forward to more open-source contributions.

Try JSONB!

In the future, we will dive deeper into JSONB use cases to further explain how you can use CockroachDB and JSONB to make data easy.

You can try JSONB before it’s available in GA by downloading the latest 2.0 beta.

Illustration by Wenting Li