High-performance JSON parsing in Go
JSON is a ubiquitous data interchange format supported by many systems, including CockroachDB. Historically, most of our JSON efforts at CockroachDB focused on supporting various JSON functions, operators, and inverted indexes, while the JSON parser itself received little attention. And for a good reason.

If you are reading this blog, then you probably already appreciate Go. There are many aspects to like in Go, and one of them is an excellent collection of standard (and third-party) libraries. Because Go ships with a “jack-of-all-trades” JSON parser out of the box (the encoding/json package), it made sense for CockroachDB to use the standard parser when converting strings to our internal JSON representation. However, at some point, we began to wonder if the time had arrived to invest in finding a better, more performant alternative.

This blog post is an exploration of JSON parser performance, and, ultimately, a description of the high-performance JSON parser used in CockroachDB. 

So, let’s jump into the world of state machines, parsing, performance, and all things JSON.

Understanding the problem: an inefficient ParseJSON function

A good performance exploration should start with a reason. Why is it necessary to improve a particular performance aspect? Is it worth the effort? This exploration begins with a customer who casually mentioned that they use a BYTES column in a table, even though they insert JSON. The reason they cited was that their INSERT-heavy workload was multiple orders of magnitude slower when inserting into JSONB columns. They preferred to use JSONB, but couldn’t due to performance considerations.

Naturally, a conversation like that piques one’s curiosity: are we as slow as the customer claims? (Side note: the customer is always right, but sometimes their conclusions are not correct.)

Turns out, in this case, the customer was correct.  

We don’t need to look further than the ParseJSON function, which is invoked whenever CockroachDB needs to convert an input STRING into JSON. For example, the query INSERT INTO table (json_a, json_b) VALUES ('{"key": "value"}'::JSONB, '["a", "r", "r", "a", "y"]'::JSONB) invokes ParseJSON twice: once to produce a JSON object, and once to produce a JSON array of strings.

ParseJSON is obviously inefficient. So much so that it included a helpful comment explaining why: “Parsing goes in two phases - first it parses the string into raw interface{}s using the Go encoding/json package, then it transforms that into a JSON”.

A two-phase approach makes sense – it’s simple and it’s small – but it is obviously inefficient. Read on to see just how inefficient.
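
To make the two phases concrete, here is a minimal sketch of that shape. The JSON type and the convert helper are hypothetical stand-ins for CockroachDB’s internal representation and conversion code:

package jsonparse

import (
	"encoding/json"
	"strings"
)

// JSON stands in for CockroachDB's internal JSON representation.
type JSON interface{}

// parseJSONTwoPhase sketches the approach the comment describes: phase 1
// decodes into untyped interface{} values via encoding/json; phase 2
// transforms that intermediate tree into the internal representation.
func parseJSONTwoPhase(s string) (JSON, error) {
	var v interface{}
	dec := json.NewDecoder(strings.NewReader(s))
	dec.UseNumber() // keep numbers as json.Number rather than float64
	if err := dec.Decode(&v); err != nil {
		return nil, err
	}
	// Every map, slice, and string allocated above exists only to be
	// transformed by phase 2 and then discarded -- hence the waste.
	return convert(v)
}

func convert(v interface{}) (JSON, error) {
	// Hypothetical stand-in for the real interface{} -> JSON conversion.
	return JSON(v), nil
}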

How to write a benchmark to measure the performance of the ParseJSON function

Before we can begin work on a performance improvement, we must understand where the existing implementation falls short. To do this, we write a benchmark to measure baseline performance: just how fast (or slow) is ParseJSON?

There is one more ingredient we need in order to understand performance: what is a representative input for our benchmark? Some customers insert small JSON blobs, some large; some use simple JSON objects, others use highly complex, nested structures. Luckily, we already have a testing library that generates JSON objects of varying complexity and size.


Our job, therefore, is straightforward: generate a good corpus of JSON objects of varying sizes and complexity, and write a benchmark to measure the performance of the ParseJSON function.
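
A sketch of such a benchmark is shown below. ParseJSON and genJSONInputs stand in for the real parser and our corpus generator; the pattern of accumulating input bytes and calling b.SetBytes is what produces the MB/s column in the results that follow:

package jsonparse

import (
	"fmt"
	"testing"
)

func BenchmarkParseJSON(b *testing.B) {
	for _, maxLen := range []int{128, 512, 4096, 16384, 262144, 524288} {
		inputs := genJSONInputs(maxLen) // corpus with strings up to maxLen long
		b.Run(fmt.Sprintf("maxLen=%d", maxLen), func(b *testing.B) {
			b.ReportAllocs()
			var totalBytes int64
			for i := 0; i < b.N; i++ {
				in := inputs[i%len(inputs)]
				totalBytes += int64(len(in))
				if _, err := ParseJSON(in); err != nil {
					b.Fatal(err)
				}
			}
			// Report throughput based on the average input length.
			b.SetBytes(totalBytes / int64(b.N))
		})
	}
}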

The results of the benchmark on an n2d-standard machine (AMD EPYC 7B13 processor, 2.5GHz, 32 cores) are shown below:

Max Len   Iterations   Time/op (ns)   MB/s     Avg. Input Len   Alloc B/op   Allocs/op
128       997956       6566           25.59    168.8            2439         39
512       593576       8887           47.94    426              3245         39
4096      190588       30081          97.14    2922             12976        41
16384     57171        101695         111.13   11302            45261        41
262144    3507         1477386        126.88   187453           723096       44
524288    1824         2800348        126.68   354745           1362211      45

The input data in the above benchmark had no escaped characters (no Unicode escapes, no escaped quotes, etc.).

The “Max Len” column indicates the maximum length of the strings inside the randomly generated JSON values – this is not the same as the length of the JSON input itself. The “Avg. Input Len” column indicates the average length of the entire input.

The MB/s column indicates the throughput of parsing the input JSON. Note that the smaller-input benchmarks have lower throughput. This makes sense, since we simply are not driving enough data through the parser to get a representative throughput number.

First, the good news: ParseJSON, using Go standard encoding/json, has fairly stable allocation behavior – as our input size grows, the number of allocations per op stays fairly constant.

Now, the bad news: allocated bytes per op grow faster than linearly. For the largest input benchmark, the average input length was ~350KB, yet we allocated almost 1.3MB to handle that input.

If you look at the small inputs, the memory allocation behavior is, arguably, worse: for an average input size of 168 bytes, we allocated 2439 bytes – more than 14 times the input size.

As input size increases, the time to process the input grows to 2.8 milliseconds per op. Some growth is to be expected, but we will try to answer whether this increase is excessive.

If we allow the input to contain escaped characters, such as the “smiley face” emoji (U+1F600), the performance of encoding/json drops proportionally. For example, if we allow up to 3% of the input strings to contain escaped characters, the performance of the base benchmark is as follows:

Max Len   Iterations   Time/op (ns)   MB/s     Avg. Input Len   Alloc B/op   Allocs/op
128       1042497      5577           31.02    173.6            2535         40
512       551197       10177          44.51    453.1            3753         41
4096      139592       40349          75.79    3058             16700        42
16384     49492        114133         104.78   11959            60130        43
262144    3232         1912541        101.41   193950           947012       46
524288    1501         3582581        108.62   389153           1864626      46

And allowing for almost “binary” data, with 40% escaped characters:

Max Len   Iterations   Time/op (ns)   MB/s     Avg. Input Len   Alloc B/op   Allocs/op
128       902941       7192           28.22    203.7            2679         41
512       452046       12044          43.26    521.7            4004         40
4096      97638        55651          70.58    3929             20314        42
16384     29611        200888         74.64    14996            72766        43
262144    1892         3033428        78.91    239361           1127602      47
524288    1026         5694102        78.73    448318           2123032      46

Time per operation grows from 2.8 ms/op to 5.7 ms/op when processing heavily escaped, large inputs.

With these results in mind, we now have an immediate target for performance optimization: namely, we will try to improve allocation behavior. We could choose to work on improving time per op, or throughput, but we won’t, because allocation behavior offers the largest opportunity for improvement. ParseJSON is inefficient – we know that because it is a two-phase implementation. Removing the intermediate step will certainly yield better allocation behavior.

In addition, improving allocation efficiency not only provides the immediate benefit of fewer, smaller allocations; it also improves speed (fewer allocations mean lower time per op) and overall system efficiency, since reduced pressure on the Go garbage collector ultimately improves the performance of the system as a whole.

To rewrite or not to rewrite…?

That is the eternal question. A good rule of thumb is to start with “no, let’s not rewrite,” and then see whether the evidence convinces you to try something new.

We know that we are looking for a library that has better allocation behavior, allows us to implement a single-pass conversion from string to our internal JSON representation, and (hopefully) runs faster than encoding/json.

The good news is that there are many libraries available that claim to be faster than the standard Go library – numerous, one might even say overwhelmingly numerous, options. But which one is the best?

If you are looking for a silver-bullet answer to that question, I’m sorry to disappoint you: it really depends on your use case. For example, ffjson is reasonably fast, but it is not appropriate for our use case because it relies on code generation, whereas we need to convert arbitrary input strings (i.e. schemaless inputs) into our internal JSON representation. This requirement eliminates ffjson, along with other libraries that rely on code-generation techniques.

Other libraries, such as jsonparser, appear to be optimized for retrieving a single key, or a handful of keys, as the example below illustrates. We, however, need to convert the entire input into our internal JSON representation, and thus we would need to traverse every value in the parsed JSON – likely resulting in suboptimal performance.
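
For illustration, here is the retrieval style jsonparser is built around (a minimal example; the document and key path are made up):

package main

import (
	"fmt"

	"github.com/buger/jsonparser"
)

func main() {
	data := []byte(`{"person": {"name": "gopher", "age": 13}}`)
	// Get plucks a single value by key path without materializing the
	// rest of the document -- great for one field, less so when every
	// value must be visited and converted.
	name, _, _, err := jsonparser.Get(data, "person", "name")
	if err != nil {
		panic(err)
	}
	fmt.Printf("name: %s\n", name)
}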

Other implementations use cgo – something we wanted to avoid if possible.

The library that seemed to hit all of our goals was jsoniter. Not only is jsoniter reasonably fast, it also exposes an Iterator API that allows us to build the final JSON object on the fly as we parse tokens from the input stream (a sketch of this single-pass style follows the results table below). Overall, this is an excellent library. We switched the internal implementation to use jsoniter, with excellent results (see https://github.com/cockroachdb/cockroach/pull/89117):

name old speed new speed delta
ParseJSON/escape=0.00%/128-32 30.6MB/s ± 4% 51.2MB/s ± 3% +67.60% (p=0.000 n=9+10)
ParseJSON/escape=0.00%/512-32 58.4MB/s ± 6% 117.5MB/s ± 1% +101.26% (p=0.000 n=10+10)
ParseJSON/escape=0.00%/4096-32 110MB/s ± 1% 344MB/s ± 3% +211.29% (p=0.000 n=8+10)
ParseJSON/escape=0.00%/16384-32 133MB/s ± 1% 532MB/s ± 3% +300.49% (p=0.000 n=9+10)
ParseJSON/escape=0.00%/65536-32 144MB/s ± 1% 628MB/s ± 4% +336.42% (p=0.000 n=9+10)
ParseJSON/escape=0.00%/262144-32 138MB/s ± 1% 710MB/s ± 1% +415.99% (p=0.000 n=8+10)
ParseJSON/escape=0.00%/524288-32 139MB/s ± 1% 669MB/s ± 7% +381.80% (p=0.000 n=9+10)
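
As promised, here is a rough sketch of the single-pass construction that the Iterator API enables. Building plain Go values here stands in for constructing CockroachDB’s internal JSON type; this is an illustration, not our actual implementation:

package main

import (
	"fmt"

	jsoniter "github.com/json-iterator/go"
)

// buildValue turns each token into the target representation as it is
// read -- no intermediate interface{} tree is ever materialized.
func buildValue(it *jsoniter.Iterator) interface{} {
	switch it.WhatIsNext() {
	case jsoniter.StringValue:
		return it.ReadString()
	case jsoniter.NumberValue:
		return it.ReadNumber() // json.Number preserves precision
	case jsoniter.BoolValue:
		return it.ReadBool()
	case jsoniter.NilValue:
		it.ReadNil()
		return nil
	case jsoniter.ArrayValue:
		var arr []interface{}
		it.ReadArrayCB(func(it *jsoniter.Iterator) bool {
			arr = append(arr, buildValue(it))
			return true
		})
		return arr
	case jsoniter.ObjectValue:
		obj := map[string]interface{}{}
		it.ReadObjectCB(func(it *jsoniter.Iterator, key string) bool {
			obj[key] = buildValue(it)
			return true
		})
		return obj
	default:
		return nil
	}
}

func main() {
	it := jsoniter.ParseString(jsoniter.ConfigDefault, `{"a": [1, 2, 3]}`)
	fmt.Println(buildValue(it))
}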

The improvements in throughput are great – more than fivefold on the largest inputs. The one area of concern was that, when dealing with data containing escaped characters, jsoniter’s gains were far more modest, and its allocation behavior appeared worse than the standard library’s:

name old speed new speed delta
ParseJSON/escape=3.00%/128-32 40.0MB/s ± 0% 41.0MB/s ± 0% +2.50% (p=0.000 n=10+10)
ParseJSON/escape=3.00%/512-32 41.0MB/s ± 0% 43.0MB/s ± 0% +4.88% (p=0.000 n=10+10)
ParseJSON/escape=3.00%/4096-32 42.0MB/s ± 0% 48.0MB/s ± 0% +14.29% (p=0.000 n=10+10)
ParseJSON/escape=3.00%/16384-32 43.0MB/s ± 0% 53.0MB/s ± 0% +23.26% (p=0.000 n=10+10)
ParseJSON/escape=3.00%/262144-32 46.0MB/s ± 0% 64.0MB/s ± 0% +39.13% (p=0.000 n=10+8)
ParseJSON/escape=3.00%/524288-32 45.7MB/s ± 2% 67.0MB/s ± 0% +46.61% (p=0.000 n=10+10)

We wanted to find a library that was faster and had better memory allocation behavior. Since jsoniter appeared to have worse allocation behavior than the standard Go library, we decided to continue our exploration.

Set the speed to 11

Still resisting the urge to rewrite, and still searching for a more memory-efficient implementation than jsoniter, we found a very interesting blog post on high-performance JSON parsing. It described a parser implementation that produced over 1GB/s of throughput while promising to “eliminate allocations through API design”. That sounded very intriguing, and almost too good to be true. The package described in that post, pkg/json, is listed as “not ready for production” – and for good reasons, which we’ll discuss below. However, it was worth a try, and initial benchmarks seemed to confirm that, as it says on the “sticker”, the package can deliver over 1GB/s of parsing throughput.

So, are we settled on a “no rewrite” approach? Not quite. There were many things missing from the pkg/json package. In particular, it did not handle escaped characters correctly, and it did not handle UTF-8 escape sequences at all. What it did have was an excellent API – and a great starting point for us to build upon.

It’s worth pausing for a second to appreciate the point about API design. This package’s choice to return a byte slice representing the next token in the input stream – as opposed to returning a json.Token, as encoding/json does – significantly improves allocation behavior: json.Token is an interface{}, which causes every returned token to escape to the heap (i.e. to cause an allocation). API choices matter. If the goal is to produce a high-performance library, you must think about these details.
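
The contrast is easy to see in code. The encoding/json half of this snippet is real; the []byte-returning scanner is shown only as a shape, since the exact pkg/json API is beside the point:

package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

func main() {
	dec := json.NewDecoder(strings.NewReader(`{"key": "value"}`))
	for {
		// Token returns a json.Token -- an interface{} -- so every
		// string and number token is boxed and escapes to the heap.
		tok, err := dec.Token()
		if err != nil {
			break
		}
		fmt.Printf("%T: %v\n", tok, tok)
	}
	// A pkg/json-style scanner instead hands back the next token as a
	// []byte window into its internal buffer, roughly:
	//
	//	func (s *Scanner) Next() []byte
	//
	// Nothing is boxed, so nothing escapes; the caller decides if and
	// when to copy.
}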

We decided to take the pkg/json package, fork it, add the missing functionality, and adapt it for our internal use. The results:

Max Len   Iterations   Time/op (ns)   MB/s      Avg. Input Len   Alloc B/op   Allocs/op
128       2529363      2206           76.15     168.8            969          26
512       2257388      2489           171.14    426.1            1229         26
4096      1139653      4688           623.04    2922             3916         26
16384     454411       11675          968.32    11305            12710        26
262144    36289        157838         1180.9    186392           191801       26
524288    19600        302933         1193.54   361565           366902       26

Across the board, this implementation used fewer, smaller allocations. Speed is set to 11 out of 10, so to speak.  

name old speed new speed delta
ParseJSON/escape=0.00%/128-32 30.6MB/s ± 4% 66.9MB/s ± 3% +118.89% (p=0.000 n=9+8)
ParseJSON/escape=0.00%/512-32 58.4MB/s ± 6% 153.6MB/s ± 1% +163.12% (p=0.000 n=10+8)
ParseJSON/escape=0.00%/4096-32 110MB/s ± 1% 501MB/s ± 2% +353.49% (p=0.000 n=8+8)
ParseJSON/escape=0.00%/16384-32 133MB/s ± 1% 769MB/s ± 2% +478.79% (p=0.000 n=9+9)
ParseJSON/escape=0.00%/262144-32 138MB/s ± 1% 1038MB/s ± 2% +654.61% (p=0.000 n=8+10)
ParseJSON/escape=0.00%/524288-32 139MB/s ± 1% 1110MB/s ± 0% +699.16% (p=0.000 n=9+8)

Parsing data without escape sequences is 8 times faster than encoding/json, and almost twice as fast as jsoniter.

What about “eliminating allocations through API design” – clearly, we still allocate? Well, these benchmarks ultimately produce a final, CockroachDB-specific JSON object, so we must perform some allocations; the underlying pkg/json library performs almost none of its own. Our allocated bytes now track the input size, while the number of allocations per op remains stable across the board. Allocation behavior is stable even when decoding data with 40% escape sequences:

Max Len   Iterations   Time/op (ns)   MB/s     Avg. Input Len   Alloc B/op   Allocs/op
128       2012869      2849           71.26    203.7            996          27
512       1396340      4113           126.67   521.7            1217         26
4096      285523       18243          215.37   3929             4010         26
16384     86448        62476          239.6    14970            12863        26
262144    5745         972994         249.86   243116           190634       27
524288    2439         2091165        232.39   485975           376054       26

The throughput on data containing up to 40% escaped characters is “only” two to three times that of the Go standard library. When it comes to allocation behavior, however, our implementation significantly outperforms both encoding/json and jsoniter. Why such a difference between escaped and unescaped data? Without escapes, we simply consume the input and translate it directly into the JSON object. With escape sequences present, we must copy data into a temporary buffer as we go along – and copying is not free. The sketch below shows the two paths.
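
Here is a sketch of that fast-path/slow-path split (hypothetical code, with most escape cases elided; CockroachDB’s actual tokenizer differs in detail):

package tokenizer

import (
	"bytes"
	"fmt"
)

// decodeString decodes a quoted JSON string token. If the token contains
// no backslash, the bytes between the quotes are returned as-is (direct
// translation, zero copy). Otherwise the escapes must be rewritten into a
// scratch buffer, byte by byte -- the copying that costs us throughput.
func decodeString(tok []byte, scratch *bytes.Buffer) ([]byte, error) {
	body := tok[1 : len(tok)-1] // strip the surrounding quotes
	if bytes.IndexByte(body, '\\') < 0 {
		return body, nil // fast path: no escapes, no copy
	}
	scratch.Reset()
	for i := 0; i < len(body); i++ {
		c := body[i]
		if c != '\\' {
			scratch.WriteByte(c)
			continue
		}
		i++
		if i == len(body) {
			return nil, fmt.Errorf("truncated escape sequence")
		}
		switch body[i] {
		case '"', '\\', '/':
			scratch.WriteByte(body[i])
		case 'n':
			scratch.WriteByte('\n')
		case 't':
			scratch.WriteByte('\t')
		// \b, \f, \r, and \uXXXX handling elided for brevity.
		default:
			return nil, fmt.Errorf("unsupported escape %q", body[i])
		}
	}
	return scratch.Bytes(), nil
}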

The blog post about building a high-performance JSON parser mentioned above is a must-read. It contains a treasure trove of information on building a high-performance parser, and we have used many of the techniques it describes.

For example, using a method expression to store the next state transition in the state machine results in around a 5% performance gain. Arguably not much, but 5% here, 5% there, and soon enough you’re talking about real gains. A sketch of the technique follows.
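
Here is a minimal, self-contained illustration of the method-expression trick (not CockroachDB’s actual tokenizer): the hot loop makes a single indirect call per step instead of switching on a state enum.

package main

import "fmt"

type tokenizer struct {
	steps int
	// state holds the next transition. A method expression such as
	// (*tokenizer).stateValue has type func(*tokenizer) error.
	state func(*tokenizer) error
}

func (t *tokenizer) stateValue() error {
	// ... consume a value token here ...
	t.state = (*tokenizer).stateComma // set the next transition
	return nil
}

func (t *tokenizer) stateComma() error {
	// ... expect ',' or a closing delimiter here ...
	t.state = (*tokenizer).stateValue
	return nil
}

func (t *tokenizer) run() error {
	t.state = (*tokenizer).stateValue
	for t.steps = 0; t.steps < 4; t.steps++ { // bounded for the demo
		if err := t.state(t); err != nil { // one indirect call per step
			return err
		}
	}
	return nil
}

func main() {
	fmt.Println((&tokenizer{}).run())
}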


Another valuable technique is to compile your code with escape-analysis output enabled to understand allocation behavior. Compiling Go code with the “-gcflags=-m=2” flag produces lots of interesting output indicating whether variables escape to the heap, and whether or not a function can be inlined (you can “grep” for specific files or functions).

For example, if we look for buffer.go, we see:

pkg/util/json/tokenizer/buffer.go:74:6: can inline grow with cost 51 as: func([]byte, int) []byte { if n < smallBufferSize { n = smallBufferSize }; if buf == nil { return make([]byte, n) }; c := len(buf) + n; if c < 2 * cap(buf) { c = 2 * cap(buf) }; nb := append(([]byte)(nil), make([]byte, c)...); copy(nb, buf); return nb[:cap(nb)] }
pkg/util/json/tokenizer/buffer.go:36:6: can inline (*buffer).AppendByte with cost 79 as: method(*buffer) func(byte) { if cap(b.buf) - b.l <= 1 { b.buf = grow(b.buf, 1) }; b.buf[b.l] = c; b.l += 1 }
pkg/util/json/tokenizer/buffer.go:38:15: inlining call to grow

We can see that some of the functions called in a hot loop can be inlined (grow costs 51, and AppendByte costs 79; the compiler’s current threshold for inlining is a cost of 80). I do not recommend hunting for inlining optimizations as a general rule of thumb, but when they work, the performance improvements can be noticeable. Sometimes a small tweak makes the difference between a hot function being inlinable or not – AppendByte stays under the threshold precisely because its rare, expensive path is outlined into grow, as the reconstruction below shows.
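
Reconstructed from the compiler output above (the value of smallBufferSize is an assumption; the dump does not show it):

package tokenizer

const smallBufferSize = 64 // assumed; elided from the escape-analysis dump

type buffer struct {
	buf []byte
	l   int // number of valid bytes in buf
}

// AppendByte inlines at call sites (cost 79, threshold 80) because the
// slow path lives out of line in grow: the common case is just a
// capacity check plus a store.
func (b *buffer) AppendByte(c byte) {
	if cap(b.buf)-b.l <= 1 {
		b.buf = grow(b.buf, 1)
	}
	b.buf[b.l] = c
	b.l++
}

// grow returns a buffer at least n bytes larger than buf, doubling
// capacity to amortize future appends (cost 51, also inlinable).
func grow(buf []byte, n int) []byte {
	if n < smallBufferSize {
		n = smallBufferSize
	}
	if buf == nil {
		return make([]byte, n)
	}
	c := len(buf) + n
	if c < 2*cap(buf) {
		c = 2 * cap(buf)
	}
	nb := append([]byte(nil), make([]byte, c)...)
	copy(nb, buf)
	return nb[:cap(nb)]
}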

Final Thoughts

This has been a fun optimization project. Open source offers many alternatives to almost any problem – explore what’s out there before jumping into your own implementation. It is important to understand the reason for attempting a particular optimization, and it is equally important to have clear goals in mind. A word of caution: do not forget about the reader of your code. A modest optimization that results in incomprehensible code should be avoided when possible.

Final, final thought: we might be able to contribute our changes back to the pkg/json package so that you all can benefit.

Huge thanks to all of the authors who wrote open source packages and blog posts mentioned here.
