Improving unordered distinct efficiency in the vectorized SQL engine

Last edited on July 9, 2020

0 minute read

    For the past four months, I’ve worked as a backend engineering intern with the incredibly talented engineers of the Cockroach Labs SQL Execution team to unlock the full potential of the vectorized engine ahead of the CockroachDB 20.1 release. During this time, I focused on improving the algorithm behind some of CockroachDB’s SQL operators, to reduce memory usage and speed up query execution. In this blog post, I will go over the improvements that I made to the algorithm behind the unordered distinct operator, as well as new algorithms that address some existing inefficiencies.

    In CockroachDB’s 20.1 release, we removed the experimental flag on the vectorized execution engine. If the estimated number of rows scanned is greater than a certain threshold (1000 rows, by default), CockroachDB runs a subset of queries through the vectorized engine. Users can also use the vectorized engine to execute all supported queries using the following command in their SQL shell:

    set vectorize=on;

    With this setting, CockroachDB will use the vectorized engine by default and only fall back to the old engine if it encounters unsupported queries.

    The Vectorized Execution EngineCopy Icon

    The vectorized execution engine is an interesting piece of technology that significantly speeds up SQL queries. In our early benchmarks, the new engine can be up to 40x faster than our existing row-at-a-time SQL engine. You can read more about how we built the vectorized engine in How We Built a Vectorized Execution Engine.

    When the vectorized engine finishes fetching all tuples into memory, it stores all of them in a columnar memory layout. This enables us to build SQL operators that are friendly to modern CPU caching architectures and branch predictors, and can greatly outperform our existing row-at-a time engine. However, these improvements have come at a cost. Columnar memory layout is great when it comes to speeding up operations, but sometimes it makes it difficult to implement complex operations, like SQL joins or groupings. Also, we need to be careful with our implementation such that we don't take any shortcuts, which can cause the vectorized engine to essentially perform like the old row-at-a-time engine. For example, if the logic within a tight loop is very complicated or involves expensive operations, such as having too much branching and on-the-fly memory allocation, it can offset the benefits of having a columnar memory layout.

    In order to understand the concepts presented in this blog, we recommend you read our previous blog post, 40x faster hash joiner with vectorized execution.

    Unordered DistinctCopy Icon

    Unordered distinct is a SQL operator that performs distinct operations on the given columns of the input. It removes the tuples that have duplicate values on the distinct columns from the query result. As its name suggests, it is designed to work with inputs where the distinct columns are unordered. The unordered distinct operator needs to keep track of all distinct tuples that it has come across throughout its lifetime.

    Naively, the operator can store all unique tuples inside an in-memory key-value dictionary. For all subsequent tuples it encounters, the operator only needs to perform a simple lookup in the dictionary to determine if this tuple has already been seen. This is because within each iteration of the loop, the operator needs to perform multiple steps of computation:

    1. Hash the key columns of the input tuples.

    2. Perform a dictionary lookup using the tuple's hash.

    3. If the same hash key exists, perform an actual comparison between the new tuple and the tuple with the same hash key in the dictionary to check if it is a hash collision.

    4. Perform a dictionary insertion if the new tuple is actually unique.

    Each iteration involves a hash computation, multiple branching and value comparisons (which can be very expensive, depending on the SQL data types), and insertion (which can lead to memory allocation and copying). The point of the vectorized engine is to reduce the one big loop into multiple small loops, with very simple operations inside the loop bodies. This can improve the program's cache friendliness and reduce branch misprediction inside the CPU.

    As of CockroachDB 19.2, the vectorized unordered distinct operator was implemented using our own vectorized hash table. Its implementation was based on Marcin Zukowski's paper "Balancing Vectorized Query Execution with Bandwidth-Optimized Storage". The vectorized hash table was the key to our implementation of an efficient vectorized hash joiner.

    Because the original vectorized hash table was designed for the hash join operator, it stores a lot of extra information that is not necessary for the unordered distinct operator. For example, in order to implement a hash join, the hash table needs to store all tuples, not just distinct tuples. In addition to storing all tuples, the hash table needs to build a linked list for each bucket to efficiently traverse through collisions. In the unordered distinct operator, none of those auxiliary data structures and computation are required. The only thing we care about is the first distinct tuple we encounter. As a result, we can discard all subsequent tuples that are identical to the first one, significantly improving the performance of the unordered distinct operator.

    The Vectorized Hash TableCopy Icon

    CockroachDB separates the construction of a hash table into multiple stages. In the first stage, it consumes all of its input and buffers all tuples in memory. It then hashes each buffered tuple, and groups tuples that have the same hash values into buckets. For each bucket, the hash table constructs a linked list in order to traverse through all of the tuples. Let’s call this linked list a hash chain.

    After hash chains for each bucket are constructed, the hash table traverses through each bucket to check if there are any hash collisions. For each bucket, the hash table creates a “same” linked list for each distinct value inside. The idea of this new linked list is similar to the hash chain. The only difference is the hash chain enables us to traverse through all tuples in each bucket that have the same hash value, and the “same” linked list allows us to traverse through all identical tuples. At this point, the construction of the hash table is complete.

    To better illustrate the algorithm, let’s look at an example. Consider an input table with three columns: a, b and c. Suppose that each input tuple is hashed to either 1 or 2, using some hash function, and a hash collision occurs.

    Input: hash buffer: +---+----+-----+ +-------+------+ | a | b | c | | index | hash | +---+----+-----+ +-------+------+ | 2 |null|"2.0"| | 0 | 0 | <- hash collision \ +---+----+-----+ +-------+------+ * | 2 |1.0 |"1.0"| | 1 | 1 | | +---+----+-----+ +-------+------+ | | 2 |null|"2.0"| | 2 | 0 | <- hash collision | +---+----+-----+. -----> +-------+------+ * | 2 |2.0 |"2.0"| | 3 | 0 | <- hash collision | +---+----+-----+ +-------+------+ * | 2 |null|"2.0"| | 4 | 0 | <- hash collision / +---+----+-----+ +-------+------+ | 2 |1.0 |"1.0"| | 5 | 1 | +---+----+-----+ +-------+------+

    The hash buffer is then used to construct the hash chains for each bucket. We represent all hash chains in the hash table using two arrays,“first” and “next”. Each entry in the “first” array maps from a hash value to an entry in the “next” array. This entry in the “next” array represents the beginning of the hash chain linked list. We use a value of 0 to represent the end of the linked list.

    first: next: +------+-------+ +-------+------+ | hash | keyID | | keyID | next | +------+-------+ +-------+------+ | 0 | 5 | | 0 | N/A | <- not used, reserved +------+-------+ +-------+------+ | 1 | 6 | | 1 | 0 | <- tail of hash chain of +------+-------+ +-------+------+ hash value 0 | 2 | 0 | <- tail of hash chain of +-------+------+ hash value 1 | 3 | 1 | +-------+------+ | 4 | 3 | +-------+------+ | 5 | 4 | <- head of hash chain of +-------+------+ hash value 0 | 6 | 2 | <- head of hash chain of +-------+------+ hash value 1

    The zeroth element of the “next” array is not used. As a result, we introduce a “keyID” column computed from the index. To get the value for “keyID”, we shift the index values of each entry by one. So, keyID = index + 1. The value of the “next” array entry simply points to the key ID of the next entry in the same hash chain linked list.

    If we rewrite the result from the “first” and “next” array in a more readable form, we have this:

    +------------+---------------------------------------+ | hash value | keyIDs for the hash chain linked list | +------------+---------------------------------------+ | 0 | 5 -> 4 -> 3 -> 1 | +------------+---------------------------------------+ | 1 | 6 -> 2 | +------------+---------------------------------------+

    Note that for each hash chain, the key ID values are in a monotonically decreasing order. This property is not important for the hash joiner, but it can cause some trouble for unordered distinct. We will revisit this later in the blog.

    The next step of the algorithm is to probe through each hash chain we have just built, and resolve any hash collisions. The hash table goes through every single tuple that it has buffered so far. For each tuple, it calculates its hash value and looks up the hash chain corresponding to this hash value. Once the corresponding hash chain is found, each tuple compares itself with the head of the hash chain. If their values are different, we have encountered a hash collision, since different values are hashed to the same hash value. For more details on how we implemented this step, see the blog post on hash joiners. The result from the hash collision detection stage is also a set of linked lists similar to the ones we used to store hash chains. Conceptually, this looks like the following:

    +-------------------------------+---------------------------------------+ | keyID of the linked list head | linked list of keyIDs with same value | +-------------------------------+---------------------------------------+ | 5 | 5 -> 4 -> 1 | +-------------------------------+---------------------------------------+ | 3 | 3 | +-------------------------------+---------------------------------------+ | 6 | 6 -> 2 | +-------------------------------+---------------------------------------+

    Note that this is different from the hash chain we mentioned earlier. In hash chains, we use linked lists to point to all tuples that have the same hash value. Now, the linked list contains tuples whose values are equal. While this information is important for the hash joiner in order to ensure the correctness of the output, it isn’t very useful for unordered distinct. As a result, when we find the head of the linked list, we can skip building out the linked list. This simple optimization actually yields about 35 ~ 40% of increase of performance for the unordered distinct operator.

    name old speed new speed delta numCols=1/nulls=false/numBatches=1-8 128MB/s 181MB/s +41.41% numCols=1/nulls=false/numBatches=16-8 107MB/s 145MB/s +35.51% numCols=1/nulls=false/numBatches=256-8 45.1MB/s 82.6MB/s +83.15% numCols=1/nulls=true/numBatches=1-8 71.5MB/s 112.1MB/s +56.78% numCols=1/nulls=true/numBatches=16-8 65.6MB/s 98.9MB/s +50.76% numCols=1/nulls=true/numBatches=256-8 22.8MB/s 41.5MB/s +82.02% numCols=2/nulls=false/numBatches=16-8 146MB/s 190MB/s +30.14% numCols=2/nulls=false/numBatches=256-8 50.5MB/s 103.1MB/s +104.16% numCols=2/nulls=true/numBatches=1-8 97.2MB/s 131.6MB/s +35.39% numCols=2/nulls=true/numBatches=16-8 82.0MB/s 111.2MB/s +35.61%

    Can we do even better than this? Of course!

    Recall that in the initial stage of building the vectorized hash table, the engine buffers all tuples into memory first, and then proceeds to perform subsequent computations. This was designed so that the hash table has all the information it needs for the hash joiner. However, the unordered distinct operator only needs the distinct tuples. This is key to our second optimization: we only need to store the distinct tuples in the hash table! This optimization strategy also limits the memory consumption of unordered distinct to be proportional to the number of distinct tuples in the input instead of the total number of tuples in the input. Also, since we only buffer distinct tuples into memory, we reduce the total number of comparisons to perform throughout the entire algorithm.

    Vectorized Hash Table Distinct ModeCopy Icon

    After the unordered distinct operator finishes constructing the hash table, the operator goes through every single “same” linked list and copies the head of each linked list into its internal buffer. When the operator finishes processing all “same” linked lists, it outputs all the buffered tuples it has copied from the hash table. Observe that as the operator goes through each “same” linked list and copies only the head of each linked list, the memory access pattern is random instead of sequential. This access pattern is quite unfriendly to the cache inside the CPU, meaning the operator will have a large number of cache misses as it goes through each “same” linked list. We can improve this by adding a distinct mode to the hash table to avoid this random access pattern. This means that the output of the hash table should only contain the distinct tuples and the operator can simply sequentially copy it to its internal buffer.

    In order to fully implement the distinct mode in the vectorized hash table, we need to modify the algorithm that builds the hash table. First, since the main goal is to eliminate the need to buffer the entire input into memory, we shouldn't consume the entire input and then buffer all tuples in memory in the beginning. Instead, we can build the hash table incrementally. This is quite simple to implement. When we fetch a new batch of tuples, we first perform deduplication within that batch. After all duplicate tuples are removed from the batch, we check the remaining tuples within that batch against the already-buffered tuples to see if there are any additional duplicate tuples. Since we only insert batches that strictly contain unique tuples, we can be sure that there are no duplicate tuples within the buffered tuples. When the second duplication check is completed, we can simply append all tuples remaining in the new batch directly into the hash table, and the uniqueness invariant will still hold. This approach is reflected in the following high-level code snippet:

    func build() hashTable { ht := hashtable{} for { batch := input.Next(ctx) if batch.Length() == 0 { break } buildHashChains(batch) // Removes the duplicate tuples within the new batch. removeDuplicatedTuples(batch, batch) // Removes the tuples from the new batch that already have duplicates // in the hash table. removeDuplicatedTuples(batch, ht.bufferedTuples) ht.bufferedTuples.Append(batch) } return ht }

    Now, if you are familiar with the hash collision handling scheme we implemented in our previous blog post, you will realize that the process of de-duplication and checking for hash collision is quite similar. In fact, if you squint hard enough at the code, they are almost identical, with some slight differences.

    There are two phases to how we perform deduplication. In the first phase, we remove duplicate tuples within the batch itself. For each tuple, we traverse through the hash chain that has been built previously. As soon as it finds the first identical tuple in the hash chain, we stop the linked list traversal. The algorithm itself is quite straightforward. However, if you look closely, you will notice that this optimization cannot completely guarantee the correctness of the algorithm.

    We mentioned previously that once we build the hash chain, the values of the keyIDs inside the hash chain will be in a monotonically decreasing order. In other words, the head of the hash chain linked list is going to be the tuple with the largest keyID. The deduplication algorithm immediately stops the hash chain traversal when it encounters a tuple that has the identical value. That tuple is emitted as the distinct tuple. Because the order of the hash chain is reversed, the head (i.e., the last occurrence of the tuple inside a batch) will be the one that gets emitted. Recall the previous example:

    first: next: +------+-------+ +-------+------+ | hash | keyID | | keyID | next | +------+-------+ +-------+------+ | 0 | 5 | | 0 | N/A | <- not use, reserved +------+-------+ +-------+------+ | 1 | 6 | | 1 | 0 | <- tail of hash chain of +------+-------+ +-------+------+ hash value 0 | 2 | 0 | <- tail of hash chain of +-------+------+ hash value 1 | 3 | 1 | +-------+------+ | 4 | 3 | +-------+------+ | 5 | 4 | <- head of hash chain of +-------+------+ hash value 0 | 6 | 2 | <- head of hash chain of +-------+------+ hash value 1

    In this case, because we start probing from keyID=5 and keyID=6, we will eventually emit the following keyIDs: 5, 6 and 4. However, inside CockroachDB's SQL engine, different SQL operators can be used as building blocks for more complicated queries. Making the unordered distinct operator deterministically output the first distinct tuple it encounters makes things even easier for SQL operators and the SQL planner.

    The solution to this problem is essentially to reverse the order of the linked list. Since the deduplication process starts with the head of the hash chain, if we can ensure the value of keyIDs within the hash chain is in monotonically increasing order, we can guarantee that the hash table will emit the first tuple it encountered. As the result, we have the following result:

    first: next: +------+-------+ +-------+------+ | hash | keyID | | keyID | next | +------+-------+ +-------+------+ | 0 | 1 | | 0 | N/A | <- not use, reserved +------+-------+ +-------+------+ | 1 | 2 | | 1 | 3 | <- head of hash chain of +------+-------+ +-------+------+ hash value 0 | 2 | 6 | <- head of hash chain of +-------+------+ hash value 1 | 3 | 4 | +-------+------+

    Here's the benchmark for the unordered distinct operator with updated algorithm:

    name old speed new speed delta newTupleProbability=0.001/rows=4096/cols=2 154MB/s 156MB/s ~ newTupleProbability=0.001/rows=4096/cols=4 219MB/s 236MB/s ~ newTupleProbability=0.001/rows=65536/cols=2 136MB/s 384MB/s +181.65% newTupleProbability=0.001/rows=65536/cols=4 245MB/s 518MB/s +111.55% newTupleProbability=0.010/rows=4096/cols=2 154MB/s 155MB/s ~ newTupleProbability=0.010/rows=4096/cols=4 214MB/s 249MB/s +16.29% newTupleProbability=0.010/rows=65536/cols=2 181MB/s 373MB/s +106.24% newTupleProbability=0.010/rows=65536/cols=4 225MB/s 504MB/s +123.53% newTupleProbability=0.100/rows=4096/cols=2 150MB/s 147MB/s ~ newTupleProbability=0.100/rows=4096/cols=4 207MB/s 204MB/s ~ newTupleProbability=0.100/rows=65536/cols=2 165MB/s 301MB/s +82.26% newTupleProbability=0.100/rows=65536/cols=4 201MB/s 398MB/s +97.83% newTupleProbability=0.001/rows=4096/cols=2 113MB/s 122MB/s +8.02% newTupleProbability=0.001/rows=4096/cols=4 141MB/s 168MB/s +19.43% newTupleProbability=0.001/rows=65536/cols=2 138MB/s 236MB/s +71.44% newTupleProbability=0.001/rows=65536/cols=4 156MB/s 282MB/s +80.65% newTupleProbability=0.010/rows=4096/cols=2 109MB/s 117MB/s +6.74% newTupleProbability=0.010/rows=4096/cols=4 143MB/s 164MB/s +14.59% newTupleProbability=0.010/rows=65536/cols=2 100MB/s 226MB/s +125.29% newTupleProbability=0.010/rows=65536/cols=4 142MB/s 272MB/s +91.53% newTupleProbability=0.100/rows=4096/cols=2 108MB/s 112MB/s +3.43% newTupleProbability=0.100/rows=4096/cols=4 132MB/s 150MB/s +13.62% newTupleProbability=0.100/rows=65536/cols=2 107MB/s 191MB/s +78.89% newTupleProbability=0.100/rows=65536/cols=4 125MB/s 237MB/s +89.61%

    We can see that depending on the characteristics of the input data, we can have a performance increase of up to 181% on top of our previous performance boost of ~35% to ~40%. Also, we see a drastic drop in the memory footprint of the unordered distinct operator:

    name old alloc/op new alloc/op delta newTupleProbability=0.001/rows=65536/cols=2 8.17MB 1.17MB -85.68% newTupleProbability=0.001/rows=65536/cols=4 14.0MB 1.2MB -91.34% newTupleProbability=0.010/rows=65536/cols=2 8.17MB 1.20MB -85.29% newTupleProbability=0.010/rows=65536/cols=4 14.0MB 1.3MB -90.92% newTupleProbability=0.100/rows=65536/cols=2 8.17MB 1.77MB -78.27% newTupleProbability=0.100/rows=65536/cols=4 4.0MB 2.2MB -83.99%

    I hope you enjoyed learning how our vectorized hash table works under the hood and how we continuously evolve our algorithms and implementations. What I presented in this blog post is only the tip of the iceberg of what we have worked on for the vectorized engine in the CockroachDB 20.1 release. I am very grateful for the opportunity to work with the absolute brilliant engineers at Cockroach Labs and I want to really thank my mentor Alfonso, my manager Jordan and my teammate Yahor for this amazing internship. Thank you Cockroach Labs, it has been a great journey.

    If you're interested in working at Cockroach Labs, we have both internships and full-time positions available.

    intern projects
    vectorized engine
    hash tables