Data cardinality in a distributed DB: Lessons from a betting app

How can you get great performance out of your database when huge numbers of users are going to be submitting queries within the space of a few minutes or seconds?

That’s the challenge faced by Ultimate Tournament, a turnkey eSports betting platform that supports (among other things) parimutuel betting on eSports matches.

In parimutuel betting, bets on a match are pooled and the final payout odds are based on the pool totals for each outcome. Because these totals are continually changing as bets come in, the probable payout odds are also changing. Many betters wait until the last possible second to submit their bet so that they can see odds that are based on near-final betting pool totals.

It’s a betting strategy that makes sense, but one that can put a ton of pressure on the database, which will have to deal with a large number of incoming reads and writes in a very short timeframe. The larger the scale of the betting application, the greater the potential for degraded database performance.

Ultimate Tournament’s platform is built to support tremendous scale. So how can they ensure optimal performance? With a combination of the right database – CockroachDB a, distributed SQL database – and schema that’s designed to take advantage of that distributed architecture.

There’s a lot of complexity there, but one factor that’s often important – and that’s been important for Ultimate Tournament – is cardinality.

What is cardinality (in a database)?

In the context of a database, cardinality is a measure of the uniqueness of values in the data. Low cardinality means few unique values; high cardinality means many unique values.

For example, imagine we have a very simple table, cars, that stores data about the make (i.e. brand) and VIN number of various vehicles:

If we imagine this table has 100,000 rows, we know that there will be 100,000 unique values in the vin column (because VINs are unique, no two vehicles have the same VIN). But we can guess that there will be a much smaller number of unique values in the make column, as the most popular auto makes will appear over and over again in the table.

In this example, the make column has comparatively low cardinality, and the vin column has very high cardinality.

In the real world, though, we’re typically concerned with the cardinality of an entire table, or an index (which may combine data from multiple columns), or the results of a query. Calculating cardinality accurately involves getting into set theory, although we can sometimes get a decent approximation by multiplying the total number of unique values in each column.

(Typically, the database includes functions or tools to calculate the cardinality for us, though, so there’s no need to do the math for ourselves).

Cardinality in distributed databases

In the world of databases, cardinality is often discussed in the context of indexing and query performance. Often, we may be trying to reduce cardinality in order to improve query performance.

In a distributed database, however, there are additional considerations.

There’s more detail in CockroachDB’s architecture documentation, but to summarize: a distributed SQL database stores data in chunks called ranges. In CockroachDB, ranges are replicated at least three times across at least three different database nodes. Each range has a “leaseholder” replica/node that receives and coordinates the reads and writes for data on that range.

This matters because CockroachDB stores primary keys in a sorted order. The more similar these keys are to each other, the more likely they are to be stored on the same range, which would mean they have the same leaseholder, and thus reads/writes are being processed by the same node.

Conversely, dissimilar primary keys are more likely to be stored on different ranges. This is not a bad thing! In fact, CockroachDB recommends using UUIDs in many circumstances because their randomness ensures that as new rows are entered, they’re spread across different ranges (and thus different nodes) so that a single node isn’t having to do all of the processing.

At the same time, though, very high cardinality data still needs to be properly organized to ensure that it doesn’t damage insert rates and query performance, even when it’s spread across multiple nodes in a distributed system. If your application is running full table scans, that can impact performance across all of your nodes at once – and the higher the cardinality, the worse the impact will be.

Real-world example: optimizing for performance with Ultimate Tournament

Of course, cardinality is just one issue to consider when optimizing a distributed database for maximum performance at scale. In the video below, Ultimate Tournament CTO and co-founder Dan Goodman talks about the high cardinality of their parimutuel betting data, and how that factors into performance:

To summarize, Goodman set a couple of ironclad pillars of development for the Ultimate Tournament team: no full table scans, and no hot keys or hot rows (rows that are frequently added, deleted, modified, or queried). Each bet “slip” transaction includes the UUID for the relevant betting pool as well as a UUID for the individual bet, and the team has carefully designed its keys and indexes to avoid creating hot rows.

For example, with parimutuel betting, it’s critical that people can only bet on betting pools that are still open. Having many bets coming in at the last minute can lead to hot rows, so Ultimate Tournament created multiple indexes for the same table and “pinned” different indexes to different regions to spread out the processing requirements across multiple nodes.

But of course, that’s really just the tip of the iceberg. To learn more about how Ultimate Tournament used CockroachDB to build a massive-scale eSports betting platform, check out the full case study.