A recent tweet inspired me to address the need for fuzzy matching by combining some existing capabilities of CockroachDB. Note the key features mentioned in the tweet:
similar but not equal sporting events names: a common pattern. Users tend to mis-type data in input fields, and data isn’t always correct. Nevertheless, we’d like to return the closest match.
I’d rather use this in-built feature than pay for a whole ES cluster with added maintenance overhead to boot: This is the second time I’ve heard this sentiment in the past couple of months. ES is a full-featured search engine and delivers a great experience but, for this purpose, would be overkill and would require additional time and expense to deploy and operate.
What I found interesting about this exercise is that this pattern of configuring a changefeed to route events through an external system, then back into the database has enormous potential, far beyond the fuzzy matching example that you’ll read about below. It seems that development practices have evolved away from burying logic in the database and towards coding it in the languages most applicable to the problem; the emergence of microservices aligns with this trend. That observation, combined with a new feature of being able to define a changefeed on a specific column family, has me convinced we’ll see some very interesting applications of this pattern with CockroachDB.
Now I’ll get off of my soapbox and into the demo!
Demo Background
To begin with, CockroachDB isn’t a fork of PostgreSQL, so you don’t simply “bolt on” the usual Postgres extensions such as pg_trgm
, the module mentioned in the tweet. But this is a small obstacle and, in fact, offers the opportunity to demonstrate an up-and-coming feature of CockroachDB Enterprise Changefeeds, aka “CDC”: the ability to configure a changefeed on a specific column family. This feature is available in the v22.1.0-beta.1
version I am using here – this release should be generally available by mid-May 2022.
The fuzzy matching experiment
Desired state
We do INSERT, UPDATE, DELETE on data about sports teams and would like to retrieve this data based on queries where we may misspell names. I confess to having misread the tweet, interpreting it as pertaining to sports team names as opposed to sport events names, so what I show here uses team name data, but I think it’s easy to apply to event names as well, so long as you have access to that data.
If we envision the requirement arises out of an application’s need to perform this type of matching, we can build a simple REST app which will:
Provide a webhook endpoint to which CockroachDB changefeeds will send events
Provide a REST endpoint for search, returning a ranked list of the top-N closest team name matches
DDL for the table
Here’s the DDL for the table. The FAMILY
… parts of this enable the changefeed to generate events based solely on FAMILY f1
, ignoring the resulting changes to FAMILY f2
, which is what we need:
CREATE TABLE teams
(
id UUID PRIMARY KEY DEFAULT GEN_RANDOM_UUID()
, name TEXT NOT NULL
, grams TEXT[]
, FAMILY f1 (id, name)
, FAMILY f2 (grams)
);
And here’s the inverted index on the grams
column:
CREATE INDEX ON teams USING GIN (grams);
REST app: CDC / indexing
Now, we need the REST app. I use Python for this since it’s concise and easy to get up and running, and the Flask module works very well for REST apps. And, I had some code fragments hanging around which I could reuse pretty easily. The entire Python script is here, but I’ll focus on interesting aspects below.
Generating n-grams from strings:
def get_ngrams(s, n=3):
return [s[i:i+n] for i in range(len(s) - n+1)]
For the input la galaxy
, the return value is ['la ', 'a g', ' ga', 'gal', 'ala', 'lax', 'axy']
. The default value of n
is 3, which aligns with the way pg_trgm
works.
A couple of things to point out:
The text is lower cased by a separate Python function. This is common in information retrieval applications, since comparisons are typically done without regard to case.
The ngrams show the result of sliding a window across the string, from left to right, yielding 3-character sequences which can span the space between words. This latter aspect factors in the adjacent words, so phrases are scored appropriately.
Webhook endpoint for the changefeed:
@app.route('/cdc', methods = ['POST', 'GET'])
def cdc_webhook():
obj = request.get_json(force=True)
print("CDC: " + json.dumps(obj)) # DEBUG
for o in obj["payload"]:
if o["after"] is None:
pass # Nothing to be done here
else:
pk = o["after"]["id"]
name = o["after"]["name"]
index_string(pk, name)
return "OK", 200
The changefeed will send an HTTP POST to this /cdc
endpoint, as configured via the SQL statement shown below. When that happens, JSON is accessible by the call to request.get_json(force=True)
, and that JSON contains the current values of the id
and name
columns from the teams
table.
Those values are passed to the
index_string(pk, name)
function, which just updates thegrams
column of theteams
table:
def index_string(pk, content):
ng = tokenize(content)
stmt = text("UPDATE teams SET grams = :grams WHERE id = :pk").bindparams(grams=ng, pk=pk)
run_statement(stmt)
I won’t get into the run_statement(stmt)
function here, but will instead refer interested readers to the code itself.
Here is the SQL statement we need to run to configure the changefeed, tying all of this together:
CREATE CHANGEFEED FOR TABLE teams FAMILY "f1"
INTO 'webhook-https://localhost:18080/cdc?insecure_tls_skip_verify=true'
WITH updated, full_table_name, topic_in_value;
Note that this operation requires an Enterprise License, but is also available in the single node “demo” mode (see below).
REST app: search / fuzzy matching
A REST client can retrieve fuzzy matches for a given team name by doing something equivalent to this (this example was run on my MacBook; the base64
command works differently here than it does on Linux; pretty_print_json.py
is here):
$ name="PA Galuxy"; time curl -k -s https://localhost:18080/search/$( echo -n $name | base64 )/5 | pretty_print_json.py
[
{
"name": "LA Galaxy",
"pk": "15e240a7-d1db-4b77-b454-c895a11610bf",
"score": "42.8571"
},
{
"name": "LA Galaxy II",
"pk": "85b5b97a-9a6e-4cef-b63c-7cbe123eca07",
"score": "10.7143"
},
{
"name": "LA Giltinis",
"pk": "82b96763-6836-4ab8-84ad-1864a1f3e16d",
"score": "4.7619"
},
{
"name": "Tampa Mayhem",
"pk": "6f5fd5e0-6654-4385-9bd0-c191f4f1c5b4",
"score": "3.5714"
},
{
"name": "Tampa Tarpons",
"pk": "1851c8e9-77e8-456b-a197-6aaed971942a",
"score": "2.8571"
}
]
real 0m0.196s
user 0m0.063s
sys 0m0.043s
Before getting into the Python code for the /search
endpoint, the following are worthy of mention:
I deliberately misspelled the name of the team; “LA Galaxy” was the one I wanted.
Even though the initial character, the ‘P’ in “PA Galuxy”, was incorrect, the results were correct. This is one of the essential features of n-gram based matching – you don’t rely on the leading characters in the string to match.
On to the Python part of this interaction:
@app.route("/search/<q_base_64>/<int:limit>")
def do_search(q_base_64, limit):
query_str = decode(q_base_64)
logging.info("Query: {}".format(query_str))
ng = tokenize(query_str)
logging.info("Query (n-grams): {}".format(ng))
sql = """
WITH qbool AS
(
SELECT id, grams, 1 + ABS(ARRAY_LENGTH(grams, 1) - ARRAY_LENGTH(CAST(:ngrams AS TEXT[]), 1)) delta
FROM teams
WHERE grams && CAST(:ngrams AS TEXT[])
), qscore AS
(
SELECT id, COUNT(*) n FROM
(
SELECT id, UNNEST(grams) FROM qbool
INTERSECT
SELECT id, UNNEST(CAST(:ngrams AS TEXT[])) FROM qbool
)
GROUP BY id
)
SELECT qbool.id, t.name, 100*n/delta score
FROM qbool, qscore, teams t
WHERE qbool.id = qscore.id AND t.id = qbool.id
ORDER BY score DESC
LIMIT :max_rows;
"""
stmt = text(sql).bindparams(ngrams=ng, max_rows=limit)
rv = []
for row in run_statement(stmt, True, False):
pk = str(row[0])
name = str(row[1])
score = float(row[2]/len(ng))
d = {}
(d["pk"], d["name"], d["score"]) = (pk, name, '{:.4f}'.format(score))
rv.append(d)
return Response(json.dumps(rv), status=200, mimetype="application/json")
There’s a fair bit going on here, mostly within the SQL expression. I’ll do my best to narrate that:
There are two common table expressions (CTEs) which handle different aspects
qbool
handles the boolean nature: is this row a match or not? It also provides an additional scoring input, which is the difference in the length of the provided query string and the actual value in thegrams
column. This is used to, for example, boost the score for the row containing “LA Galaxy” relative to a row containing “LA Galaxy II”.The query predicate,
WHERE grams && CAST(:ngrams AS TEXT[])
, incorporates the&&
(array overlap) operator. Ideally, this operation would use the GIN index; this pull request addresses that, has been merged, so is currently available on main. It will be included in a future release – stay tuned.qscore
just uses the results ofqbool
to determine a score based on the number of overlapping n-grams between the matching row and the query.Finally, the results of these CTEs are combined with the
name
column from theteams
table to generate an ordered result to return to the client in JSON format (see the above example).
How to do fuzzy matching with CockroachDB
Here are some notes on how to replicate what I’ve shown above. Given I’ve already done this, it’s possible I’ll leave out some steps, so please let me know (GitHub issue is probably the simplest way). I’m going to illustrate this using an Ubuntu VM I’ve just deployed in Google Cloud Platform.
Step 1: Install some prereq’s (I like to use the
psql
CLI):
sudo apt update
sudo apt install postgresql-client-common -y
sudo apt install postgresql-client -y
sudo apt install python3-pip -y
sudo apt install libpq-dev -y
pip3 install -r requirements.txt
Step 2: Deploy the latest CockroachDB binary (I’ll use the beta since 22.1 hasn’t yet shipped):
$ curl https://binaries.cockroachdb.com/cockroach-v22.1.0-beta.1.linux-amd64.tgz | tar xzvf -
Step 3: Start “demo” mode (keep this terminal open to the CLI):
$ ./cockroach-v22.1.0-beta.1.linux-amd64/cockroach demo
In the output, you’ll see a line like the following, which you’ll use in a couple of contexts:
# (sql)
postgresql://demo:demo15932@127.0.0.1:26257/movr?sslmode=require
Step 4: In a new shell, clone this GitHub repo:
$ git clone https://github.com/mgoddard/hot-fuzz.git
Step 5: Make that repo your current working directory (the VM’s hostname is also
hot-fuzz
):
$ cd hot-fuzz/
$ ls
LICENSE README.md images prep_teams_data.pl pretty_print_json.py trigrams.py
Step 6: Start up the Python Flask REST app:
$ export DB_CONN_STR="postgresql://demo:demo15932@127.0.0.1:26257/movr?sslmode=require"
$ ./trigrams.py
Step 7: Back in the terminal with the CLI, run the DDL, enable and then set up the changefeed:
demo@127.0.0.1:26257/movr> CREATE TABLE teams
(
id UUID PRIMARY KEY DEFAULT GEN_RANDOM_UUID()
, name TEXT NOT NULL
, grams TEXT[]
, FAMILY f1 (id, name)
, FAMILY f2 (grams)
);
CREATE INDEX ON teams USING GIN (grams);
demo@127.0.0.1:26257/movr> SET CLUSTER SETTING kv.rangefeed.enabled = TRUE;
demo@127.0.0.1:26257/movr> CREATE CHANGEFEED FOR TABLE teams family "f1"
INTO 'webhook-https://localhost:18080/cdc?insecure_tls_skip_verify=true'
WITH updated, full_table_name, topic_in_value;
Step 8: In a third terminal, load the data (the Perl script is included in this repo):
$ cd hot-fuzz/
$ DB_CONN_STR="postgresql://demo:demo15932@127.0.0.1:26257/movr?sslmode=require"
$ curl -s https://en.wikipedia.org/wiki/List_of_professional_sports_teams_in_the_United_States_and_Canada | ./prep_teams_data.pl | psql $DB_CONN_STR
Step 9: Finally, try out that /search endpoint:
$ name="PA Galuxy"; time curl -k -s https://localhost:18080/search/$( echo -n $name | base64 )/5 | ./pretty_print_json.py
[
{
"name": "LA Galaxy",
"pk": "73b541c1-e3d4-4f1d-8598-bedc11200f03",
"score": "42.8571"
},
{
"name": "LA Galaxy II",
"pk": "3a3b0e44-68cc-45da-aa72-9467e46f0458",
"score": "10.7143"
},
{
"name": "LA Galaxy II",
"pk": "9303e054-d6ad-4793-b904-9e3eeba28ff9",
"score": "10.7143"
},
{
"name": "LA Giltinis",
"pk": "9b8bfbb4-1ac6-477d-aaed-79daf1183f02",
"score": "4.7619"
},
{
"name": "Tampa Mayhem",
"pk": "f566f12f-7f8d-4ae6-b04b-832a5f9025c5",
"score": "3.5714"
}
]
real 0m0.049s
user 0m0.052s
sys 0m0.008s
That’s it!
If you’d like to
Acknowledgements
The author wishes to thank the following individuals for providing valuable input which made this blog post possible:
Dan Kelly, for mentioning the tweet which inspired the activity focused on n-grams
@schrepfler
, for tagging@CockroachDB
in the tweet about n-grams and fuzzy matchingAaron Zinger, for a heads up about the upcoming changefeed support for column families
Rebecca Taft, for pushing on the PR for index acceleration of the && operator
Rajiv Sharma, for providing that PR and for adding the final polish to it yesterday, so it could be merged
Jordan Lewis, for taking up the trigram cause in the core of CockroachDB itself