This page walks you through a simple demonstration of how CockroachDB can store and query unstructured JSONB data from a third-party API, as well as how an inverted index can optimize your queries.

Step 1. Install prerequisites

  • Install the latest version of CockroachDB.
  • Install the latest version of Go: brew install go
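  • Install the PostgreSQL driver for Go: go get github.com/lib/pq
  • To run the Python version of the sample instead, install the psycopg2 and requests packages that it imports (for example, with pip).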

Step 2. Start a single-node cluster

For the purpose of this tutorial, you need only one CockroachDB node running in insecure mode:

$ cockroach start \
--insecure \
--store=json-test \
--listen-addr=localhost:26257 \
--http-addr=localhost:8080

Step 3. Create a user

In a new terminal, as the root user, use the cockroach user command to create a new user, maxroach.

$ cockroach user set maxroach --insecure --host=localhost:26257

Step 4. Create a database and grant privileges

As the root user, open the built-in SQL client:

$ cockroach sql --insecure --host=localhost:26257

Next, create a database called jsonb_test:

> CREATE DATABASE jsonb_test;

Set the database as the default:

> SET DATABASE = jsonb_test;

Then grant privileges to the maxroach user:

> GRANT ALL ON DATABASE jsonb_test TO maxroach;
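
The sample programs in Step 6 connect to this database as maxroach on the insecure local node. As a minimal sketch, the same settings can be written as a PostgreSQL-style connection URL. The helper name connection_url is hypothetical, and the assumption is that your driver accepts URL-style connection strings (libpq-based drivers generally do):

```python
from urllib.parse import quote, urlencode


def connection_url(user, database, host="localhost", port=26257, sslmode="disable"):
    """Build a PostgreSQL-style URL from the settings used in this tutorial.

    Hypothetical helper for illustration; the tutorial's own programs pass
    these values as separate parameters instead.
    """
    query = urlencode({"sslmode": sslmode})
    return f"postgresql://{quote(user)}@{host}:{port}/{quote(database)}?{query}"


url = connection_url("maxroach", "jsonb_test")
print(url)
```

The sslmode=disable setting matches the --insecure flag used to start the node and is only appropriate for a local test cluster.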

Step 5. Create a table

Still in the SQL shell, create a table called programming:

> CREATE TABLE programming (
    id UUID DEFAULT uuid_v4()::UUID PRIMARY KEY,
    posts JSONB
  );
> SHOW CREATE programming;
+--------------+-------------------------------------------------+
|    Table     |                   CreateTable                   |
+--------------+-------------------------------------------------+
| programming  | CREATE TABLE programming (                      |
|              |     id UUID NOT NULL DEFAULT uuid_v4()::UUID,   |
|              |     posts JSON NULL,                            |
|              |     CONSTRAINT "primary" PRIMARY KEY (id ASC),  |
|              |     FAMILY "primary" (id, posts)                |
|              | )                                               |
+--------------+-------------------------------------------------+

Step 6. Run the code

Now that you have a database, a user, and a table, let's run code to insert rows into the table.

The code queries the Reddit API for posts in /r/programming. The Reddit API only returns 25 results per page; however, each page returns an "after" string that tells you how to get the next page. Therefore, the program does the following in a loop:

  1. Makes a request to the API.
  2. Inserts the results into the table and grabs the "after" string.
  3. Uses the new "after" string as the basis for the next request.

Download the json-sample.go file, or create the file yourself and copy the code into it:

package main

import (
    "database/sql"
    "fmt"
    "io/ioutil"
    "net/http"
    "time"

    _ "github.com/lib/pq"
)

func main() {
    db, err := sql.Open("postgres", "user=maxroach dbname=jsonb_test sslmode=disable port=26257")
    if err != nil {
        panic(err)
    }

    // The Reddit API wants us to tell it where to start from. The first request
    // we just say "null" to say "from the start", subsequent requests will use
    // the value received from the last call.
    after := "null"

    for i := 0; i < 300; i++ {
        after, err = makeReq(db, after)
        if err != nil {
            panic(err)
        }
        // Reddit limits to 30 requests per minute, so don't do any more than that.
        time.Sleep(2 * time.Second)
    }
}

func makeReq(db *sql.DB, after string) (string, error) {
    // First, make a request to reddit using the appropriate "after" string.
    client := &http.Client{}
    req, err := http.NewRequest("GET", fmt.Sprintf("https://www.reddit.com/r/programming.json?after=%s", after), nil)
    if err != nil {
        return "", err
    }

    req.Header.Add("User-Agent", `Go`)

    resp, err := client.Do(req)
    if err != nil {
        return "", err
    }
    // Close the response body so the underlying connection can be reused.
    defer resp.Body.Close()

    res, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        return "", err
    }

    // Now that we have the JSON from reddit, we can use a couple of SQL tricks
    // to accomplish multiple things at once.
    // The JSON reddit returns looks like this:
    // {
    //   "data": {
    //     "children": [ ... ],
    //     "after": ...
    //   }
    // }
    // We structure our query so that we extract the `children` field, and then
    // expand that and insert each individual element into the database as a
    // separate row. We then return the "after" field so we know how to make the
    // next request.
    r, err := db.Query(`
        INSERT INTO jsonb_test.programming (posts)
        SELECT json_array_elements($1->'data'->'children')
        RETURNING $1->'data'->'after'`,
        string(res))
    if err != nil {
        return "", err
    }

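    // Since we did a RETURNING, we need to grab the result of our query.
    // Closing the rows releases the connection back to the pool.
    defer r.Close()
    var newAfter string
    if r.Next() {
        if err := r.Scan(&newAfter); err != nil {
            return "", err
        }
    }

    return newAfter, r.Err()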
}

In a new terminal window, navigate to your sample code file and run it:

$ go run json-sample.go

If you prefer Python, an equivalent program is available. It queries the same Reddit API and follows the "after" string from page to page in the same way, but it reads the "after" string in Python before inserting the results. The program does the following in a loop:

  1. Makes a request to the API.
  2. Grabs the "after" string.
  3. Inserts the results into the table.
  4. Uses the new "after" string as the basis for the next request.
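
The loop above can be tried without network or database access. The following is a minimal sketch, not the tutorial's program: FAKE_PAGES, fetch_page, and crawl are hypothetical names, the two fake pages are made-up data shaped like the response the tutorial relies on, and stopping when "after" is empty is an assumption of this sketch.

```python
import json

# Hypothetical stand-in for the Reddit API, keyed by the "after" cursor.
# Each page has the data.children / data.after shape the tutorial relies on.
FAKE_PAGES = {
    "null": {"data": {"children": [{"data": {"id": "a"}}, {"data": {"id": "b"}}],
                      "after": "t3_b"}},
    "t3_b": {"data": {"children": [{"data": {"id": "c"}}],
                      "after": None}},
}


def fetch_page(after):
    """Return one page of results for the given cursor (no network involved)."""
    return FAKE_PAGES[after]


def crawl(max_requests):
    rows = []        # stands in for the programming table
    after = "null"   # "from the start"
    for _ in range(max_requests):
        page = fetch_page(after)
        # Grab the cursor for the next request.
        next_after = page["data"]["after"]
        # Store each element of "children" as its own row, which is what
        # json_array_elements does in the tutorial's INSERT statement.
        rows.extend(json.dumps(child) for child in page["data"]["children"])
        if next_after is None:
            break    # assumption of this sketch: stop when there is no next page
        after = next_after
    return rows


rows = crawl(max_requests=300)
print(len(rows))
```

Each stored row is one post, so later queries can filter on individual posts rather than on whole API pages.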

Download the json-sample.py file, or create the file yourself and copy the code into it:

import json
import psycopg2
import requests
import time

conn = psycopg2.connect(database="jsonb_test", user="maxroach", host="localhost", port=26257)
conn.set_session(autocommit=True)
cur = conn.cursor()

# The Reddit API wants us to tell it where to start from. The first request
# we just say "null" to say "from the start"; subsequent requests will use
# the value received from the last call.
url = "https://www.reddit.com/r/programming.json"
after = {"after": "null"}

for n in range(300):
    # First, make a request to reddit using the appropriate "after" string.
    req = requests.get(url, params=after, headers={"User-Agent": "Python"})

    # Decode the JSON and set "after" for the next request.
    resp = req.json()
    after = {"after": str(resp['data']['after'])}

    # Convert the JSON to a string to send to the database.
    data = json.dumps(resp)

    # The JSON reddit returns looks like this:
    # {
    #   "data": {
    #     "children": [ ... ],
    #     "after": ...
    #   }
    # }
    # We structure our query so that we extract the `children` field, and then
    # expand that and insert each individual element into the database as a
    # separate row.
    cur.execute("""INSERT INTO jsonb_test.programming (posts)
            SELECT json_array_elements(%s->'data'->'children')""", (data,))

    # Reddit limits to 30 requests per minute, so don't do any more than that.
    time.sleep(2)

cur.close()
conn.close()

In a new terminal window, navigate to your sample code file and run it:

$ python json-sample.py

The program takes a while to finish (300 requests spaced 2 seconds apart add up to at least 10 minutes), but you can start querying the data right away.

Step 7. Query the data

Back in the terminal where the SQL shell is running, verify that rows of data are being inserted into your table:

> SELECT count(*) FROM programming;
+-------+
| count |
+-------+
|  1120 |
+-------+
> SELECT count(*) FROM programming;
+-------+
| count |
+-------+
|  2400 |
+-------+

Now, retrieve all the current entries where the link is pointing to somewhere on GitHub:

> SELECT id FROM programming \
WHERE posts @> '{"data": {"domain": "github.com"}}';
+--------------------------------------+
|                  id                  |
+--------------------------------------+
| 0036d489-3fe3-46ec-8219-2eaee151af4b |
| 00538c2f-592f-436a-866f-d69b58e842b6 |
| 00aff68c-3867-4dfe-82b3-2a27262d5059 |
| 00cc3d4d-a8dd-4c9a-a732-00ed40e542b0 |
| 00ecd1dd-4d22-4af6-ac1c-1f07f3eba42b |
| 012de443-c7bf-461a-b563-925d34d1f996 |
| 014c0ac8-4b4e-4283-9722-1dd6c780f7a6 |
| 017bfb8b-008e-4df2-90e4-61573e3a3f62 |
| 0271741e-3f2a-4311-b57f-a75e5cc49b61 |
| 02f31c61-66a7-41ba-854e-1ece0736f06b |
| 035f31a1-b695-46be-8b22-469e8e755a50 |
| 03bd9793-7b1b-4f55-8cdd-99d18d6cb3ea |
| 03e0b1b4-42c3-4121-bda9-65bcb22dcf72 |
| 0453bc77-4349-4136-9b02-3a6353ea155e |
...
+--------------------------------------+
(334 rows)

Time: 105.877736ms

Note: Since you are querying live data, your results for this and the following steps may vary from the results documented in this tutorial.
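
The @> operator in the query above tests containment: a row matches when its posts value contains the structure on the right-hand side. The following sketch approximates that test on JSON decoded into Python objects. The function contains and the sample posts are hypothetical, and the sketch simplifies the database's rules (for example, it ignores any special handling of scalars inside arrays).

```python
def contains(doc, pattern):
    """Approximate `doc @> pattern` for JSON decoded into Python objects."""
    if isinstance(pattern, dict):
        # Every key in the pattern must exist in doc with a contained value.
        return isinstance(doc, dict) and all(
            key in doc and contains(doc[key], value)
            for key, value in pattern.items()
        )
    if isinstance(pattern, list):
        # Every pattern element must be contained in some element of doc.
        return isinstance(doc, list) and all(
            any(contains(item, wanted) for item in doc) for wanted in pattern
        )
    if isinstance(doc, bool) or isinstance(pattern, bool):
        # Keep JSON true/false distinct from the numbers 1 and 0.
        return doc is pattern
    # Remaining scalars must match exactly.
    return doc == pattern


# Made-up sample posts, shaped like the rows in the programming table.
posts = [
    {"data": {"domain": "github.com", "title": "A"}},
    {"data": {"domain": "example.com", "title": "B"}},
]
pattern = {"data": {"domain": "github.com"}}
matches = [p["data"]["title"] for p in posts if contains(p, pattern)]
print(matches)
```

Without an index, a filter like this has to examine every row, which is why the query time grows with the size of the table.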

Step 8. Create an inverted index to optimize performance

The query in the previous step took 105.877736ms. To optimize the performance of queries that filter on the JSONB column, let's create an inverted index on the column:

> CREATE INVERTED INDEX ON programming(posts);
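
To see why an inverted index helps, consider a toy model: instead of scanning every document, the index maps each (path, value) pair found in the documents to the set of rows that contain it, so a containment query becomes a few lookups and a set intersection. This is only an illustrative sketch; the names flatten, build_index, and lookup are hypothetical, and it does not reflect how CockroachDB actually encodes or stores its index entries.

```python
from collections import defaultdict


def flatten(doc, path=()):
    """Yield (path, scalar) pairs for every leaf value in a JSON document."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            yield from flatten(value, path + (key,))
    elif isinstance(doc, list):
        for item in doc:
            yield from flatten(item, path + ("[]",))
    else:
        yield path, doc


def build_index(rows):
    """Map each (path, value) pair to the set of row ids that contain it."""
    index = defaultdict(set)
    for row_id, doc in rows.items():
        for entry in flatten(doc):
            index[entry].add(row_id)
    return index


def lookup(index, pattern):
    """Return the ids of rows that contain every leaf of the pattern."""
    ids = None
    for entry in flatten(pattern):
        found = index.get(entry, set())
        ids = set(found) if ids is None else ids & found
    return ids or set()


# Made-up rows keyed by a simple integer id instead of a UUID.
rows = {
    1: {"data": {"domain": "github.com", "score": 10}},
    2: {"data": {"domain": "example.com", "score": 10}},
    3: {"data": {"domain": "github.com", "score": 3}},
}
index = build_index(rows)
result = lookup(index, {"data": {"domain": "github.com"}})
print(sorted(result))
```

The trade-off in this model is that every insert has to update one index entry per leaf value, in exchange for queries that no longer touch non-matching rows.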

Step 9. Run the query again

Now that there is an inverted index, the same query will run much faster:

> SELECT id FROM programming \
WHERE posts @> '{"data": {"domain": "github.com"}}';
(334 rows)

Time: 28.646769ms

Instead of 105.877736ms, the query now takes 28.646769ms, which is about 3.7 times faster.

What's next?

To go further, learn more about the JSONB data type and inverted indexes, and explore other core CockroachDB benefits and features.


