blog-banner

Using lateral joins for business analytics in CockroachDB 20.1

Last edited on July 16, 2020

0 minute read

    CockroachDB’s 20.1 release supports lateral joins, which can dramatically reduce the lines of code needed to arrive at key business analytics. Today, I’m going to walk you through a demo of lateral joins and showcase how you might use them to run analytic queries directly in CockroachDB.

    What’s a lateral join?Copy Icon

    A lateral join is a type of correlated subquery that allows data sources to reference columns provided by previous data sources in the same query. In practice, this can be used to iterate over each row of a result, similar to a nested foreach loop. We will walk through an example below using CockroachDB 20.1 and our fictional ride-sharing company, MovR.

    Cluster SetupCopy Icon

    First, let’s set up CockroachDB 20.1 using your preferred method including Linux, Mac, or Windows binaries, Docker images, or building directly from source.

    This example will focus on the MovR rides table that is defined as:

    CREATE TABLE rides (                   id UUID NOT NULL,                   city VARCHAR NOT NULL,                   vehicle_city VARCHAR NULL,                   rider_id UUID NULL,                   vehicle_id UUID NULL,                   start_address VARCHAR NULL,                   end_address VARCHAR NULL,                   start_time TIMESTAMP NULL,                   end_time TIMESTAMP NULL,                   revenue DECIMAL(10,2) NULL                                        )

    Note, I’ve simplified this table a bit in case you want to follow along and copy-paste this table and the subsequent queries into CockroachDB. You can also use the new CockroachDB demo feature that we recently blogged about to try this out in less than five minutes.

    To use CockroachDB demo (and the more complex rides table with pre-populated data) enter the following command:

    ./cockroach demo

    This command will automatically load the MovR tables and data:

    show tables; table_name +----------------------------+ promo_codes rides user_promo_codes users vehicle_location_histories vehicles (6 rows)

    MovR OKRS: A Lateral Join ExampleCopy Icon

    The executive team at MovR has set a number of important Objectives and Key Results (OKRs) for MovR. The top objective for this year is to increase riders who use MovR. To do that, the executive team determined that the most important key result is to lower the average time between customers’ first ride and their second ride. This is an important metric because it helps customers build a habit of using MovR and increases the lifetime value of the customer, a key input into profitability.

    MovR uses CockroachDB as its OLTP system of record and it keeps track of every ride for each user. While we can export any data from CockroachDB via change data capture (CDC) into an analytics oriented database, in this case, we don’t need to as we can derive this answer directly in CockroachDB using SQL’s Lateral Join syntax.

    Eventually, we will have a fairly complex 25 line query. Before we get to the full query, let’s start in the middle and work our way out. First, we need to calculate the first time a user took a ride with MovR:

    SELECT     rider_id, 1 AS first_ride,     min(start_time) AS first_ride_time   FROM rides   GROUP BY rider_id

    This query finds the minimum start time of a ride for a given rider_id and alias it to the variable first_ride_time. It also keeps track of whether or not a user has completed a first ride by entering either 1 or null into a first_ride column.

    Next, we need to determine if the customer took as second ride, and, if so, determine how long after the first ride that the customer took their second ride:

    SELECT   1 AS second_ride,   start_time AS second_ride_time FROM rides WHERE   rider_id = r1.rider_id AND  start_time > first_ride_time ORDER BY start_time LIMIT 1

    Similarly to above, we are calculating a second_ride_time variable based on start_time. However, this time we require the start time to be greater than the variable we previously defined as first_ride_time. This ensures that we get our second ride time.

    Altogether, we can pair these queries into our larger query:

    SELECT  rider_id,  first_ride,  first_ride_time,  second_ride,  second_ride_time FROM (     --get the first ride time  SELECT    rider_id,     1 AS first_ride,     min(start_time) AS first_ride_time   FROM rides   GROUP BY rider_id ) r1 LEFT JOIN LATERAL (     -- Get the second ride time SELECT   1 AS second_ride,   start_time AS second_ride_time FROM rides WHERE   rider_id = r1.rider_id AND  start_time > first_ride_time ORDER BY start_time LIMIT 1 ) r2 ON true

    This will produce a table that looks like:

                rider_id    | first_ride | first_ride_time          | second_ride | second_ride_time +--------------------------------------+------------+----------------------------------+-------------+----------------------------------+   04553af9-dc34-461a-95fa-8c9e7123a613 |          1 | 2019-11-13 16:10:00.627167+00:00 | 1 | 2019-11-13 16:12:59.356821+00:00   77cacc32-57dc-4f52-83a4-d50a9a65b8bd |          1 | 2019-11-13 16:06:33.618505+00:00 | 1 | 2019-11-13 16:06:37.002664+00:00   c2c25f39-d2a0-480f-a5bb-36ec5436e34f |          1 | 2019-11-13 16:35:06.696277+00:00 | NULL | NULL   418c9765-4a04-4728-9f5d-981d3a6e9c5d |          1 | 2019-11-13 16:06:28.37234+00:00 | 1 | 2019-11-13 16:19:33.451015+00:00   4be831d4-9115-412b-b2c4-dceb92c81e10 |          1 | 2019-11-13 16:15:32.557127+00:00 | 1 | 2019-11-13 16:20:33.258865+00:00

    This is great, but we can do even better by directly making the comparison within CockroachDB using the avg comparison on the two defined variables: first_ride_time and second_ride_time.

    SELECT   avg(second_ride_time::DECIMAL - first_ride_time::DECIMAL) as average_time_to_second_ride FROM (     --get the first ride time  SELECT    rider_id,     1 AS first_ride,     min(start_time) AS first_ride_time   FROM rides   GROUP BY rider_id ) r1 LEFT JOIN LATERAL (     -- Get the second ride time SELECT   1 AS second_ride,   start_time AS second_ride_time FROM rides WHERE   rider_id = r1.rider_id AND  start_time > first_ride_time ORDER BY start_time LIMIT 1 ) r2 ON true This will produce a result like:  average_time_to_second_ride +-----------------------------+        38690.915436623268698 (1 row)   Time: 16.576ms

    *Note, if you are following along at home you may see a different final result because we generate random data in the MovR application.

    It this case it takes about 10 hours for our riders to complete their second ride. Now we have actionable data as a baseline for our OKR. We can measure the success of our efforts to lower the time to second ride as a key result moving forward. It may also begin to clue us in on a specific pattern--MovR customers may be using MovR to commute to and from work, a time of approximately 10 hours on average.

    Try Lateral Joins in CockroachDB 20.1Copy Icon

    CockroachDB’s 20.1 release now supports lateral joins, which can dramatically reduce the lines of code needed to arrive at key business analytics. It opens up a whole new world of possibilities for application developers to enhance the power of correlated subqueries. As we show above, a LEFT JOIN to a LATERAL subquery can provide the source rows in the result even if the LATERAL subquery produces no rows for them. To try out lateral joins for yourself, test out CockroachDB 20.1!

    distributed SQL
    join
    get started
    alpha
    engineering