How to send data to Azure Synapse with change data capture

Modern data architectures contain a multitude of tools, including something for transforming, shaping, and analyzing data to help inform better business decisions. For a mission-critical database like CockroachDB it’s important to integrate with such systems so that customer teams can put their data to work for the business. Combining a database like CockroachDB with analytics services gives businesses both transactional consistency and real-time data for business decisions.

In this tutorial we will look at how to send data using change data capture in CockroachDB to Azure Synapse. 

What is Azure Synapse?

Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing, and big data analytics. It gives you the freedom to query data on your terms, using either serverless or dedicated options, at scale. Azure Synapse brings these worlds together with a unified experience to ingest, explore, prepare, transform, manage, and serve data for immediate self-service BI and machine learning needs.

If you would like to follow along, these are the prerequisites:

• An Azure subscription with the Azure CLI (az) installed

• kubectl configured for an AKS cluster

• A secure CockroachDB cluster running in Azure Kubernetes Service

Send data from CockroachDB to Azure Synapse with change data capture

Step 1: Prepare your variables

To standardize the code in further steps a number of variables should be set.

loc1="uksouth"
clus1="crdb-aks-uksouth"
rg="mb-crdb-aks-multi-region"
loc1_cdc_storage_account="crdbuksouthcdc"
loc1_cdc_storagecontainer="uksouth-cdc"

Step 2: Create an Azure Storage Account and Container

CockroachDB’s Change Data Capture feature can send row-level changes to object storage, in this case Azure Blob Storage. Create a storage account first.

az storage account create \
  --name $loc1_cdc_storage_account \
  --resource-group $rg \
  --location $loc1 \
  --sku Standard_RAGRS \
  --kind StorageV2
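
If you want to verify the account before continuing, an optional check is to query its provisioning state, which reports Succeeded once the account is ready:

az storage account show \
  -n $loc1_cdc_storage_account \
  -g $rg \
  --query provisioningState \
  -o tsv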

Then create a storage container within that storage account.

az storage container create \
  -n $loc1_cdc_storagecontainer \
  --fail-on-exist \
  --account-name $loc1_cdc_storage_account

Now we have a destination to send the data to from CockroachDB.
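
As an optional sanity check, you can confirm the container exists before moving on:

az storage container exists \
  -n $loc1_cdc_storagecontainer \
  --account-name $loc1_cdc_storage_account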

Step 3: Upgrade Azure Blob Storage with Azure Data Lake Storage Gen2 capabilities 

To take advantage of Synapse, the storage account needs to be upgraded with Azure Data Lake Storage Gen2 capabilities. Run the command below to validate that the storage account can be upgraded.

az storage account hns-migration start --type validation -n $loc1_cdc_storage_account -g $rg

If the validation succeeds, the process completes and no errors appear.

If validation fails, a validation error will appear in the console. For example, the error "(IncompatibleValuesForAccountProperties) Values for account properties are incompatible: Versioning Enabled" indicates that an incompatible feature (versioning) is enabled on the account. In this case, you would disable the feature and then start the validation process again.

In some cases, the path to a file named error.json appears in the console. You can open that file to determine why the account did not pass the validation step.

The following JSON indicates that an incompatible feature is enabled on the account. In this case, you would disable the feature and then start the validation process again.

Example Output:

{
  "startTime": "2021-08-04T18:40:31.8465320Z",
  "id": "45c84a6d-6746-4142-8130-5ae9cfe013a0",
  "incompatibleFeatures": [
    "Blob Delete Retention Enabled"
  ],
  "blobValidationErrors": [],
  "scannedBlobCount": 0,
  "invalidBlobCount": 0,
  "endTime": "2021-08-04T18:40:34.9371480Z"
}

After your account has been successfully validated, start the upgrade by running the following command.

az storage account hns-migration start --type upgrade -n $loc1_cdc_storage_account -g $rg
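
The upgrade runs asynchronously. One optional way to confirm it has completed is to check the account's isHnsEnabled property, which reports true once the hierarchical namespace is enabled:

az storage account show \
  -n $loc1_cdc_storage_account \
  -g $rg \
  --query isHnsEnabled \
  -o tsv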

Step 4: Enable Change Data Capture

Change data capture (CDC) provides efficient, distributed, row-level changes into a configurable sink for downstream processing such as reporting, caching, or full-text indexing. In this case, we will configure a changefeed to send data to the storage account configured above, which has now been configured as a data lake.

Retrieve the storage account keys from Azure

To configure CDC we need to obtain an account key for our storage account, which allows CockroachDB to authenticate to it.

Run the following command to display your account key.

az storage account keys list -g $rg -n $loc1_cdc_storage_account --query "[0].value" -o tsv

The output will look like the example below. Because the key can contain characters that are not valid in a URL, we need to URL-encode it for a later step. There are a number of ways to URL-encode a string; one shell-based approach is shown after the examples below.

Example Output:

fot2mB/dHfvPm1PwDvjToArXxQcgDMITR8LQ3fDumqZVLpSUhY7UA7gDuRCsXVk88SMcikpd1AHj+ASt0jhDQw==

Below is an example of the URL-encoded account key.

fot2mB%2FdHfvPm1PwDvjToArXxQcgDMITR8LQ3fDumqZVLpSUhY7UA7gDuRCsXVk88SMcikpd1AHj%2BASt0jhDQw%3D%3D
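
One way to fetch and encode the key in a single step, a minimal sketch assuming python3 is available in your shell, is:

# Fetch the first account key and URL-encode it for use in the changefeed URI
key=$(az storage account keys list -g $rg -n $loc1_cdc_storage_account --query "[0].value" -o tsv)
encoded_key=$(python3 -c 'import urllib.parse, sys; print(urllib.parse.quote(sys.argv[1], safe=""))' "$key")
echo $encoded_key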

Configure CockroachDB CDC in Kubernetes

In this example, CockroachDB is running in Azure Kubernetes Service (AKS). If you would like instructions on how to build a multi-region AKS deployment for CockroachDB, take a look at this repo.

To connect to the cluster we deploy a pod running the CockroachDB binary with the required certificate. First, create the pod using the command below.

kubectl create -f https://raw.githubusercontent.com/cockroachdb/cockroach/master/cloud/kubernetes/multiregion/client-secure.yaml --namespace $loc1

Now that the pod is running, exec into it and start the cockroach sql shell.

kubectl exec -it cockroachdb-client-secure -n $loc1 -- ./cockroach sql --certs-dir=/cockroach-certs --host=cockroachdb-public

Now that you are connected to the pod with the SQL client, we can configure CDC. First, update the cluster setting kv.rangefeed.enabled to true.

SET CLUSTER SETTING kv.rangefeed.enabled = true;
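
You can confirm the setting took effect before moving on:

SHOW CLUSTER SETTING kv.rangefeed.enabled;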

Now that we have enabled the feature we can create our first CockroachDB changefeed. This changefeed will send all changes to the rides table to the Azure blob container we created earlier, which is now also an Azure Data Lake.

CREATE CHANGEFEED FOR TABLE rides
  INTO 'azure://uksouth-cdc?AZURE_ACCOUNT_NAME=crdbuksouthcdc&AZURE_ACCOUNT_KEY=fot2mB%2FdHfvPm1PwDvjToArXxQcgDMITR8LQ3fDumqZVLpSUhY7UA7gDuRCsXVk88SMcikpd1AHj%2BASt0jhDQw%3D%3D'
  WITH updated;
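
To confirm the changefeed was created and is running, you can list changefeed jobs (available as a dedicated statement in recent CockroachDB versions):

SHOW CHANGEFEED JOBS;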

To generate some changes to the rides table, we will exec into the secure pod we created earlier a second time. This time, however, we will use cockroach workload movr to perform INSERT, UPDATE, and DELETE operations against the table. Exec into the pod with the command below.

kubectl exec -it cockroachdb-client-secure -n $loc1 --context $clus1 -- sh

In this example there is already a SQL user called craig with the password cockroach and a database called movr. You will need to create this database and user, or change the credentials to match a user in your CockroachDB cluster, as sketched below.
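
A minimal sketch of creating that user and database from the SQL shell opened earlier (the names craig, cockroach, and movr come from this example; substitute your own):

-- Create the database and a SQL user for the workload
CREATE DATABASE IF NOT EXISTS movr;
CREATE USER IF NOT EXISTS craig WITH PASSWORD 'cockroach';
GRANT ALL ON DATABASE movr TO craig;

With the user in place, the command below creates the schema required for the workload.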

cockroach workload init movr 'postgresql://craig:cockroach@cockroachdb-public:26257/movr?sslmode=verify-full&sslrootcert=/cockroach-certs/ca.crt'

Once the database and schema are created we can generate some load with the command below.

cockroach workload run movr --tolerate-errors --duration=99999m 'postgresql://craig:cockroach@cockroachdb-public:26257/movr?sslmode=verify-full&sslrootcert=/cockroach-certs/ca.crt'
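
The changefeed writes newline-delimited JSON files into the container. An illustrative (not verbatim) example of a single emitted line, given the updated option and the movr rides schema, looks something like this:

{"after": {"city": "amsterdam", "id": "c2e21ca5-...", "revenue": 57.50, ...}, "key": ["amsterdam", "c2e21ca5-..."], "updated": "1670846677984671402.0000000000"}

Each line's after field holds the row's new values, which is why the Synapse query in the next step extracts fields with the JSON path $.after."column".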

Step 5: Run a query against Azure Data Lake Gen2 with Synapse

In the Azure Portal, create a Synapse workspace. In the search bar at the top, search for Synapse, click the Add button, and complete the wizard that is presented.


Once created, open Synapse Studio from the Getting Started tile in the main window.

[Image: Open Azure Synapse]

Once opened, Synapse Studio needs to be connected to external storage. This can be done by clicking Data in the left-hand menu and then clicking the + button. Select Connect to external data and complete the wizard to connect the Azure Data Lake.

[Image: Connect Synapse to external storage]

Synapse is now configured, and we can run an example SQL query. The query below reads all the available files in the Data Lake and returns a revenue total per city.

SELECT
    /* --> place the keys that you see in JSON documents in the WITH clause:
    , JSON_VALUE (line, '$.key1') AS header1
    , JSON_VALUE (line, '$.key2') AS header2
    */
    JSON_VALUE (line, '$.after."city"') AS City,
    SUM(CAST(JSON_VALUE (line, '$.after."revenue"') AS float)) AS Total_Revenue
FROM
    OPENROWSET(
        BULK 'https://crdbuksouthcdc.dfs.core.windows.net/uksouth-cdc/2022-12-12/*.ndjson',
        FORMAT = 'csv',
        -- FIELDQUOTE and FIELDTERMINATOR are set to 0x0b as we do not expect to find it in the file.
        FIELDQUOTE = '0x0b',
        FIELDTERMINATOR = '0x0b'
    )
    WITH (
        line varchar(max)
    ) AS [result]
GROUP BY JSON_VALUE (line, '$.after."city"');

You can see the totals represented as a bar chart in the image below.

[Image: Bar chart in Azure Synapse]

Transactional consistency and real-time data

In conclusion, combining analytical tools like Azure Synapse with transactional databases like CockroachDB can have a positive impact for companies that need transactional consistency and want to make data-driven choices in real time. Real-time data replication is made possible by CockroachDB’s Change Data Capture (CDC) capability, making it simple to integrate with analytical services like Azure Synapse. The scalability, resilience, and consistency properties of CockroachDB help ensure that data is correct and dependable despite heavy loads and failures, and CockroachDB’s data locality features facilitate global deployments and low-latency data access. In short, businesses can gain a competitive edge by combining CockroachDB with analytical tools like Azure Synapse to make data-driven decisions.

To learn more about change data capture in CockroachDB check out the following resources:

• [Blog] How to cut data streaming export costs by 40%

• [Blog] Exporting data with changefeeds

• [Video] CDC Transformations: How to protect user data
