
Integrating OLTP and OLAP systems: Enhanced decision making with CockroachDB, BigQueryML and Vertex AI

Last edited on April 9, 2024


    In the evolving landscape of database technology, the distinction between Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) systems is fundamental. OLTP systems, designed for managing transactional data, prioritize speed and reliability in data handling. In contrast, OLAP systems are structured to facilitate complex queries and analytics, analyzing massive quantities of data and offering deep insights into data patterns over time.

    By integrating OLTP and OLAP systems and facilitating the immediate accessibility of transactional data, organizations can significantly enhance their capabilities around business intelligence, predictive analytics, and operational efficiency.

    CockroachDB and Google Cloud BigQuery stand out in their respective domains of OLTP and OLAP for their scalability, reliability, and performance. This blog explores how enterprises can leverage CockroachDB as an OLTP database, use BigQuery for OLAP, and further enhance their data capabilities using BigQueryML and Vertex AI for machine learning.

    Understanding CockroachDB for OLTP

    The Essence of OLTP

    Online Transaction Processing (OLTP) databases are at the heart of most mission-critical business operations, handling everything from customer orders to real-time inventory management. These systems are optimized for high transaction volumes, ensuring data integrity and speed in processing multiple concurrent transactions.

    Why CockroachDB?

    CockroachDB is a distributed SQL database designed for cloud services, offering a unique blend of traditional SQL database ease-of-use with the scalability and resilience of NoSQL systems. It excels in OLTP scenarios thanks to its:

    • Distributed Nature: Automatically replicates and distributes data across multiple nodes, ensuring high availability and fault tolerance.

    • Strong Consistency: Provides serializable isolation, the highest level of data consistency, ensuring accurate and reliable transaction processing.

    • Scalability: Effortlessly scales horizontally, allowing businesses to handle growing transaction volumes without compromising performance or consistency.

    • Operational Simplicity: Streamlines database management tasks like scaling, rebalancing, and recovery, significantly reducing operational overhead.

    • Survivability: CockroachDB is designed to survive node, availability zone and even cloud region failures with minimal latency disruption and no manual intervention. This level of resilience makes it exceptionally reliable for mission-critical applications.

    By leveraging CockroachDB for OLTP, businesses can ensure that their transactional systems are not only robust and efficient but also ready to meet future demands. You can read more about CockroachDB’s architecture here.

    Elevating Analytical Processing with BigQuery

    The Significance of OLAP

    Online Analytical Processing (OLAP) systems are designed for query-intensive data analysis, offering insights into historical data across various dimensions. Unlike OLTP systems that focus on fast transaction processing, OLAP is optimized for complex queries and analytics, enabling businesses to derive meaningful insights from their data.

    BigQuery as an OLAP Solution

    BigQuery stands as a leading OLAP solution, delivering unparalleled query performance, scalability, and operational efficiency for enterprise-grade data warehousing and analytics initiatives. BigQuery’s serverless architecture allows analysts and data scientists to run queries on massive datasets without managing underlying infrastructure, making it an ideal platform for exploring and analyzing transactional data processed by OLTP systems like CockroachDB.

    Benefits of Using BigQuery

    • Speed and Scalability: BigQuery’s distributed architecture enables rapid query execution across petabytes of data, facilitating real-time analytics and decision-making.

    • Serverless: No infrastructure to manage means you can focus on analyzing data without worrying about database administration.

    • Integration Ecosystem: Seamless integration with other Google Cloud services, including AI and machine learning tools, enhances analytical capabilities.

    By leveraging BigQuery for OLAP, organizations can unlock deep insights into their data, driving better business decisions and strategies.

    Integrating CockroachDB with BigQuery via Pub/Sub

    To harness the full power of transactional data for analytical processing, integrating CockroachDB with BigQuery is essential. This integration enables the seamless flow of data from the OLTP system (CockroachDB) to the OLAP system (BigQuery) for in-depth analysis.

    A Practical Use Case

    [Diagram: data flowing from CockroachDB through Pub/Sub into BigQuery, with BigQueryML and Vertex AI on the analytics side]

    In the example below, we will create a pipeline to replicate data from CockroachDB to BigQuery using Google Cloud Pub/Sub. The data we’ll be using is sales data from a fictional eCommerce clothing store called Atom Fashion. Atom Fashion is using CockroachDB as the OLTP database for its online sales. As new orders are placed, the sales data is replicated to BigQuery.

    Once the data is in BigQuery, the Atom Fashion Marketing team will take advantage of the BigQuery Machine Learning (BQML) integration with Vertex AI, Google Cloud’s comprehensive platform for AI, as well as Gemini Pro, Google’s most powerful multi-modal AI model. We’ll ask the model to generate customized emails to a specific subset of customers offering them a discount on future purchases based on their past purchase history.

    Performing this email marketing campaign manually could take days or even weeks. However, using the combined power of CockroachDB and Google Cloud, you should be able to walk through this tutorial and create a successful custom email marketing campaign based on your most recent sales data in under an hour!

    Leveraging BigQueryML & Vertex AI for Machine Learning

    With data seamlessly flowing from CockroachDB to BigQuery, leveraging BigQueryML can unlock powerful predictive insights and automate decision-making processes. BigQueryML enables users to create and execute machine learning models directly within Google BigQuery using SQL queries, simplifying the path from data analytics to machine learning.

    When it comes to elevating these capabilities, the integration between BigQueryML and Vertex AI, Google Cloud’s unified machine learning platform, takes it a step further. This integration provides an intuitive way for BigQueryML to handle communication with Vertex AI, thus providing a cohesive workflow for users. Vertex AI provides an end-to-end environment for building, deploying and scaling ML models while also offering tools that integrate smoothly with BigQueryML models.

    BigQueryML supports various machine learning models, including linear regression for forecasting, logistic regression for classification, and k-means clustering for segmentation. Once you have your foundational model trained, Vertex AI allows you to refine and optimize these models with its advanced MLOps tools. With Vertex AI, you can also experiment with more complex models, automate the ML lifecycle and deploy models to production with ease.
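
    To make this concrete, here is a minimal, hypothetical sketch of what a BQML model definition looks like: a k-means model that segments customers by order count and total spend, written entirely in SQL. The customer_orders table and its columns are illustrative stand-ins for whatever flattened view you build from your replicated order data; they are not objects created elsewhere in this tutorial.

    -- Hypothetical BQML sketch: segment customers with k-means directly in SQL.
    -- The customer_orders table is an assumed, flattened view of order data.
    CREATE OR REPLACE MODEL `cockroach-project.cockroach_pubsub_data.customer_segments`
    OPTIONS (model_type = 'kmeans', num_clusters = 4) AS
    SELECT
      COUNT(*)        AS order_count,   -- feature 1: how often the customer orders
      SUM(sale_price) AS total_spend    -- feature 2: how much the customer spends in total
    FROM `cockroach-project.cockroach_pubsub_data.customer_orders`
    GROUP BY user_id;

    Running ML.PREDICT against the resulting model then assigns each customer to a segment, all without moving data out of BigQuery.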

    Step-by-Step Integration Guide

    The instructions in this guide demonstrate how to build this integration using the Google Cloud command line interface. All of these steps can be executed using the Google Cloud UI as well.

    1. Create a new Google Cloud Project (This step is optional)

    If you’d like to keep your work from this tutorial separate from other work that you have in your Google Cloud Console, you may want to consider creating a separate Google Cloud Project for storing all of the objects that you’ll be creating in the following steps. The instructions for creating a new project can be found here. The examples in this document will use a project named cockroach-project. If you use a different project, please replace all references to cockroach-project in this document with your own project ID.
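
    If you prefer the command line, the equivalent steps look roughly like the sketch below. Note that project IDs are globally unique, so you may need to choose a different ID, and you will also need to link a billing account to the new project before using BigQuery and Pub/Sub.

    # Create the project and make it the default for subsequent gcloud and bq commands.
    # Project IDs are globally unique; adjust the ID if cockroach-project is taken.
    gcloud projects create cockroach-project --name="CockroachDB OLTP-OLAP tutorial"
    gcloud config set project cockroach-project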

    2. Create a CockroachDB Cluster on Google Cloud

    In this tutorial, you will set up a fully managed CockroachDB cluster on Google Cloud. You can choose between two deployment options: CockroachDB Dedicated or CockroachDB Serverless. To set up a CockroachDB Dedicated cluster on Google Cloud, follow these steps; to set up a CockroachDB Serverless cluster on Google Cloud, follow the steps here. Once your database cluster is configured, move on to the next step to load data onto your CockroachDB cluster.

    3. Load the sample data into CockroachDB

    This tutorial uses data from a fictional eCommerce store. For detailed instructions on loading this data into CockroachDB, see Appendix A at the end of this document.

    4. Enable Pub/Sub in CockroachDB

    The next step is to configure CockroachDB to publish changes to a Google Cloud Pub/Sub topic. This involves setting up change data capture (CDC) on the tables you want to analyze, ensuring that every insert, update, or delete is captured in real time. Read the note below and then follow the instructions in the official Cockroach Labs documentation to set up Pub/Sub as the changefeed sink for CockroachDB.

    Note: For this tutorial, we will be making the following changes to the instructions specified in the previous paragraph:

    • Since we will be providing our own data source for this tutorial, Step #3 in the instructions above (running the CockroachDB Movr application to load data) is not necessary.

    • We’ll be replacing the Topic and Subscription names with names that more accurately reflect our data.

    You’ll make these changes in Step #8 and Step #10 of those instructions; a sketch of the resulting changefeed statement is shown below.
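
    This is only a sketch: the exact URI parameters depend on your cluster and credentials. AUTH=specified with a base64-encoded service account key (shown here as the placeholder {BASE64_KEY}) is one common option, and the linked documentation remains the authoritative reference.

    -- Sketch of the changefeed that streams order_items changes to the Pub/Sub topic.
    -- {BASE64_KEY} is a placeholder for your base64-encoded service account key.
    CREATE CHANGEFEED FOR TABLE public.order_items
      INTO 'gcpubsub://cockroach-project?topic_name=order_items_crdb&AUTH=specified&CREDENTIALS={BASE64_KEY}';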

    5. Create a BigQuery Dataset

    We’ll start by creating a new dataset named cockroach_pubsub_data to keep the BigQuery objects created in this tutorial organized. In the following step, we’ll create a table in this dataset to hold the data that will be replicated from CockroachDB.

    bq --location=US mk --dataset \
      --default_table_expiration=0 \
      cockroach-project:cockroach_pubsub_data

    6. Create a Table for the Pub/Sub Data

    Now that we have a dataset to keep us organized, we’ll create a table named order_items_data to hold the data that Pub/Sub will be writing to BigQuery. Each row from the CockroachDB order_items table will be stored in the data column as a single JSON object, and we will add a few additional columns for storing Pub/Sub metadata.

    bq mk --table \
      cockroach_pubsub_data.order_items_data \
      subscription_name:STRING,message_id:STRING,publish_time:TIMESTAMP,data:STRING,attributes:STRING

    7. Grant permissions for the Pub/Sub service account

    The Pub/Sub service account needs to be granted permissions to interact with the new BigQuery dataset.

    gcloud projects add-iam-policy-binding cockroach-project \
      --member=serviceAccount:service-$(gcloud projects describe cockroach-project --format="value(projectNumber)")@gcp-sa-pubsub.iam.gserviceaccount.com \
      --role=roles/bigquery.dataEditor

    gcloud projects add-iam-policy-binding cockroach-project \
      --member=serviceAccount:service-$(gcloud projects describe cockroach-project --format="value(projectNumber)")@gcp-sa-pubsub.iam.gserviceaccount.com \
      --role=roles/bigquery.user

    8. Configure a BigQuery Subscription

    Create a subscription in Google Cloud Pub/Sub that consumes messages from the order_items_crdb topic you configured in Step #4 above. This subscription will write all of the changes made to the order_items table in CockroachDB directly into BigQuery.

    gcloud pubsub subscriptions create order_items_crdb_bq_sub \
      --project=cockroach-project \
      --topic=order_items_crdb \
      --bigquery-table=cockroach-project:cockroach_pubsub_data.order_items_data \
      --write-metadata

    9. Start the flow of data

    In order to simulate the flow of data into the CockroachDB table order_items, we’ll insert a subset of the data that is in the order_items_stage table. Inserting the data into the order_items table will result in the changefeed that you created on the order_items table flowing all of the newly inserted rows to the order_items_crdb Pub/Sub topic. The subscription that you created in the previous step, order_items_crdb_bq_sub, will take all of the data from the Pub/Sub topic and insert it into the order_items_data table in BigQuery.

    INSERT INTO public.order_items
    SELECT * FROM public.order_items_stage WHERE id >= 1 AND id < 1001;

    As you continue your testing, you can run this statement multiple times using different ranges on the id column to simulate more order data flowing through the system.
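
    Before moving on, it’s worth confirming that messages are actually landing in BigQuery. The query below is a small sanity check, assuming the same "$.Value.after" JSON envelope that the marketing query later in this tutorial relies on: it pulls a few order fields out of the changefeed payload for the most recently published messages.

    -- Sanity check: extract a few order fields from the replicated changefeed JSON.
    SELECT
      JSON_EXTRACT(data, "$.Value.after.order_id")   AS order_id,
      JSON_EXTRACT(data, "$.Value.after.user_id")    AS user_id,
      JSON_EXTRACT(data, "$.Value.after.sale_price") AS sale_price,
      publish_time
    FROM `cockroach-project.cockroach_pubsub_data.order_items_data`
    ORDER BY publish_time DESC
    LIMIT 5;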

    10. Running BQML with Gemini Pro on the Replicated Data

    Now that we have the changefeed operational and order data flowing into BigQuery, we can take advantage of BigQuery’s integrated machine learning functionality (BQML) and the BQML integration with Vertex AI to create a marketing campaign that utilizes the power of Gemini Pro and Generative AI.

    Prerequisite work:

    0. Open a notebook from the BigQuery console

    1. Create an external connection to Cloud Resources and name it crdb_bqml. An external connection allows BigQuery to call other GCP services, such as Vertex AI. Also install the jq package, which will be used later for extracting JSON data.

    2. Use the +Code button to add a new code block to the notebook:

    [Screenshot: adding a code block in the BigQuery notebook]

    # Double check the data is in the dataset

    # Create a BigQuery Connection to use Gemini Pro, UPDATE THE PROJECT_ID
    !bq mk \
      --connection \
      --location=US \
      --project_id=cockroach-project \
      --connection_type=CLOUD_RESOURCE crdb_bqml

    # install jq for json extraction
    !apt install jq

    3. Get the email for the service account associated with the new connection:

    # from the output, copy the service account email
    !bq show --format=json --connection cockroach-project.us.crdb_bqml | jq '.cloudResource.serviceAccountId'

    The output will look similar to this: "bqcx-7504964XXXX-5odo@gcp-sa-bigquery-condel.iam.gserviceaccount.com"

    4. Using the service account email from the previous step, assign the aiplatform.user role to the service account. This allows the service account to use Vertex AI services.

    # Update the SERVICE_ACCOUNT field
    !gcloud projects add-iam-policy-binding cockroach-project \
      --member='serviceAccount:<SERVICE ACCOUNT EMAIL FROM THE PREVIOUS STEP>' \
      --role='roles/aiplatform.user' \
      --condition=None

    5. Create a model to allow BigQuery to use the connection that we created earlier for making calls to Gemini Pro.

    %%bigquery
    -- Note: the %%bigquery cell magic must be the first line of the notebook cell
    CREATE OR REPLACE MODEL `cockroach-project.cockroach_pubsub_data.crdb_bq_model`
    REMOTE WITH CONNECTION `us.crdb_bqml`
    OPTIONS (ENDPOINT = 'gemini-pro');

    6. Run a job to identify users that have purchased sunglasses and create an email template offering them a 5% discount for a second pair. This job may take a couple of minutes to run.

    %%bigquery
    WITH item_order AS (
      SELECT
        json_extract(data, "$.Value.after.product_id") AS product_id,
        json_extract(data, "$.Value.after.user_id") AS user_id
      FROM `cockroach-project.cockroach_pubsub_data.order_items_data`
    ),
    stg_pre_ai AS (
      SELECT
        user_id,
        name AS product_name,
        brand,
        first_name,
        last_name
      FROM item_order
      JOIN `bigquery-public-data.thelook_ecommerce.products` AS product
        ON item_order.product_id = CAST(product.id AS STRING)
      JOIN `bigquery-public-data.thelook_ecommerce.users` AS users
        ON item_order.user_id = CAST(users.id AS STRING)
      GROUP BY 1, 2, 3, 4, 5
    )
    SELECT
      ml_generate_text_llm_result,
      product_name,
      prompt
    FROM ML.GENERATE_TEXT(
      MODEL `cockroach-project.cockroach_pubsub_data.crdb_bq_model`,
      (
        SELECT
          CONCAT('''You are an expert marketer from Atom Fashion working on an engaging marketing campaign.
    Your company uses an informal tone and uses references to the beach.
    Your task is to generate an html email for a specific customer. Follow these steps:
    1. Identify from the context if the customer has bought sunglasses; if they purchased sunglasses before, generate an email with the following steps. If they bought a different product, do not generate an email and return NO_MATCH.
    2. If the customer purchased sunglasses, congratulate them for their previous purchase and offer a 5% discount (code SUPERFUN5)
    3. If the customer has not purchased sunglasses, return an empty html.
    4. The email should not have more than 500 tokens in length.
    5. Return the code as plain html code
    6. The email should begin with <html> and end with </html>
    Always think step by step.
    context:\nProduct Name:''', product_name, '''\nCustomer Name:''', first_name, '''\n HTML CODE: ''') AS prompt,
          product_name,
          first_name,
          last_name
        FROM stg_pre_ai
        LIMIT 200
      ),
      STRUCT(
        0.2 AS temperature,
        500 AS max_output_tokens,
        TRUE AS flatten_json_output
      )
    )
    WHERE ml_generate_text_llm_result LIKE '%html%'

    7. Once your results have been returned (which may take a few minutes), press the “Prompt” button to see the results in an easier-to-read format.

    8. Your results should look similar to this:

    9. After copying the html document to a file and opening it with a web browser, the resulting email should look similar to this. Notice how the email is personalized in the style that we specified in the prompt. You’ll also see that the message content may be unique for each customer, while still following the guidelines that we specified.

    Advantages of using CockroachDB, BigQueryML & Vertex AI

    • Robust Data Management: CockroachDB offers distributed SQL capabilities, ensuring high availability and consistency for transactional data.

    • BigQuery DataFrames: Leveraging BigQuery DataFrames enables efficient handling of large datasets, offering access to a rich set of APIs for analyzing data without having to move it out of BigQuery.

    • Seamless Analytics Integration: BigQueryML allows for direct machine learning model training within the data warehouse, streamlining the transition from analytics to insights.

    • Advanced Machine Learning Operations: Vertex AI provides comprehensive tools for refining, deploying, and managing machine learning models, enhancing AI-driven solutions.

    • Operational Efficiency: The integration optimizes data flow from transactional processing to analytical insights and machine learning application, improving overall business operations.

    • Strategic Insight and Agility: Together, these technologies equip organizations with the means to swiftly adapt and innovate, leveraging data for strategic advantage.

    Conclusion

    The strategic integration of CockroachDB, Pub/Sub, BigQuery, and Vertex AI offers enterprises a powerful pathway to extract actionable insights from their transactional data and make faster decisions. CockroachDB ensures robust, scalable, and consistent transactional data management, and its integration with the Google Cloud ecosystem sets the stage for deep analytical processing. This lays the foundation for BigQueryML to build machine learning models within the data warehouse itself, requiring only SQL proficiency.

    Vertex AI extends this capability with tools to refine, deploy, and scale machine learning models, enhancing applications with predictive insights and AI-driven functionality. This approach not only streamlines the journey from the data collected at the initial transaction to analytical insight, but also allows the enterprise to realize the full potential of its data. By adopting the capabilities of CockroachDB, BigQueryML, and Vertex AI, enterprises can make a significant leap in operational agility and strategic foresight.

    Appendix A: Loading the sample data

    If you’d like to try this example yourself, the data that is being used comes from the BigQuery public dataset bigquery-public-data.thelook_ecommerce. The data from the table bigquery-public-data.thelook_ecommerce.order_items will be exported from BigQuery to a .csv file and then imported into CockroachDB.

    NOTE: The data in this table is regenerated daily, so expect some variation between the output in this document and the actual output that you receive.

    Step 1: Create a Google Cloud Storage Bucket

    Create a bucket in Google Cloud storage for the data that you will be exporting from the BigQuery Public Dataset.

    gcloud storage buckets create gs://cockroach-sample-data --project=cockroach-project --location=US

    Step 2: Export the data

    Export the data from the BigQuery Public Dataset into the bucket that you created in the previous step.

    bq extract --destination_format=CSV  bigquery-public-data:thelook_ecommerce.order_items gs://cockroach-sample-data/order_items.csv

    Step 3: Create a Service Account

    In order for CockroachDB to be able to access the CSV file in your Google Cloud bucket, you need to create a Service Account with sufficient permissions for CockroachDB to access the data.

    gcloud iam service-accounts create cockroach-bigquery-import --project cockroach-project

    Step 4: Add Permissions to the new Service Account

    Add the permissions required for CockroachDB to read the exported data from the bucket.

    gsutil iam ch serviceAccount:cockroach-bigquery-import@cockroach-project.iam.gserviceaccount.com:objectViewer,legacyBucketReader gs://cockroach-sample-data

    Step 5: Create and Download a Service Account Key

    Create a JSON key file for your service account. This key file is used by CockroachDB for accessing the data in your Google Cloud Storage bucket. Running this command will create the JSON key file cockroach_bigquery_export_key.json in your current working directory.

    gcloud iam service-accounts keys create cockroach_bigquery_export_key.json --project=cockroach-project --iam-account=cockroach-bigquery-import@cockroach-project.iam.gserviceaccount.com

    Step 6: Create a staging table in CockroachDB

    Create a table to store the data that you’ll be importing by running the command in the SQL Shell or any other SQL editor you prefer.

    CREATE TABLE public.order_items_stage (
      id INT8 NOT NULL,
      order_id INT8 NULL,
      user_id INT8 NULL,
      product_id INT8 NULL,
      inventory_item_id INT8 NULL,
      status STRING NULL,
      created_at TIMESTAMP NULL,
      shipped_at TIMESTAMP NULL,
      delivered_at TIMESTAMP NULL,
      returned_at TIMESTAMP NULL,
      sale_price FLOAT8 NULL,
      CONSTRAINT order_items_pkey PRIMARY KEY (id ASC)
    );

    Step 7: Base64 encode the key file

    From a terminal window on your local machine, base64 encode the key file that you downloaded in Step 5.

    On a Mac:

    cat cockroach_bigquery_export_key.json | base64

    On Linux (to prevent the output lines from wrapping):

    cat cockroach_bigquery_export_key.json | base64 -w 0

    Step 8: Import the data

    Import the data from your Google Cloud storage bucket into the table that you created in step 6 and provide your base64 encoded key file for authentication. The base64 encoded data follows the CREDENTIALS= in the IMPORT statement. The import may take a few minutes and the table will be unavailable while the data is being imported.

    IMPORT INTO public.order_items_stage (
      id, order_id, user_id, product_id, inventory_item_id, status,
      created_at, shipped_at, delivered_at, returned_at, sale_price
    )
    CSV DATA ('gs://cockroach-sample-data/order_items.csv?AUTH=specified&CREDENTIALS=ewog......p9Cg==')
    WITH skip = '1', DETACHED;
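
    Because the statement uses the DETACHED option, it returns a job ID immediately instead of blocking until the import completes. One way to check on the job’s progress from the SQL shell:

    -- List import jobs and their progress; the detached IMPORT will appear here.
    SELECT job_id, status, fraction_completed
    FROM [SHOW JOBS]
    WHERE job_type = 'IMPORT';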

    Step 9: Create an empty table for replication

    Create an empty table with the same structure as the order_items_stage table. This table will be configured for Change Data Capture and Pub/Sub in a later step.

    CREATE TABLE public.order_items (LIKE public.order_items_stage INCLUDING ALL);
