Import Performance Best Practices

Warning:

As of November 18, 2022, CockroachDB v21.1 is no longer supported. For more details, refer to the Release Support Policy.

This page provides best practices for optimizing import performance in CockroachDB.

Import speed primarily depends on the amount of data that you want to import. However, two main factors can have a large impact on how long an import takes:

Splitting data
Import format

Note:

If the import size is small, then you do not need to do anything to optimize performance. In this case, the import should run quickly, regardless of the settings.

Split your data into multiple files

Splitting the import data into multiple files can have a large impact on the import performance. The following formats support multi-file import:

CSV
DELIMITED DATA
AVRO, when the schema is provided in-line

For these formats, we recommend splitting your data into as many files as there are nodes.

For example, if you have a 3-node cluster, split your data into 3 files and import:

> IMPORT TABLE customers (
        id UUID PRIMARY KEY,
        name TEXT,
        INDEX name_idx (name)
)
CSV DATA (
    'https://s3-us-west-1.amazonaws.com/cockroachdb-movr/datasets/employees-db/pg_dump/customers.csv',
    'https://s3-us-west-1.amazonaws.com/cockroachdb-movr/datasets/employees-db/pg_dump/customers_2.csv',
    'https://s3-us-west-1.amazonaws.com/cockroachdb-movr/datasets/employees-db/pg_dump/customers_3.csv'
);

CockroachDB imports the files that you give it and does not split them further. For example, if you import one large file containing all of your data, CockroachDB will process that file on one node, even if you have more nodes available. However, if you import two files (and your cluster has at least two nodes), two nodes will each process one file in parallel. This is why splitting your data into as many files as you have nodes can dramatically decrease the time it takes to import data.

Note:

If you split the data into more files than you have nodes, it will not have a large impact on performance.

Choose a performant import format

Import formats differ in performance because of the way they are processed. Below, import formats are listed from fastest to slowest:

CSV or DELIMITED DATA (both have about the same import performance)
AVRO
MYSQLDUMP
PGDUMP

We recommend formatting your import files as CSV, DELIMITED DATA, or AVRO. These formats can be processed in parallel by multiple threads, which increases performance.

However, MYSQLDUMP and PGDUMP run a single thread to parse their data, and therefore have substantially slower performance.

MYSQLDUMP and PGDUMP are two examples of "bundled" data. This means that the dump file contains both the table schema and the data to import. These formats are the slowest to import, with PGDUMP being the slower of the two. This is because CockroachDB has to first load the whole file, read the whole file to get the schema, create the table with that schema, and then import the data. While these formats are slow, there are a couple of things you can do to speed up a bundled data import:

Provide the table schema in-line
Import the schema separately from the data

Note:

As of v21.2, certain IMPORT TABLE statements that define the table schema in-line are deprecated. To import data into a new table, use CREATE TABLE followed by IMPORT INTO. For an example, read Import into a new table from a CSV file.
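
As a rough sketch of the CREATE TABLE followed by IMPORT INTO workflow that the note above mentions, the statements below create the customers table first and then load it from three files. The table definition and file URLs are reused from the multi-file example earlier on this page, and the column list assumes each CSV file contains only the id and name columns. Treat all of them as placeholders for your own schema and storage locations.

```sql
-- Create the table up front (schema reused from the earlier example;
-- substitute your own table definition).
CREATE TABLE customers (
    id UUID PRIMARY KEY,
    name TEXT,
    INDEX name_idx (name)
);

-- Load the data into the new table. Listing one file per node
-- (three files for a 3-node cluster) lets the files be processed in parallel.
IMPORT INTO customers (id, name)
CSV DATA (
    'https://s3-us-west-1.amazonaws.com/cockroachdb-movr/datasets/employees-db/pg_dump/customers.csv',
    'https://s3-us-west-1.amazonaws.com/cockroachdb-movr/datasets/employees-db/pg_dump/customers_2.csv',
    'https://s3-us-west-1.amazonaws.com/cockroachdb-movr/datasets/employees-db/pg_dump/customers_3.csv'
);
```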

Provide the table schema in-line

When importing bundled data formats, it is often faster to provide the schema for the imported table in-line. For example, instead of importing both the table schema and data from the same file:

> IMPORT TABLE employees
FROM PGDUMP
    'https://s3-us-west-1.amazonaws.com/cockroachdb-movr/datasets/employees-db/pg_dump/employees-full.sql' WITH ignore_unsupported_statements
;

You can dump the table data into a CSV file and provide the table schema in the statement:

> IMPORT TABLE employees (
        id UUID PRIMARY KEY,
        name STRING
)
CSV DATA (
    'https://s3-us-west-1.amazonaws.com/cockroachdb-movr/datasets/employees-db/pg_dump/employees-full.csv'
);

Tip:

If you need to import multiple tables, you can start multiple IMPORT jobs to import tables in parallel from the same import file.

Import the schema separately from the data

For single-table MYSQLDUMP or PGDUMP imports, split your dump data into two files:

A SQL file containing the table schema
A CSV file containing the table data

Then, import the schema-only file:

> IMPORT TABLE customers
FROM PGDUMP
    'https://s3-us-west-1.amazonaws.com/cockroachdb-movr/datasets/employees-db/pg_dump/customers.sql' WITH ignore_unsupported_statements
;

And use the IMPORT INTO statement to import the CSV data into the newly created table:

> IMPORT INTO customers (id, name)
CSV DATA (
    'https://s3-us-west-1.amazonaws.com/cockroachdb-movr/datasets/employees-db/pg_dump/customers.csv'
);

This method has the added benefit of alerting you to potential issues with the import sooner; that is, you will not have to wait for the file to load both the schema and data just to find an error in the schema.

Import into a schema with secondary indexes

When importing data into a table with secondary indexes, the import job will ingest the table data and required secondary index data concurrently. This may result in a longer import time compared to a table without secondary indexes. However, this typically adds less time to the initial import than following it with a separate pass to add the indexes. As a result, importing tables with their secondary indexes is the default workflow, suitable for most import jobs.

However, in large imports, it may be preferable to remove the secondary indexes from the schema, perform the import, and then re-create the indexes separately. This approach provides increased visibility into the progress of each step and the ability to retry each step independently.
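
The alternative workflow for large imports described above might look like the following sketch. It assumes the customers table and the name_idx index from the earlier example on this page, and reuses one of that example's file URLs as a placeholder. The final query is one possible way to check on a running import; it assumes you want to filter the SHOW JOBS output down to import jobs.

```sql
-- Step 1: create the table without its secondary index.
CREATE TABLE customers (
    id UUID PRIMARY KEY,
    name TEXT
);

-- Step 2: import the data into the index-free table.
IMPORT INTO customers (id, name)
CSV DATA (
    'https://s3-us-west-1.amazonaws.com/cockroachdb-movr/datasets/employees-db/pg_dump/customers.csv'
);

-- Step 3: re-create the secondary index as a separate step,
-- which can be retried on its own if it fails.
CREATE INDEX name_idx ON customers (name);

-- Optional: check the status and progress of import jobs.
SELECT job_id, status, fraction_completed
FROM [SHOW JOBS]
WHERE job_type = 'IMPORT';
```

Because each step is its own statement, a failure while building the index does not require repeating the data import.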
