This page provides best practices for optimizing import performance in CockroachDB.
Import speed primarily depends on the amount of data that you want to import. However, two main factors can have a large impact on the amount of time it takes to run an import:
- Splitting your data into multiple files
- Choosing a performant import format
If the import size is small, then you do not need to do anything to optimize performance. In this case, the import should run quickly, regardless of the settings.
Split your data into multiple files
Splitting the import data into multiple files can have a large impact on the import performance. The following formats support multi-file import:
AVRO, when the schema is provided in-line
For these formats, we recommend splitting your data into as many files as there are nodes.
For example, if you have a 3-node cluster, split your data into 3 files and import:
> IMPORT TABLE customers (
    id UUID PRIMARY KEY,
    name TEXT,
    INDEX name_idx (name)
  )
  CSV DATA (
    'https://s3-us-west-1.amazonaws.com/cockroachdb-movr/datasets/employees-db/pg_dump/customers.csv',
    'https://s3-us-west-1.amazonaws.com/cockroachdb-movr/datasets/employees-db/pg_dump/customers_2.csv',
    'https://s3-us-west-1.amazonaws.com/cockroachdb-movr/datasets/employees-db/pg_dump/customers_3.csv'
  );
CockroachDB imports the files that you give it and does not split them further. For example, if you import one large file for all of your data, CockroachDB will process that file on one node, even if you have more nodes available. However, if you import two files (and your cluster has at least two nodes), each node will process a file in parallel. This is why splitting your data into as many files as you have nodes will dramatically decrease the time it takes to import data.
If you split the data into more files than you have nodes, it will not have a large impact on performance.
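As a rough illustration of the splitting step, the following Python sketch distributes the rows of one CSV file round-robin across a chosen number of output files. The function name `split_csv`, the `<name>_<n>.csv` output naming, and the assumption that the source file has no header row are all illustrative choices, not part of CockroachDB or its tooling.

```python
import csv
from pathlib import Path
from typing import List


def split_csv(source: Path, out_dir: Path, num_files: int) -> List[Path]:
    """Distribute the rows of `source` round-robin across `num_files` CSV files.

    Assumes `source` has no header row (hypothetical input layout).
    """
    if num_files < 1:
        raise ValueError("num_files must be at least 1")
    out_dir.mkdir(parents=True, exist_ok=True)
    # Output names such as customers_1.csv, customers_2.csv are illustrative.
    paths = [out_dir / f"{source.stem}_{i + 1}.csv" for i in range(num_files)]
    handles = [p.open("w", newline="") for p in paths]
    try:
        writers = [csv.writer(h, lineterminator="\n") for h in handles]
        with source.open(newline="") as src:
            # csv.reader keeps quoted fields with embedded newlines in one row.
            for row_number, row in enumerate(csv.reader(src)):
                writers[row_number % num_files].writerow(row)
    finally:
        for h in handles:
            h.close()
    return paths
```

For a 3-node cluster, you might call `split_csv(Path("customers.csv"), Path("out"), 3)`, upload the resulting files to storage that every node can reach, and list all three URLs in the import statement.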
File storage during import
During migration, all of the features of IMPORT that interact with external file storage assume that every node has the exact same view of that storage. In other words, in order to import from a file, every node needs to have the same access to that file.
Choose a performant import format
Import formats do not have the same performance because of the way they are processed. Below, import formats are listed from fastest to slowest:
- CSV or DELIMITED DATA (both have about the same import performance)
- AVRO
- MYSQLDUMP
- PGDUMP
We recommend formatting your import files as CSV, DELIMITED DATA, or AVRO. These formats can be processed in parallel by multiple threads, which increases performance.
MYSQLDUMP and PGDUMP run a single thread to parse their data and therefore have substantially slower performance.
MYSQLDUMP and PGDUMP are two examples of "bundled" data. This means that the dump file contains both the table schema and the data to import. These formats are the slowest to import, with PGDUMP being the slower of the two. This is because CockroachDB has to first load the whole file, read the whole file to get the schema, create the table with that schema, and then import the data. While these formats are slow, there are a couple of things you can do to speed up a bundled data import:
- Provide the table schema in-line
- Import the schema separately from the data
As of v21.2, certain IMPORT TABLE statements that define the table schema inline are deprecated. To import data into a new table, use CREATE TABLE followed by IMPORT INTO. For an example, read Import into a new table from a CSV file.
Provide the table schema in-line
When importing bundled data formats, it is often faster to provide the schema for the imported table in-line. For example, instead of importing both the table schema and data from the same file:
> IMPORT TABLE employees
  FROM PGDUMP 'https://s3-us-west-1.amazonaws.com/cockroachdb-movr/datasets/employees-db/pg_dump/employees-full.sql'
  WITH ignore_unsupported_statements;
You can dump the table data into a CSV file and provide the table schema in the statement:
> IMPORT TABLE employees (
    id UUID PRIMARY KEY,
    name STRING
  )
  CSV DATA (
    'https://s3-us-west-1.amazonaws.com/cockroachdb-movr/datasets/employees-db/pg_dump/employees-full.csv'
  );
If you need to import multiple tables, you can start multiple IMPORT jobs to import tables in parallel from the same import file.
Import the schema separately from the data
For bundled formats such as PGDUMP, split your dump data into two files:
- A SQL file containing the table schema
- A CSV file containing the table data
Then, import the schema-only file:
> IMPORT TABLE customers
  FROM PGDUMP 'https://s3-us-west-1.amazonaws.com/cockroachdb-movr/datasets/employees-db/pg_dump/customers.sql'
  WITH ignore_unsupported_statements;
Next, use the IMPORT INTO statement to import the CSV data into the newly created table:
> IMPORT INTO customers (id, name)
  CSV DATA (
    'https://s3-us-west-1.amazonaws.com/cockroachdb-movr/datasets/employees-db/pg_dump/customers.csv'
  );
This method has the added benefit of surfacing potential issues with the import sooner: a schema error shows up when the small schema-only file is imported, so you do not have to wait for a file containing both the schema and the data to load just to find that error.
Import into a schema with secondary indexes
When importing data into a table with secondary indexes, the import job will ingest the table data and required secondary index data concurrently. This may result in a longer import time compared to a table without secondary indexes. However, this typically adds less time to the initial import than following it with a separate pass to add the indexes. As a result, importing tables with their secondary indexes is the default workflow, suitable for most import jobs.
However, in large imports, it may be preferable to remove the secondary indexes from the schema, perform the import, and then re-create the indexes separately. This provides increased visibility into the progress of each step and the ability to retry each step independently.
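The three-step workflow for large imports can be sketched as a small Python helper that produces the SQL statements in the order they would run. This is a minimal sketch, not a definitive implementation: the helper name `large_import_plan` is hypothetical, it assumes the table already exists with its secondary indexes, that the data is in CSV files, and that all identifiers and URLs are trusted (nothing is escaped). Running each statement separately is what gives per-step visibility and the ability to retry.

```python
from typing import Dict, List


def large_import_plan(
    table: str,
    columns: List[str],
    file_urls: List[str],
    secondary_indexes: Dict[str, List[str]],
) -> List[str]:
    """Return SQL statements: drop secondary indexes, import, re-create them.

    `secondary_indexes` maps an index name to the columns it covers.
    Identifiers and URLs are assumed to be trusted; nothing is escaped.
    """
    # Step 1: remove the secondary indexes so the import only writes table data.
    drop = [f"DROP INDEX {table}@{name};" for name in secondary_indexes]
    # Step 2: import every file in a single IMPORT INTO statement.
    files = ", ".join(f"'{url}'" for url in file_urls)
    load = f"IMPORT INTO {table} ({', '.join(columns)}) CSV DATA ({files});"
    # Step 3: re-create each index as its own statement, retryable on its own.
    create = [
        f"CREATE INDEX {name} ON {table} ({', '.join(cols)});"
        for name, cols in secondary_indexes.items()
    ]
    return drop + [load] + create
```

For the `customers` table used earlier, passing `{"name_idx": ["name"]}` as `secondary_indexes` yields one `DROP INDEX`, one `IMPORT INTO`, and one `CREATE INDEX` statement.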
Data type sizes
Above a certain size, many data types such as JSONB may run into performance issues due to write amplification. See each data type's documentation for its recommended size limits.