How do I bulk insert data into CockroachDB?

  • To bulk-insert data into an existing table, batch multiple rows in one multi-row INSERT statement, and do not wrap the INSERT statements in an explicit transaction. Experimentally determine the optimal batch size for your application by monitoring performance at different batch sizes (10 rows, 100 rows, 1,000 rows); see the sketch after this list.

    Note:

    You can also use the IMPORT INTO statement to bulk-insert CSV data into an existing table.

  • To bulk-insert data into a new table, the IMPORT statement performs better than INSERT. IMPORT can also be used to migrate data from other databases like MySQL, Oracle, and Postgres.
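
As a sketch, assuming an existing accounts table with id and balance columns, a batched multi-row INSERT looks like the following; tune the number of rows per statement experimentally:

> INSERT INTO accounts (id, balance) VALUES
        (1, 1000.50),
        (2, 250.00),
        (3, 700.25);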

How do I auto-generate unique row IDs in CockroachDB?

To auto-generate unique row IDs, use a UUID column with the gen_random_uuid() function as its default value:

> CREATE TABLE users (
        id UUID NOT NULL DEFAULT gen_random_uuid(),
        city STRING NOT NULL,
        name STRING NULL,
        address STRING NULL,
        credit_card STRING NULL,
        CONSTRAINT "primary" PRIMARY KEY (city ASC, id ASC),
        FAMILY "primary" (id, city, name, address, credit_card)
);
> INSERT INTO users (name, city) VALUES ('Petee', 'new york'), ('Eric', 'seattle'), ('Dan', 'seattle');
> SELECT * FROM users;
                   id                  |   city   | name  | address | credit_card
+--------------------------------------+----------+-------+---------+-------------+
  cf8ee4e2-cd74-449a-b6e6-a0fb2017baa4 | new york | Petee | NULL    | NULL
  2382564e-702f-42d9-a139-b6df535ae00a | seattle  | Eric  | NULL    | NULL
  7d27e40b-263a-4891-b29b-d59135e55650 | seattle  | Dan   | NULL    | NULL
(3 rows)

Alternatively, you can use a BYTES column with the uuid_v4() function as the default value:

> CREATE TABLE users2 (
        id BYTES DEFAULT uuid_v4(),
        city STRING NOT NULL,
        name STRING NULL,
        address STRING NULL,
        credit_card STRING NULL,
        CONSTRAINT "primary" PRIMARY KEY (city ASC, id ASC),
        FAMILY "primary" (id, city, name, address, credit_card)
);
> INSERT INTO users2 (name, city) VALUES ('Anna', 'new york'), ('Jonah', 'seattle'), ('Terry', 'chicago');
> SELECT * FROM users2;
                        id                       |   city   | name  | address | credit_card
+------------------------------------------------+----------+-------+---------+-------------+
  4\244\277\323/\261M\007\213\275*\0060\346\025z | chicago  | Terry | NULL    | NULL
  \273*t=u.F\010\274f/}\313\332\373a             | new york | Anna  | NULL    | NULL
  \004\\\364nP\024L)\252\364\222r$\274O0         | seattle  | Jonah | NULL    | NULL
(3 rows)

In either case, generated IDs will be 128-bit, large enough for there to be virtually no chance of generating non-unique values. Also, once the table grows beyond a single key-value range (more than 512 MiB by default), new IDs will be scattered across all of the table's ranges and, therefore, likely across different nodes. This means that multiple nodes will share in the load.

This approach has the disadvantage of creating a primary key that may not be useful in a query directly, which can require a join with another table or a secondary index.
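
For example, a sketch using the users table above (the index name is illustrative): a secondary index on name lets you look rows up by a meaningful column without knowing the UUID:

> CREATE INDEX users_name_idx ON users (name);
> SELECT id, city FROM users WHERE name = 'Eric';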

If it is important for generated IDs to be stored in the same key-value range, you can use an integer type with the unique_rowid() function as the default value, either explicitly or via the SERIAL pseudo-type:

> CREATE TABLE users3 (
        id INT DEFAULT unique_rowid(),
        city STRING NOT NULL,
        name STRING NULL,
        address STRING NULL,
        credit_card STRING NULL,
        CONSTRAINT "primary" PRIMARY KEY (city ASC, id ASC),
        FAMILY "primary" (id, city, name, address, credit_card)
);
> INSERT INTO users3 (name, city) VALUES ('Blake', 'chicago'), ('Hannah', 'seattle'), ('Bobby', 'seattle');
> SELECT * FROM users3;
          id         |  city   |  name  | address | credit_card
+--------------------+---------+--------+---------+-------------+
  469048192112197633 | chicago | Blake  | NULL    | NULL
  469048192112263169 | seattle | Hannah | NULL    | NULL
  469048192112295937 | seattle | Bobby  | NULL    | NULL
(3 rows)

Upon insert or upsert, the unique_rowid() function generates a default value from the timestamp and ID of the node executing the insert. Such time-ordered values are likely to be globally unique except in cases where a very large number of IDs (100,000+) are generated per node per second. Also, there can be gaps and the order is not completely guaranteed.

How do I generate unique, slowly increasing sequential numbers in CockroachDB?

Sequential numbers can be generated in CockroachDB using the unique_rowid() built-in function or using SQL sequences. However, note the following considerations:

  • Unless you need roughly-ordered numbers, we recommend using UUID values instead. See the previous FAQ for details.
  • Sequences produce unique values. However, not all values are guaranteed to be produced (e.g., when a transaction is canceled after it consumes a value) and the values may be slightly reordered (e.g., when a transaction that consumes a lower sequence number commits after a transaction that consumes a higher number).
  • For maximum performance, avoid using sequences or unique_rowid() to generate row IDs or indexed columns. Values generated in these ways are logically close to each other and can cause contention on a few data ranges during inserts. Instead, prefer UUID identifiers.
  • For performance reasons, we discourage indexing on sequential keys. If, however, you are working with a table that must be indexed on sequential keys, use hash-sharded indexes. Hash-sharded indexes distribute sequential traffic uniformly across ranges, eliminating single-range hotspots and improving write performance on sequentially-keyed indexes at a small cost to read performance. See the sketch after this list.
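
A sketch of a hash-sharded secondary index, assuming a hypothetical events table keyed on a sequential ts column (on versions where the feature is experimental, the session setting shown first may be required, and the exact USING HASH syntax can vary by version):

> SET experimental_enable_hash_sharded_indexes = true;
> CREATE INDEX events_ts_idx ON events (ts) USING HASH WITH BUCKET_COUNT = 8;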

What are the differences between UUID, sequences, and unique_rowid()?

| Property | UUID generated with uuid_v4() | INT generated with unique_rowid() | Sequences |
|---|---|---|---|
| Size | 16 bytes | 8 bytes | 1 to 8 bytes |
| Ordering properties | Unordered | Highly time-ordered | Highly time-ordered |
| Performance cost at generation | Small, scalable | Small, scalable | Variable, can cause contention |
| Value distribution | Uniformly distributed (128 bits) | Contains time and space (node ID) components | Dense, small values |
| Data locality | Maximally distributed | Values generated close in time are co-located | Highly local |
| INSERT latency when used as key | Small, insensitive to concurrency | Small, but increases with concurrent INSERTs | Higher |
| INSERT throughput when used as key | Highest | Limited by max throughput on 1 node | Limited by max throughput on 1 node |
| Read throughput when used as key | Highest (maximal parallelism) | Limited | Limited |

How do I order writes to a table to closely follow time in CockroachDB?

Most use cases that ask for a strong time-based write ordering can be solved with other, more distribution-friendly solutions instead. For example, CockroachDB's time travel queries (AS OF SYSTEM TIME) support the following:

  • Paginating through all the changes to a table or dataset
  • Determining the order of changes to data over time
  • Determining the state of data at some point in the past
  • Determining the changes to data between two points of time

Note also that the values generated by unique_rowid(), described in the previous FAQ entries, provide an approximate time ordering.

However, if your application absolutely requires strong time-based write ordering, it is possible to create a strictly monotonic counter in CockroachDB that increases over time as follows:

  • Initially: CREATE TABLE cnt(val INT PRIMARY KEY); INSERT INTO cnt(val) VALUES(1);
  • In each transaction: INSERT INTO cnt(val) SELECT max(val)+1 FROM cnt RETURNING val;

This will cause INSERT transactions to conflict with each other and effectively force the transactions to commit one at a time throughout the cluster, which in turn guarantees the values generated in this way are strictly increasing over time without gaps. The caveat is that performance is severely limited as a result.
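
A sketch of one such transaction, assuming the cnt table above and a hypothetical events table keyed on the counter value:

> BEGIN;
> INSERT INTO cnt(val) SELECT max(val)+1 FROM cnt RETURNING val;
  -- suppose the statement above returns 2; the application uses that value in its own write
> INSERT INTO events (seq, payload) VALUES (2, 'first event');
> COMMIT;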

If you find yourself interested in this problem, please contact us and describe your situation. We would be glad to help you find alternative solutions and possibly extend CockroachDB to better match your needs.

How do I get the last ID/SERIAL value inserted into a table?

There’s no function in CockroachDB for returning last inserted values, but you can use the RETURNING clause of the INSERT statement.

For example, this is how you’d use RETURNING to return a value auto-generated via unique_rowid() or SERIAL:

> CREATE TABLE users (id INT DEFAULT unique_rowid(), name STRING);
> INSERT INTO users (name) VALUES ('mike') RETURNING id;

What is transaction contention?

Transaction contention occurs when transactions issued from multiple clients at the same time operate on the same data. This can cause transactions to wait on each other and decrease performance, like when many people try to check out with the same cashier at a store.

For more information about contention, see Understanding and Avoiding Transaction Contention.

Does CockroachDB support JOIN?

CockroachDB supports SQL joins. We are working to improve their execution performance.

When should I use interleaved tables?

Interleaving tables improves query performance by optimizing the key-value structure of closely related tables, attempting to keep data on the same key-value range if it's likely to be read and written together.

You're most likely to benefit from interleaved tables when:

Does CockroachDB support JSON or Protobuf data types?

Yes, the JSONB data type is supported.
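
For example, a sketch with illustrative table and column names, storing and filtering JSONB documents:

> CREATE TABLE docs (id UUID DEFAULT gen_random_uuid() PRIMARY KEY, data JSONB);
> INSERT INTO docs (data) VALUES ('{"user": "Petee", "active": true}');
> SELECT data->>'user' FROM docs WHERE data->>'active' = 'true';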

How do I know which index CockroachDB will select for a query?

To see which indexes CockroachDB is using for a given query, you can use the EXPLAIN statement, which will print out the query plan, including any indexes that are being used:

> EXPLAIN SELECT col1 FROM tbl1;

If you'd like to tell the query planner which index to use, you can do so with the index hint syntax:

> SELECT col1 FROM tbl1@idx1;

How do I log SQL queries?

There are several ways to log SQL queries. The type of logging you use will depend on your requirements.

SQL audit logs

Warning:

This is an experimental feature. The interface and output are subject to change.

SQL audit logging is useful if you want to log all queries that are run against specific tables.

Cluster-wide execution logs

For production clusters, the best way to log all queries is to turn on the cluster-wide setting sql.trace.log_statement_execute:

> SET CLUSTER SETTING sql.trace.log_statement_execute = true;

With this setting on, each node of the cluster writes all SQL queries it executes to a secondary cockroach-sql-exec log file. Use the symlink cockroach-sql-exec.log to open the most recent log. When you no longer need to log queries, you can turn the setting back off:

> SET CLUSTER SETTING sql.trace.log_statement_execute = false;

Log files are written to CockroachDB's standard log directory.

Slow query logs

The sql.log.slow_query.latency_threshold cluster setting is used to log only queries whose service latency exceeds a specified threshold value (e.g., 100 milliseconds):

> SET CLUSTER SETTING sql.log.slow_query.latency_threshold = '100ms';

Each node that serves as a gateway will then record slow SQL queries to a cockroach-sql-slow log file. Use the symlink cockroach-sql-slow.log to open the most recent log. For more details on logging slow queries, see Using the slow query log.

Note:

Setting sql.log.slow_query.latency_threshold to a non-zero value enables tracing on all queries, which impacts performance. After debugging, set the value back to 0s to disable the log.

Log files are written to CockroachDB's standard log directory.

Authentication logs

Warning:

This is an experimental feature. The interface and output are subject to change.

SQL client connections can be logged by turning on the server.auth_log.sql_connections.enabled cluster setting:

> SET CLUSTER SETTING server.auth_log.sql_connections.enabled = true;

This will log connection established and connection terminated events to a cockroach-auth log file. Use the symlink cockroach-auth.log to open the most recent log.

Note:

In addition to SQL sessions, connection events can include SQL-based liveness probe attempts, as well as attempts to use the PostgreSQL cancel protocol.

This example log shows both types of connection events over a hostssl (TLS certificate over TCP) connection:

I200219 05:08:43.083907 5235 sql/pgwire/server.go:445  [n1,client=[::1]:34588] 22 received connection
I200219 05:08:44.171384 5235 sql/pgwire/server.go:453  [n1,client=[::1]:34588,hostssl] 26 disconnected; duration: 1.087489893s

Along with the above, SQL client authenticated sessions can be logged by turning on the server.auth_log.sql_sessions.enabled cluster setting:

> SET CLUSTER SETTING server.auth_log.sql_sessions.enabled = true;

This logs authentication method selection, authentication method application, authentication method result, and session termination events to the cockroach-auth log file. Use the symlink cockroach-auth.log to open the most recent log.

This example log shows authentication success over a hostssl (TLS certificate over TCP) connection:

I200219 05:08:43.089501 5149 sql/pgwire/auth.go:327  [n1,client=[::1]:34588,hostssl,user=root] 23 connection matches HBA rule:
# TYPE DATABASE USER ADDRESS METHOD        OPTIONS
host   all      root all     cert-password
I200219 05:08:43.091045 5149 sql/pgwire/auth.go:327  [n1,client=[::1]:34588,hostssl,user=root] 24 authentication succeeded
I200219 05:08:44.169684 5235 sql/pgwire/conn.go:216  [n1,client=[::1]:34588,hostssl,user=root] 25 session terminated; duration: 1.080240961s

This example log shows an authentication failure over a local (password over Unix socket) connection:

I200219 05:02:18.148961 1037 sql/pgwire/auth.go:327  [n1,client,local,user=root] 17 connection matches HBA rule:
# TYPE DATABASE USER ADDRESS METHOD   OPTIONS
local  all      all          password
I200219 05:02:18.151644 1037 sql/pgwire/auth.go:327  [n1,client,local,user=root] 18 user has no password defined
I200219 05:02:18.152863 1037 sql/pgwire/auth.go:327  [n1,client,local,user=root] 19 authentication failed: password authentication failed for user root
I200219 05:02:18.154168 1036 sql/pgwire/conn.go:216  [n1,client,local,user=root] 20 session terminated; duration: 5.261538ms

For complete logging of client connections, we recommend enabling both server.auth_log.sql_connections.enabled and server.auth_log.sql_sessions.enabled.

Note:

Be aware that both logs perform one disk I/O per event and will impact performance when enabled.

For more details on authentication and certificates, see Authentication.

Log files are written to CockroachDB's standard log directory.

Per-node execution logs

Alternatively, if you are testing CockroachDB locally and want to log queries executed just by a specific node, you can either pass a CLI flag at node startup, or execute a SQL function on a running node.

To start a new node with the CLI, pass the --vmodule flag to the cockroach start command. For example, to start a single node locally and log all client-generated SQL queries it executes, you'd run:

$ cockroach start --insecure --listen-addr=localhost --vmodule=exec_log=2 --join=<join addresses>
Tip:

To log CockroachDB-generated SQL queries as well, use --vmodule=exec_log=3.

From the SQL prompt on a running node, execute the crdb_internal.set_vmodule() function:

> SELECT crdb_internal.set_vmodule('exec_log=2');

This will result in the following output:

  crdb_internal.set_vmodule
+---------------------------+
                          0
(1 row)

Once the logging is enabled, all client-generated SQL queries executed by the node will be written to the primary CockroachDB log file as follows:

I180402 19:12:28.112957 394661 sql/exec_log.go:173  [n1,client=127.0.0.1:50155,user=root] exec "psql" {} "SELECT version()" {} 0.795 1 ""

Does CockroachDB support a UUID type?

Yes. For more details, see UUID.

How does CockroachDB sort results when ORDER BY is not used?

When an ORDER BY clause is not used in a query, rows are processed or returned in a non-deterministic order. "Non-deterministic" means that the actual order can depend on the logical plan, the order of data on disk, the topology of the CockroachDB cluster, and is generally variable over time.
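
For example, a sketch using the users table from earlier: adding an ORDER BY clause makes the row order deterministic:

> SELECT * FROM users ORDER BY city, name;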

Why are my INT columns returned as strings in JavaScript?

In CockroachDB, all INTs are represented with 64 bits of precision, but JavaScript numbers only have 53 bits of precision. This means that large integers stored in CockroachDB are not exactly representable as JavaScript numbers. For example, JavaScript will round the integer 235191684988928001 to the nearest representable value, 235191684988928000. Notice that the last digit is different. This is particularly problematic when using the unique_rowid() function, since unique_rowid() nearly always returns integers that require more than 53 bits of precision to represent.

To avoid this loss of precision, Node's pg driver will, by default, return all CockroachDB INTs as strings.

// Schema: CREATE TABLE users (id INT DEFAULT unique_rowid(), name STRING);
pgClient.query("SELECT id FROM users WHERE name = 'Roach' LIMIT 1", function(err, res) {
  var idString = res.rows[0].id;
  // idString === '235191684988928001'
  // typeof idString === 'string'
});

To perform another query using the value of idString, you can simply use idString directly, even where an INT type is expected. The string will automatically be coerced into a CockroachDB INT.

pgClient.query("UPDATE users SET name = 'Ms. Roach' WHERE id = $1", [idString], function(err, res) {
  // All should be well!
});

If you instead need to perform arithmetic on INTs in JavaScript, you will need to use a big integer library like Long.js. Do not use the built-in parseInt function.

parseInt(idString, 10) + 1; // WRONG: returns 235191684988928000
require('long').fromString(idString).add(1).toString(); // GOOD: returns '235191684988928002'

Can I use CockroachDB as a key-value store?

CockroachDB is a distributed SQL database built on a transactional and strongly-consistent key-value store. Although it is not possible to access the key-value store directly, you can mirror direct access using a "simple" table of two columns, with one set as the primary key:

> CREATE TABLE kv (k INT PRIMARY KEY, v BYTES);

When such a "simple" table has no indexes or foreign keys, INSERT/UPSERT/UPDATE/DELETE statements translate to key-value operations with minimal overhead (single digit percent slowdowns). For example, the following UPSERT to add or replace a row in the table would translate into a single key-value Put operation:

> UPSERT INTO kv VALUES (1, b'hello');

This SQL table approach also offers you a well-defined query language, a known transaction model, and the flexibility to add more columns to the table if the need arises.
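
Reads and deletes against this table are likewise single-key operations; a sketch using the kv table above:

> SELECT v FROM kv WHERE k = 1;
> DELETE FROM kv WHERE k = 1;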

Why are my deletes getting slower over time?

I need to delete a large amount of data. I'm iteratively deleting a certain number of rows using a DELETE statement with a LIMIT clause, but it's getting slower over time. Why?

CockroachDB relies on multi-version concurrency control (MVCC) to process concurrent requests while guaranteeing strong consistency. As such, when you delete a row, it is not immediately removed from disk. The MVCC values for the row will remain until the garbage collection period defined by the gc.ttlseconds variable in the applicable zone configuration has passed. By default, this period is 25 hours.

With the default settings, each iteration of your DELETE statement must therefore scan over all of the rows previously marked for deletion within the last 25 hours. For example, if you try to delete 10,000 rows 10 times within the same 25-hour period, the 10th command will have to scan over the 90,000 rows previously marked for deletion.

If you need to iteratively delete rows in constant time, you can alter your zone configuration and change gc.ttlseconds to a low value like 5 minutes (i.e., 300), and run your DELETE statement once per GC interval. We strongly recommend returning gc.ttlseconds to the default value after your large deletion is completed.
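
A sketch, assuming a hypothetical sessions table and permission to modify its zone configuration (the default gc.ttlseconds is 90000, i.e., 25 hours; exact zone configuration syntax may vary by version):

> ALTER TABLE sessions CONFIGURE ZONE USING gc.ttlseconds = 300;
> DELETE FROM sessions WHERE expires_at < now() LIMIT 10000;
  -- repeat the DELETE once per GC interval until done, then restore the default:
> ALTER TABLE sessions CONFIGURE ZONE USING gc.ttlseconds = 90000;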

For instructions showing how to delete specific rows, see Delete specific rows.
