Memory usage in CockroachDB

In this blog post, we provide some details on how CockroachDB uses system memory on each node, and what you can do to keep memory usage in CockroachDB under control.

Overview

To understand memory usage in CockroachDB, and specifically within a CockroachDB node, it is perhaps useful to imagine memory like a giant cake that is being split up in pieces and distributed to “eat” CockroachDB’s various components.

There are three main cake eaters in CockroachDB; in approximately decreasing order of appetite:

Cache memory. The biggest part of this is RocksDB’s cache. CockroachDB is based off RocksDB’s KV database, and RocksDB allocates a very large chunk of system memory to serve as a shared cache between all KV operations performed by a node.
Memory used at rest. This is composed of:
- Replica metadata. Each store holds in memory the metadata for the replicas available in that store.
- Metadata for other nodes. Each node maintains data about every other peer in the cluster.
Memory used by activity. Mainly:
- Store processing structures. Each store maintains multiple operation queues of variable sizes; under load these contain data for ongoing transactions.
- SQL data structures. For every client, the node will instantiate many data structures in memory to support the execution of SQL statements.

These consumers share in common that the amount of memory they need varies depending on run-time parameters. (There are other fixed-size usages of memory in CockroachDB, but it is negligible in comparison with those above.)

RocksDB’s cache is pretty straightforward: it will consume up to the --cache-size parameter’s value worth of memory, by default 25% of total system memory, and no more.

The node metadata, in comparison with the other consumers, is negligible, even for clusters with dozens of nodes.

The other items – metadata for replicas and memory used for activity – need more attention and are the subject of this blog post.

Memory usage by replica metadata

On each node, store memory usage increases mainly with the number of stores times the number of replicas per store.

How do database parameters and status influence per-store memory consumption?

If your database grows larger, the number of ranges increases. Since CockroachDB tries to distribute data across nodes (assuming no further zone configuration here), the number of replicas at each nodes will also increase, roughly uniformly.
If you change the replication factor, the number of replicas increases or decreases accordingly.
When you change the number of stores, all other parameters kept equal, replicas and queue load will be spread across the available stores. So if you configure more stores, there will be more queues but the number of replicas and queue entries per store will decrease overall. So overall memory usage per node may change a little but not significantly.

Therefore we can say that memory usage related to stores at rest evolves approximately with the following formula:

k1 × (DataSize × ReplicationFactor) / NumNodes + other

Where k1 is the metadata size per replica; and “other” is the rest of the memory usage per store, which is negligible in comparison. From a complexity theory perspective, we say that store memory usage is:

O( (1 / NumNodes) × ( DataSize × ReplicationFactor) ).

The main variables in this equation are available from CockroachDB’s admin UI: the number of nodes and replicas per node.

Although the precise value of the parameter k1 is not directly available, an operator can estimate the overall memory incurred by stores. This can be done simply by first disabling SQL clients temporarily from one of the cluster nodes (or alternatively, by adding a new node without clients and letting the replicas migrate to the new node), ensure that distributed SQL queries are not run on the isolated node, then observe how much system memory is needed for the node, and subtract the RocksDB cache size from this. The result gives the overall cost of replica metadata.

Another characteristic of the at-rest memory usage incurred by stores is that it evolves slowly and predictably: the overall database size is typically an application-level metric that closely follows business growth, and the number of nodes is is typically changed under operator supervision.

Finally, note how how the formula above has the number of nodes as divisor: adding more nodes, keeping all parameters equal, will decrease memory usage per node. This property can be used advantageously when running off cloud resources, where memory is often not priced linearly: adding 10 new nodes with 2GB of RAM each is usually cheaper than switching five existing nodes from 2GB to 6GB each. Scale can lower your costs!

Memory usage incurred by Store activity

Under load, each node receives requests to update stores locally and across the network. These requests are queued until they are processed. Bursts of activity cause data to pile up in queues.

How do database parameters and status influence this memory consumption?

If you increase the load on the cluster (more active transactions overall) the number of in-flight operations stored in the queues increases, although of course the activity is also spread across all nodes.
Memory usage in queues increases with activity but decreases as nodes process the queued operations. With faster CPU/disk speeds, the overall memory usage due to queued backlog become lower.
When you change the number of stores, all other parameters kept equal, queue load will be spread across the available stores. So if you configure more stores, there will be more queues but the number of queue entries per store will decrease overall. So overall memory usage per node related to store activity may change a little but not significantly.

Therefore we can say that memory usage related to store activity evolves approximately with the following formula:

k2 × NumInFlightOperations / (NumNodes × NodeSpeed) + other

Where k2 is the memory costs incurred per data accumulated in some queues per active transaction; and “other” is the rest of the memory usage due to activity load, which is negligible in comparison. From a complexity theory perspective, we say that memory usage related to store activity is:

O( (1 / NumNodes) × ( NumInFlightOperations / NodeSpeed) ).

Again the admin UI provides some insight by showing the number of SQL clients and queries executed, which in a production database is likely to correspond more or less with the load on store.

This usage is less predictable than at-rest memory usage. Although the number of nodes is a known and relatively stable value, the number of in-flight operations can evolve unpredictable. For example, a sudden increase in contention (e.g. a single post going viral, leading to contention on transactions updating a hit counter) could cause large spikes in the number of in-flight operations.

This said, notice how the processing speed of nodes is also the divisor in this formula. As long as the nodes running CockroachDB are fast enough (this includes both CPU and disk speeds), the stores will be able to absorb spikes in activity more effectively. In a way, hardware performance dampens the unpredictability of store activity.

Memory usage incurred by SQL activity

Each SQL client in CockroachDB that issues queries or statements will incur memory allocations on (at least) the node where the query was received.

There are three overarching concepts that you need to understand before analyzing SQL memory usage further.

The first is that SQL-related memory usage in CockroachDB is organized around the notion of a “session”. Sessions are not only the logical span of time between the point the SQL client connects to the point it disconnects; they also determine the lifetime of SQL-related memory allocations: most SQL allocations are scoped to the statement or transaction; only a small amount persists to the end of the session, and in any case no later than that.
The second is that sessions are created both by external client applications connecting to CockroachDB, and by CockroachDB components internally. For example, when a node needs to log some error message, the error message is also saved to the database using SQL. To do this, the node creates an “internal session” and issues SQL statements to itself. Henceforth, we will say that there are multiple channels that open and maintain SQL sessions in a CockroachDB node; some externally visible (external clients, admin UI) and some invisible (internal components).
Understanding which channels exist in CockroachDB helps answering the question “Where are sessions coming from?” and thus “How many sessions are open on this node?”
Finally, CockroachDB’s architecture is layered. Approximately, clients connect to the SQL API, which is based on services from the KV API, which relies on the distributed store API, which itself relies on the store API. Each of these abstraction layers will allocate data structures at its level on behalf of the SQL session. Therefore, beyond first-order memory costs that are easily understandable by casual SQL users, like in-memory tables, there are also indirect costs to each session made by the layers below SQL. Understanding indirect costs helps answering the question “How much memory is allocated on behalf of a single session?”

Once you grok these three concepts, you can understand the ground formula that determines SQL-related memory usage in CockroachDB:

NumSessions × ( SQL-level memory usage + indirect costs)

NumSessions depends on activity over the various channels. We detail these variables in the following sub-sections.

Channels: where SQL sessions appear

The most obvious channel through which SQL enters CockroachDB is client connections using the PostgreSQL wire protocol (“pgwire”).

In addition to this “main” channel, sessions are also created via:

External channels:
- Uses of the Admin RPC over HTTP. This is mainly used by CockroachDB’s Admin UI. There is at most one active session per RPC call.
- Built-in control commands like cockroach user. Here too there is at most one active session per command running.
Internal channels:
- Event logging: Up to one session is opened for each logged server event and per-replica maintenance operation. This may sound like a prohibitive overhead, since logging is usually a hotspot, but CockroachDB uses a lightweight form of session for this activity specifically.
- Backfilling: One session to support populating newly added columns or indexes. Each node can be running a single backfill per table. With N tables undergoing schema changes, you can see N backfills running simultaneously, with one session per backfill.
- Table lease updates: One internal session every time the node “takes a lease” on a table. Taking (or releasing) a lease means marking the table as being in use on that node; this is done by updating a lease table in the database using SQL. Leases are cached, so that lease-related SQL activity is amortized across multiple SQL transactions.

Notice how table lease updates are not performed on behalf of client sessions, and are instead performed using an internal session. This is mainly because lease cache updates require a different set of database permissions than those available in client session.

Per-session memory usage

When a SQL session appears on a node, a couple of data structures of negligible size are allocated in memory, then the session starts accepting statement (or queries) by clients.

The statements are organized in transactions: single-statement “implicit” transactions for statements issued outside of a BEGIN-COMMIT block, and “explicit” transactions where the BEGIN-COMMIT blocks is defined by the client.

Memory allocations on behalf of a session can thus be further divided into:

session-wide allocations, which can stay active until the end of a session, spanning possibly multiple transactions,
transaction-specific allocations, which are guaranteed to be released when a transaction is committed or aborted, and
session-lingering allocations, for results that stay in the session after a transaction completes and until the client processes the data.

For now, the only session-wide allocations are for prepared statements and portals. The pgwire protocol allows clients to “prepare” statements once in the session, which can be reused multiple times after that without spelling out their SQL in full. After a statement has been prepared, its corresponding SQL code and data structures remain in the node’s memory until the session ends or the client deallocates the prepared statement. Portals add parameters to prepared statements in the node’s memory, and have a similar lifetime.

Next to that, the only session-lingering allocations are result sets. After a query completes, and until the response of the query is fully transmitted back to the client, the results of that query are retained in the node’s memory. We call these packets of results “buffered result sets”. These result sets can become arbitrarily large depending on the SQL code being executed. (Note that we have plans to reduce the amount of buffering for results sets, but this is not implemented at the time of this writing.)

The remainder of SQL memory usage is caused by transaction-specific allocations.

Of these, the bulk of memory usage is caused by in-memory row storage:

Sets of rows before sorting. When you use ORDER BY using non-indexed columns, the server sorts in-memory, so the rows need to fit there.
Unique prefixes. When you use DISTINCT, the server keeps an in-memory hash table of the values already seen to filter out duplicates; this grows linearly with the number of unique values.
Window partitions. When you use window functions, the server keeps partitioned rows in memory for window processing, too.
Unique prefixes for UNION, INTERSECT and EXCEPT.
Aggregate results during grouping. When you use GROUP BY, the server keeps one aggregate row in memory for each distinct group in the input data.
Intermediate JOIN results. In our preliminary implementation, this grows linearly with the full size of one of the JOIN operands; then even with a better JOIN (in the future), you can expect memory usage to grow linearly with the size of the input tables, except in the very specific case where both JOIN operands are ordered optimally.
Virtual tables. When you use a virtual table (like pg_catalog.schemata or information_schema.columns), its contents are populated in memory for the duration of the transaction. The memory cost of these virtual tables is approximately proportional to the complexity of the database schema visible to the user issuing the query.

We have some plans to support disk-based temporary storage for operations involving large tables, however regardless of further work on that front, CockroachDB will continue to support in-memory row storage so as to keep responding quickly to common queries.

Beyond SQL-level row storage, there are two additional per-session indirect costs which can grow arbitrarily large depending on which SQL code is executed:

KV batches. When a statement like INSERT or UPDATE is issued, the SQL layer constructs an object in memory called a “batch” where KV operations on the individual updated/inserted rows and index entries are queued, for later processing when the statement completes.
Write intents. When a statement like INSERT or UPDATE completes, or when creating or modifying the database schema, markers called “write intents” are written to the database next to the new values. This is used by the MVCC transaction isolation so that transactions in other clients (sessions) can avoid reading data from non-committed transactions. These write intents are subsequently cleaned up when the transaction complete, but the cleanup process needs to know which intents have been created. The list of write intents for a transaction is kept in the node’s memory until clean-up occurs.

Predictable SQL memory growth

At this point we would like to bring your attention to the following additional observations:

Each of the non-pgwire channels always delivers the same SQL queries and uses only simple parameters. We have tuned these queries so that the amount of memory they require stays small: they are written to use a small, fixed maximum number of rows in memory for intermediate SQL computations, and keep very small batches and few intents per transaction.
During normal database operation, when the number of external client connections is large enough, we expect the number of non-pgwire sessions to be comparatively small.

Therefore, we would like to posit that the SQL-related memory usage incurred by non-pgwire channels is comparatively negligible. You can perhaps invalidate this postulate by connecting many web browsers to the admin UI of a single node, but this is unlikely to happen, or even be allowed, in production databases.

To enable operators to inspect SQL memory growth, we have recently added new metrics inside each CockroachDB node, which can also be viewed in the admin UI. These metrics separately track the approximate amount of memory used by pgwire clients, the admin RPC and “internal” sessions inside each node. You can now use these metrics to dimension system memory capacity as your database and application grow.

Unpredictable SQL memory growth: when a node could blow up

If you followed the previous analysis closely, you may have noticed that the amount of memory that can be allocated on behalf of pgwire-originating sessions is virtually unbounded!

There are many SQL queries, even seemingly simple ones, that a client can send which can cause a very large amount of memory to be allocated by the node processing the query. So much memory can be allocated, in fact, that it’s possible to exceed the total system capacity and cause the operating system to terminate the server process abnormally.

Here are a few examples:

SELECT * FROM sometable when sometable contains many rows. This can blow up because the result set will be buffered in server memory.
SELECT * FROM sometable ORDER BY somecolumn LIMIT 1. Although LIMIT restricts the size of the result set, this can blow up if somecolumn is not indexed and the table contains many rows.
SELECT * FROM sometable, someothertable LIMIT 1. This syntax defines a cross join between sometable and someothertable. In the current implementation, sometable will be loaded first entirely in memory, and this can again blow up if it contains many rows.
SELECT COUNT(DISTINCT *) FROM sometable. This requires the server to keep a copy of the unique rows in sometable, unless there is a unique index covering all columns in sometable`.
UPDATE sometable SET x = x + 1. This will lay as many write intents as there are rows in sometable, and keep track of all intents in the node’s memory until the enclosing transaction commits.

When we started working on a SQL interface for CockroachDB, we initially thought the risk of memory blow-up could be sidestepped merely by thoroughly documenting how CockroachDB uses memory, including e.g. this blog post and appropriate documentation yet to be written.

However, as time went by we also (re-)discovered an old truth: humans are fallible and developers make mistakes. When trying out SQL statements, whether as a CockroachDB or as an app developer, it is easy to forget about or misjudge the incurred memory costs and issue a query that will allocate a lot of memory in the node. This just happens!

Seeing a node crash because of an accidental user mistake is frustrating. Therefore, in addition to appropriate documentation and monitoring in the admin UI, we now also provide a mechanism that provides a modicum of protection against accidental blow-ups.

The total amount of SQL-related allocations performed by pgwire sessions (ie not by sessions coming via other channels) is now limited by a global setting --max-sql-memory, set to 25% of total system memory by default. This includes both session-wide (prepared statements, result sets) and transaction-specific (in-memory row storage) allocations. When a session causes allocations to reach the global maximum, it now fails with a client-side error, preventing the server’s effective memory usage from growing beyond the limit at which the operating system would kill the process.

Caveats

The analysis up to this point makes a few assumptions that are design goals for CockroachDB, but which are currently neither tested nor enforced:

That the bulk of memory usage is indeed incurred by the RocksDB cache, store-related and SQL-related allocations; and that allocations for node peer metadata are comparatively negligible.
That SQL activity caused by internal channels overall has a negligible memory footprint.
That the admin UI, when loaded from the same node with a small number of web browsers, has a negligible memory footprint on the node.
That the tracking mechanism we have implemented for SQL memory is inherently imprecise. Since Go is a managed language, there are more bytes allocated in memory for every byte usable and visible from CockroachDB’s code. It is hard to pinpoint the difference exactly. We are assuming so far it is a constant factor off from the tracked memory allocations, but we have not fully tested nor validated this assumption.

We pledge to increase visibility on these overheads over time, however until then please contact us whenever you suspect that some significant memory usage has been left unaccounted for.

In particular, because of the last caveat in the list above, do not set --max-sql-memory too high. If you bring it so high that, in combination with the other items discussed in this blog post, it reaches the total memory available to the CockroachDB process, you run the risk that the Go runtime’s invisible allocations are not accounted for and will cause the process to tip over the limit and crash (or become very slow by going to disk swap). If you wish to tune this setting, try different values and monitor in the admin UI how it influences your effective total memory usage under client load. You should probably avoid increasing the value further as soon the total memory usage under load grows beyond 80% of overall capacity available to the process.

Also, at the time of this writing, pgwire-related memory usage is not fully tracked: KV batches and write intents are not accounted for yet. We plan to address this, too.

Finally, we acknowledge that the information in this blog post is more qualitative than quantitative. We have introduced formulas for which some parameters are still largely unmeasured. It may be hard for the reader to step back and create a concrete picture in her mind: which is the largest memory consumer in practice? Store metadata at rest or in-flight store activity? Or SQL activity? To provide this intuition, we intend to run and publish some additional measurements in a later blog post; of course we welcome external validations of this as well.

Summary: CockroachDB’s memory story

This is our memory story: a CockroachDB node mainly consumes memory for the RocksDB cache, replica metadata, in-flight storage operations and data structures needed for SQL processing. Most of this memory usage is relatively stable and/or predictable, following general principles that we have outlined above. We aim at improving our admin UI as the main interface available to operators to inspect, detail, and monitor memory usage.

Additionally, to guard against unpredictable memory usage, which should be caused mostly by accident in external SQL client applications, CockroachDB now implements a simple but effective node-wide soft limit: runaway SQL sessions now simply fail with a client error before they incur a memory blow-up in the node.