Learn about how migrating to distributed SQL can unlock next-level performance for your relational databases.Let's go
A poorly-optimized database can drag your entire application down with it.
Luckily, when it comes to relational databases, improving performance might be more straightforward than you think. While there are a variety of factors that can impact SQL database performance and we can’t cover all of them, in this article we’ll look at some of the most common and most impactful ways to optimize the performance of your relational database.
While there are a variety of database configuration settings and options that can impact performance depending on the specific RDBMS you’re using, a lot of optimizing relational database performance comes down to optimizing queries.
The first step towards optimizing queries is determining which queries are the most computationally expensive and figuring out why.
The way you identify your most expensive queries will depend on your specific RDBMS. Most relational databases have statistical tables you can query to find out things like how often a particular query is run, and how much time it takes to execute. More advanced databases may have web UI elements or advanced flagging features that will bring problematic queries to your attention, and there are also a plethora of database monitoring tools that can also help you find out which queries have the biggest impact on performance,
Once you’ve identified the query or queries you’d like to improve, though, the
EXPLAIN ANALYZE SQL command will help you figure out what the query is actually doing and find ways to improve it.
EXPLAIN ANALYZE, followed by the query you want to analyze, will execute the query and then output data about how the query was executed.
For example, imagine we have identified a
SELECT query that runs on a
messages table of email messages as a candidate for improvement. We would run
EXPLAIN ANALYZE against our query like so:
EXPLAIN ANALYZE SELECT title, body, sender, datetime FROM messages WHERE email = 'firstname.lastname@example.org' ORDER BY datetime DESC LIMIT 10 ;
Running that command will output something like this:
info - - - planning time: 683µs execution time: 3s distribution: local vectorized: true rows read from KV: 1,000,000 (226 MiB) cumulative time spent in KV: 3s maximum memory usage: 10 MiB network usage: 0 B (0 messages) • top-k │ nodes: n1 │ actual row count: 10 │ estimated max memory allocated: 10 KiB │ estimated max sql temp disk usage: 0 B │ estimated row count: 2 │ order: -datetime │ k: 10 │ └── • filter │ nodes: n1 │ actual row count: 1,000 │ estimated row count: 2 │ filter: email = 'email@example.com' │ └── • scan nodes: n1 actual row count: 1,000,000 KV time: 3s KV contention time: 0µs KV rows read: 1,000,000 KV bytes read: 226 MiB estimated max memory allocated: 10 MiB estimated row count: 1,000,000 (100% of the table; stats collected 6 seconds ago) table: messages@primary spans: FULL SCAN (35 rows) Time: 3.106s total (execution 2.991s / network 0.115s)
This output should be read from bottom to top, and although it starts with some information about execution time, we should ignore that – that number can vary based on factors we can’t control, such as how many of the results are already in memory.
There’s a lot of data here, but at a glance we can already see a few important things:
FULL SCAN, meaning that our query is looking at every single row in the table. Full table scans are bad for performance, so getting rid of them (we’ll talk about how later) is another way to optimize a query.
estimated row count). Anytime we can reduce the number of rows scanned in a step, we’re making the query more efficient.
We can see that this query is a great candidate for optimization! Conversely, if we had gotten back a single-step query plan that only spanned ten rows, we’d know the query was as efficient as it possibly can be, and we should move on to looking at other queries and/or investigating other approaches to improving performance.
When we do find a problematic query, how can we improve it? The answer will depend on a variety of factors, but here are some best practices to help make queries more efficient:
Indexes are critical for SQL performance – so critical, in fact, that we’ve given them their own section later in this article. Here, we’ll just say that the query from the previous section – the one that was spanning a million rows – was doing that against a table that did not have an index. If we create an index that covers this query entirely, like so:
CREATE INDEX email_datetime_storing_title_body_sender_idx ON messages (email, datetime) STORING (title, body, sender);
…the result will be a single-step query plan that only reads ten rows.
Now, creating a unique index to cover every query that runs against every table in your database is impractical, and that would likely hurt performance – indexes do have trade-offs, as we’ll discuss in the index section. But for your most expensive queries – often, the ones that run most often – creating indexes can be a very powerful way to optimize query performance without actually rewriting the query at all.
SELECT * is faster to type, but it can be a lot slower to execute than a more specific query, especially if your tables are large.
In general, it’s best practice to only query the fields you actually need.
SELECT DISTINCTon large tables
SELECT DISTINCT will return only the unique entries within your query, but it is less efficient than using
WHERE to do the same thing.
On smaller tables, this may not matter much. The bigger your table gets, the greater the impact
SELECT DISTINCT’s inefficiency will have on performance.
INNER JOINwhere possible, try to avoid full outer joins
While the specifics may vary somewhat based on your RDBMS of choice and the data you’re working with, generally speaking an
INNER JOIN will be processed more efficiently than an
FULL OUTER JOINS are the least efficient.
When dealing with subqueries – queries within other queries – try to keep the number of results they return small. Again, the specifics will vary depending on your RDBMS and the specifics of your query, but the subquery results may be loaded into memory in their entirety, which can lead to memory (and by extension, performance) issues if the subquery returns a large number of results.
When we use
UNION, it returns only unique entries, but because it has to check for duplicates, it is less efficient than
UNION ALL (which just returns all relevant entries, including duplicates).
As mentioned earlier, indexes are critical to optimizing SQL performance.
This is because they function similarly to how indexes function in books: if the book has an index and you’re looking for instances of a particular term, you can find your term quickly in the index (it’s alphabetical), and it will tell you exactly which pages it occurs on. You might only have to read a couple of different pages to find every instance of the term you’re looking for. If there isn’t an index, on the other hand, to be sure you’d found every instance of your term, you’d have to read every single page in the book.
The theory behind SQL indexes is the same. When the database is executing a query, it uses the relevant index like a pointer, enabling it to read only the relevant rows of data to find the information you’re searching for, instead of having to read the entire table.
However, most indexes must be created manually. We decide what indexes exist, and what fields of our table they will index. This is where the opportunity for optimizing performance lives.
No two databases are the same, but here are some things to keep in mind when creating indexes:
EXPLAIN ANALYZEto find queries with multi-step execution plans and try to reduce them by creating indexes.
With that said, it’s important to keep in mind that creating indexes isn’t a catch-all solution, and too many indexes can actually hurt database performance. Every time the table is updated, all of the relevant indexes must also be updated, so while indexes will speed up reads, they will have a negative impact on writes.
The right balance will vary based on the specifics of your workload, but in general it’s a good idea to cover your most commonly-run queries with indexes, but not go overboard and try to ensure that all queries are fully covered.
And because indexes do add a computational cost to write operations, you should regularly check for and remove unused (or only infrequently used) indexes.
For more details, we have a whole article on SQL indexes that walks through some examples.
While optimizing your database schema isn’t likely to have the kind of impact on performance that optimizing queries and indexes will, it is still a factor that can affect database performance.
For performance, it’s best to store the smallest amount of information you can while still meeting your needs. For some columns, you may have multiple potential options for data types, and when that’s the case you should opt for the data type that captures the information you need with the smallest possible footprint.
Different database management systems have different data type options; we’ll use an example from Microsoft SQL Server to illustrate. SQL server can store integers in four different data types:
tinyint. Let’s compare just two of these:
The larger option,
bigint, can be used to store numerical values anywhere between (roughly) negative nine quintillion and nine quintillion. But because of that range,
bigint also uses 8 bytes of storage. In contrast,
tinyint can only store values between zero and 255, but it only takes 1 byte of storage.
So, if you were using SQL server, and you had a table with a column that required an integer data type, and the integer values will only ever fall between zero and 255, you would want to opt for the
tinyint data type.
That’s just one example from one RDBMS, but broadly, it’s a good idea to understand the various data types supported by your RDBMS, and choose the types that require the smallest possible amount of storage for your use case.
While it won’t impact day-to-day performance, schema changes in many RDBMS can have a huge negative impact on the user experience when you’re actually making them, because they require tables to be locked – effectively taking those tables offline – while they execute.
This is not true of all relational databases. CockroachDB, for example, allows for online schema changes that keep the database available and functional as they are executed. But that is the exception rather than the norm. If you’re not using a database that supports live schema changes, you’ll have to settle for scheduled schema changes and try to schedule them when they’ll have the smallest possible impact on your user base.
There are a huge variety of other factors that can impact SQL database performance – far more than we could hope to cover in a single article. But here are a few other things to look out for as you work to optimize your database for performance.
While various relational database management systems are similar in many ways, each also has its own unique factors and settings. This means that the best performance choice for one RDBMS may not be the right choice for another.
For example, consider a table with an
id column. When you’re using a legacy RDBMS, it would not be uncommon to use a monotonically increasing integer for the id – the first row entered into the table is assigned
1, row two is
2, etc. From a performance perspective, this is fine (although there are other reasons why it might not be a great idea).
CockroachDB, however, is a distributed SQL database that spreads your data across multiple machines (nodes), any of which can process reads and writes. That can come with significant performance advantages over a traditional single-instance deployment. Using sequential IDs is an anti-pattern that nullifies some of those advantages.
These are not factors that you would need to care about if you’re using single-instance Postgres, for example, because that’s not a distributed SQL database. There’s no way to get the performance advantages that come with a multi-active distributed system anyway, so you’re not giving any performance up by using sequential IDs. But with CockroachDB, using sequential IDs effectively speed-limits your database and throws away some of those distributed database performance advantages.
In other words: it’s important to know the specific best practices associated with your RDBMS of choice, as the most efficient approach for one specific RDBMS might not be the ideal approach for another, even if one is compatible with the other (as CockroachDB is with Postgres).
In some cases, the specifics of your deployment may also have performance implications, or cause you to evaluate design approaches differently.
CockroachDB, for example, supports multi-region deployments and regional-by-row tables, allowing you to assign specific rows to specific cloud regions. Locating data closest to the users who access it most will increase performance by reducing latency, so where you locate your data can matter if your database is deployed across multiple regions, and you’ll want to design your tables and decide which tables are regional-by-row accordingly.
However, this is obviously not a concern if you’ve deployed your database in just a single region.
Broadly speaking, when optimizing a SQL database for performance, it’s important to consider factors including:
These aren’t the only factors that can impact relational database performance, of course, but if we can check all of the above boxes, we should be able to get quite close to optimal performance in many cases.
Indexes are a crucial part of your database schema. They improve workload performance by helping the database locate …Read more
NOT NULL constraints are a ubiquitous tool for upholding data integrity. However, their performance benefits are less …