We're building next-gen UDFs from the ground up.Learn more
When people think about improving application performance, they’re often thinking of cleaning up the code in the application layer. However, the database layer can also be optimized to improve application performance, and so can the ways in which these two layers – application and database – interact with each other.
That is where user-defined functions (UDFs) in SQL come in. If you’re looking to speed up your application, UDFs might be a way for you to make that happen.
In the context of a SQL database, a UDF is a function that is written by the user and executed by the database software.
SQL UDFs accept inputs, perform actions using those inputs, and return the output. The exact syntax for doing this can vary based on your specific database software, but SQL UDFs typically have at least four elements:
Here’s a basic example:
CREATE FUNCTION add(a INT, b INT) RETURNS INT AS 'SELECT a + b';
In this example:
CREATE FUNCTION add()creates a function and names it
a INT, b INTdefines the arguments (i.e., the two inputs the function takes, which are integers, in this case).
RETURNS INTspecifies that the function should return an integer.
AS 'SELECT a + b'defines the actual code the function runs (in this case, adding the two inputs together).
We could invoke this function in SQL as follows:
This would return the output
(Needless to say, this is a rather complicated way to do something as simple as add two integers, but it’s just an example. The beauty of UDFs is the user-defined part: your SQL UDFs will do whatever you want, based on the code you write in the function body.)
If you’re familiar with the concept of functions in other programming languages, this all probably looks familiar. SQL UDFs work just like you’d expect functions to in any language.
What’s interesting about UDFs in SQL isn’t what they can do, it’s where they can do it. The fact that they’re executed in the database is what makes them useful. Let’s talk about why.
The architecture for almost any application will include an application layer, where all of the business logic of the application lives, and a database layer, where data relevant to the application is permanently stored.
Data moves between these two layers regularly, but that movement is not free. Sending data back and forth between these layers constantly will increase the latency of your applications and, depending on how your application is deployed, it could increase your costs, too.
To demonstrate why, let’s imagine we want to add two values that are already in our database together. Needless to say, this is a simplified example — creating a SQL UDF isn’t the most efficient way to simply add two integers — but the table below should illustrate the efficiency that SQL UDFs can bring to an application:
|Using a SQL UDF||Using a function in the application|
As we can see, using an in-application function requires an extra step. It is less efficient.
This is not to say that all functions should be executed in the database! Determining the most efficient approach for any particular process will be case-dependent. In some cases, like the simple one illustrated above, using UDFs allows developers to make their applications more efficient by cutting down on the number of trips data must take between the application and database layers.
While UDFs can improve application performance by preventing round-trips between the database and the application, in some contexts they can also serve as a performance bottleneck.
In many distributed database configurations, for example, adding UDFs can harm performance because they can only run on a single active node.
Some distributed databases, such as CockroachDB, have eliminated this problem by making the execution of UDFs distributed as well – in CockroachDB, any node can execute a UDF in the same way that any node can execute a read or write query.
The process for creating UDFs, and even the language you’re writing the functions in, varies based on the specific database technology you’re using. A full tutorial for every database technology is beyond the scope of this article, but in general every SQL UDF will include at least the four elements we discussed earlier:
Let’s take a look at how our
a + b example UDF could be created in a few specific types of databases.
CREATE FUNCTION add(a INT, b INT) RETURNS INT AS 'SELECT a+b' LANGUAGE SQL;
DELIMITER $$ CREATE FUNCTION add ( a INT, b INT ) RETURNS INT DETERMINISTIC BEGIN DECLARE added_together INT; SET added_together = a + b; RETURN added_together; END$$ DELIMITER ;
CREATE FUNCTION add(a INT, b INT) RETURNS INT LANGUAGE SQL AS 'SELECT a + b';
You may notice that the CockroachDB UDF is identical to the Postgres one. That makes sense, since CockroachDB is Postgres wire compatible, and the syntax is very similar.
There is a difference, though: CockroachDB is a distributed database, and UDFs in CockroachDB are distributed as well, which makes them less of a performance bottleneck than they can be in traditional SQL databases and distributed databases that don’t support distributed UDFs. This makes CockroachDB a good choice for developers who like the efficiency offered by UDFs but who don’t want to sacrifice the high availability and scalability advantages inherent in a distributed database.
Learn more about UDFs and distributed functions in CockroachDB in our documentation.
Up to this point, we’ve been using a very simple
a+b function to demonstrate how SQL UDFs work. In practice, however, UDFs are capable of much, much more.
For example, here’s a real-world UDF written by CockroachDB user Saqib Ali. It takes timestamp data and converts it into a more human-friendly, x hours ago format. (This function is available on Github here if you’d like to contribute to it).
CREATE OR REPLACE FUNCTION humanize_time_span(ts timestamp) RETURNS varchar AS ' SELECT CASE WHEN (now() - ts::timestamp)::INT < 60 THEN '' Just Posted'' WHEN ((now() - ts::timestamp)::INT) / 60 < 60 THEN ((now() - ts::TIMESTAMPTZ)::INT / 60)::INT || '' Minutes ago'' WHEN (now() - ts::TIMESTAMPTZ)::INT / (60 * 60) < 24 THEN ((now() - ts::TIMESTAMPTZ)::INT / (60 * 60))::INT || '' Hours ago'' ELSE ((now() - ts::TIMESTAMPTZ)::INT / (60 * 60 * 24))::INT || '' Days ago'' END ' LANGUAGE SQL;
Working with UDFs can also get complicated when you’re working with actual data from your database. If you’re not careful, the functions you write can have unintended side effects and/or mutate the data in your tables in ways you did not intend.
For that reason, some databases have additional language you can add to a function that provides information about the function’s volatility — the extent to which it can impact your other data — and determine how the database handles it.
In CockroachDB, for example, we might rewrite our
a + b example function like so:
CREATE FUNCTION add(a INT, b INT) RETURNS INT IMMUTABLE LEAKPROOF LANGUAGE SQL AS 'SELECT a + b';
Note that we have added
IMMUTABLE LEAKPROOF to the statement.
IMMUTABLE means that the function cannot mutate data, and
LEAKPROOF means the function has no side effects (nothing depends on its arguments besides the return value). These tags are important for telling CockroachDB’s cost-based optimizer how it can expect a function to behave.
Efficiency matters. When you’re working with large amounts of data, it matters a lot.
Every trip between your …Read more
Upserts are useful for anyone who works with a database to know, but the term “upsert” might not even appear in your …Read more