Best practices for building a pain-free metadata store

What is metadata, exactly?

Metadata is data about other data.

That’s a very broad definition, though. When you start trying to get specific, it can get murky pretty quickly. In our experience with our customers, we see “metadata” commonly used to describe two different sorts of use cases:

Forensic data. This includes things such as log files and machine data — the kind of data you’d want to dig into to analyze a security incident, for example. It’s often time series data.
Control data. For example, this is the data you might have on file describing a particular user, which could include login credentials, billing details, application and usage data, etc.

Different industries and even different companies may define metadata in different ways. But in this article, we’re going to focus on number two: control data use cases.

This type of metadata presents some unique challenges, because while it is “data about other data,” it’s often mission-critical. Data about a user’s login credentials, for example, is required for the application login to function. Data about an item’s location in a warehouse is required for it to be picked and shipped.

Moreover, it tends to be relational. A user’s login credentials and billing information and application usage data are likely all going to be stored in different tables — potentially in different databases — but they all must be connected to each other, too. Designing a schema that’ll work for complex, interrelated data like this can get complicated fast.

So how do you work with this kind of metadata?

Find the right database for your data

The first step is to assess your needs, and find a database that fits well, rather than just a database that you can make fit.

Why? Because while it might be fastest to implement a Postgres database (for example) if that’s what you’re most familiar with, the limitations of that database might end up costing your more time in the long run. Taking the time to assess your needs and search for the right fit up front can ensure you’re not signing up for a ton of manual work – or even worse, a migration to a different database – later down the line.

Sometimes, this search is relatively easy. If you’re trying to store forensic metadata that’s in time-series format, there are dedicated solutions like TimescaleDB ready and waiting.

Often, though, it proves to be more complex. Data security firm Rubrik, for example, started with Cassandra when building the metadata store for their CDM. And at first glance, it appeared to meet the company’s requirements that its database be scalable, fault-tolerant, performant, and easy to deploy.

But the devil is in the details, and when they got it up and running, they discovered some issues. They’re detailed in full here, but the biggest problem was with consistency. Cassandra’s use of tombstones — marking rows for deletion after a fixed period rather than immediately deleting them — could actually lead to deleted rows being “resurrected” and re-added to the database if a node outage happened at the wrong time.

While there were workarounds for this and the other issues Rubrik faced with Cassandra, they ultimately decided to try to find a distributed database that met their needs for scale, fault-tolerance, performance, and ease-of-use and that could also offer stronger consistency guarantees even in the face of node outages. That led them to CockroachDB.

That said, CockroachDB isn’t the right solution for every metadata use case! The needs of an application, and to a lesser extent the wants of you developers, should guide your choice of database for your metadata store.

Design a thoughtful database schema

Once you’ve chosen your database, it’s time to design your schema. Like choosing the right database, taking the time to carefully design your schema costs you time up front, but saves you time in the long run. Making major changes to your schema once your application is in production can be a big pain – we’ll get to that later in this article – and thoughtful schema design can help you avoid that, at least for a while.

Designing schema can be a challenging step, particularly when you’re working with a lot of complex and interrelated data (as you often will be). It requires careful planning and foresight, because having to make significant schema changes after your application is up and running can be painful (we’ll cover that later in this article).

There’s no one-size-fits-all guide to schema design, unfortunately, as your schema will be highly dependent on the specifics of your data. However, we have put together a quick guide to database schema design best practices that should help you get started.

This is also the step where it makes sense to spend time thinking about how your metadata store will interact with the various application services that will be using it. If you’re using a relational database such as CockroachDB, the database itself can enforce your schema, but if you’ve chosen a NoSQL database, some of your schema requirements may have to be implemented in the application itself.

Don’t forget the “human schema”

In addition to designing the database schema, it’s also important to consider what we might call the human schema, determining who will be interacting with what data in what ways, and preparing resources accordingly.

What’s needed here will vary quite a lot depending on the size of your organization, but one task that should almost always be included is creating a data dictionary that defines every column in every table of your schema, ideally in terms even non-technical team members can understand.

Even if a single developer is the only person touching the data right now, if your application takes off, you’ll ultimately have marketers, analysts, product folks, and potentially even executives digging into the data. Having everything thoroughly documented and ensuring the correct access controls are in place will ensure that everyone can access and understand the data they need.

Note that this doesn’t mean giving all of these people access to your production application database! In the long run, you’ll probably want most people from your team to be working from a different database, as you don’t want your application’s performance to be impacted by (for example) queries from the analytics team. Ideally, your production database will have a feature like CockroachDB’s CDC that makes it easy to push data from the application database into Kafka and/or a separate analytics database, maintaining consistency between the two to ensure accuracy for both your application and your analysts.

Schema changes should be additive

In an ideal world, you could design a perfect schema, deploy your application, and never have to change anything.

In reality, I’m pretty sure this has never happened.

Schema changes are a fact of life and they’ll be necessary as your application grows and changes. A good general rule is that these changes should always be additive. You might, for example, add a column to a table to accommodate data being generated by a new feature of your application.

The reason for this is that all of your application services have to be able to understand and work with the data in your database. Adding new data (which can simply be ignored by any application service that doesn’t use it) is preferable to removing data (which will break any application service that used it).

Either way, though, you’ll almost certainly have to make changes to some of your application logic to account for the changes you’ve made to the database schema.

Additive schema changes are less problematic than reductive changes, but they do often mean that you’ll have to come up with an approach for backfilling old rows that don’t have data for the column (for example) you just added.

Ideally, this can happen gradually, as a background process. For example, a photo sharing app that adds an image recognition feature would need to change its metadata schema to store data about the faces in each photo. Photos uploaded and processed before that feature was added would lack that data, so the app developers would need to build a process for gradually scanning all old photos to backfill that data.

The faster this backfilled data is needed, the more challenging a task this becomes.

When migrations arise, ensure consistency

It’s not ideal, but migrations happen. At a certain point, small schema changes may not be able to accommodate all of the changes in your application, and migrations can become necessary. This could mean migrating to an entirely new schema, or even — like in Rubrik’s case — a new database.

Metadata migrations can be highly challenging, as you’ll often need to find a way to establish consistency between your old and new systems until you can fully switch to the newer one, and you’ll also need to update all of the application services that interact with your metadata to ensure they’re ready to work with the new schema once its up and running.

Unfortunately, there’s no magic pill here – what’s required will be highly dependent on the specifics of your application and what you’re migrating to and from. But that’s all the more reason to follow the previous tips in this article! With sufficient planning up front, you can avoid the pain of migration for years to come.