Metadata Reference Architecture: A Quick Guide

Metadata management is a critical part of any business application. Let’s take a quick look at what metadata is, why it’s important, and how you can architect your application to ensure highly available, consistent metadata at scale.

What is metadata?

Put simply, metadata is data about other data.

Consider, for example, a cloud photo storage application. When a user uploads a photo, the image file itself would likely be stored in an object storage database, but the application would also need to store metadata – smaller data about the image – in a metadata database. This metadata would include details such as:

  • The user who uploaded the photo
  • The date the photo was uploaded
  • The size and resolution of the photo
  • People or objects the user tagged in the photo
  • Any user description or caption for the photo
  • The photo’s location in the object storage database

…et cetera. This metadata is highly valuable for businesses because it makes other data easier to find.

For example, having the metadata listed above would make it easy to quickly locate all of a specific user’s photos. Rather than having to search through all of the image files in the object storage database, the application can query the metadata database for all entries with a particular user, and then it will have a list of the locations for each file that specific user has uploaded.
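The lookup described above can be sketched in a few lines of Python. This is only an illustration: it uses an in-memory SQLite database as a stand-in for the metadata database, and the table name, column names, and helper functions (`photo_metadata`, `object_key`, `add_photo`, `photo_locations_for_user`) are hypothetical, not taken from any real application.

```python
import sqlite3


def create_metadata_store() -> sqlite3.Connection:
    """Create an in-memory stand-in for the metadata database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE photo_metadata (
            photo_id    INTEGER PRIMARY KEY,
            uploaded_by TEXT NOT NULL,     -- the user who uploaded the photo
            uploaded_at TEXT NOT NULL,     -- ISO 8601 upload timestamp
            size_bytes  INTEGER NOT NULL,  -- size of the image file
            caption     TEXT,              -- optional user caption
            object_key  TEXT NOT NULL      -- location in the object storage database
        )
        """
    )
    # An index on the user column lets the per-user lookup avoid a full table scan.
    conn.execute("CREATE INDEX idx_photo_user ON photo_metadata (uploaded_by)")
    return conn


def add_photo(conn, uploaded_by, uploaded_at, size_bytes, caption, object_key) -> None:
    """Record metadata for one uploaded photo."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO photo_metadata "
            "(uploaded_by, uploaded_at, size_bytes, caption, object_key) "
            "VALUES (?, ?, ?, ?, ?)",
            (uploaded_by, uploaded_at, size_bytes, caption, object_key),
        )


def photo_locations_for_user(conn, user: str) -> list[str]:
    """Return the object storage locations of every photo a user uploaded, oldest first."""
    rows = conn.execute(
        "SELECT object_key FROM photo_metadata "
        "WHERE uploaded_by = ? ORDER BY uploaded_at",
        (user,),
    ).fetchall()
    return [row[0] for row in rows]
```

The application never touches the image files to answer the question "which photos belong to this user?"; it reads only the small metadata rows and then fetches the files it needs by location.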

Why metadata matters

In many cases, metadata availability is critical to an application’s functionality. For example, our cloud photo storage application would use metadata to facilitate locating, sorting, and filtering photos. If the metadata database went offline, the photos would still exist in the application’s object storage database, but they would become inaccessible to users because the application would lack the metadata necessary to locate specific photos in that database.

Consistency is another major concern that arises when companies are architecting metadata management systems. Metadata is often duplicated across multiple databases – for example, the same metadata might be stored both on a metadata database that serves the application and on a separate database for analytics, logging, audit compliance, etc. Companies must ensure that the data on these two (or more) databases remains consistent; if inconsistency is introduced, it can become very difficult to determine which database is correct (which can, in turn, have serious implications for audits, regulatory compliance, etc.).

Consistency across regions can also be an important consideration for multi-region applications – if a region goes down, the other regions will still need access to correct metadata in order to function properly. Moreover, if the regions are not consistent with each other, disaster recovery becomes very challenging.

Let’s take a look at a simple example of an application that handles metadata without having to worry about problems with availability or consistency.

An example of metadata reference architecture

In the diagram below, we’ve laid out a simple example of how an application with a microservices architecture might integrate CockroachDB as a metadata store. Note that we’ve chosen to focus on a multi-region architecture here because of the inherent advantages that multi-region setups offer in terms of both user latency and (in some cases) regulatory compliance.

Metadata reference architecture diagram

Note that in the image above, only three services and one database are pictured per cluster for the sake of visual clarity. A real application would likely have many more services, and those services would also be sending data to other databases, not just to the metadata database. In a photo storage application, for example, the image files themselves would likely be sent to a different database that is optimized for large object storage.

Requests and data from the front end (which might be a web or mobile application) are sent to a load balancer that distributes them to the appropriate Kubernetes cluster, where they are processed by the application’s microservices.

CockroachDB can be deployed and managed within Kubernetes (rather than just alongside it), and treated like a single-instance Postgres database. But unlike a single-instance Postgres database, CockroachDB is distributed, so even if a database node goes offline, all metadata remains accessible via other nodes. In fact, depending on how it’s configured, CockroachDB can survive availability zone (AZ) and even cloud region outages.

In the architecture above, we’re solving the potential consistency problems inherent in building a metadata store for a multi-region application in two ways.

First, to solve the potential consistency issues that can arise from the dual-write problem, we’re using CockroachDB’s Change Data Capture (CDC) feature to copy metadata to Apache Kafka (or any message queuing system) and then into an analytics database. We could accomplish the same thing on a database that didn’t include CDC using a transactional outbox.
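For readers unfamiliar with the transactional outbox pattern mentioned above, here is a minimal sketch of the idea in Python. It is not CockroachDB’s CDC and not a production implementation: it uses an in-memory SQLite database as a stand-in, and the table names (`photo_metadata`, `outbox`), the function names, and the `publish` callback (which stands in for a message queue producer) are all hypothetical.

```python
import json
import sqlite3


def create_store() -> sqlite3.Connection:
    """Create an in-memory stand-in database with a metadata table and an outbox table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE photo_metadata (
            photo_id    INTEGER PRIMARY KEY,
            uploaded_by TEXT NOT NULL,
            object_key  TEXT NOT NULL
        );
        CREATE TABLE outbox (
            event_id  INTEGER PRIMARY KEY AUTOINCREMENT,
            payload   TEXT NOT NULL,              -- JSON-encoded change event
            published INTEGER NOT NULL DEFAULT 0  -- 0 = pending, 1 = sent
        );
        """
    )
    return conn


def save_photo_metadata(conn, uploaded_by: str, object_key: str) -> int:
    """Write the metadata row and its change event in ONE transaction.

    Because both inserts commit or roll back together, there is no window in
    which the metadata exists without a matching event (the dual-write problem).
    """
    with conn:  # commits on success, rolls back both inserts on error
        cur = conn.execute(
            "INSERT INTO photo_metadata (uploaded_by, object_key) VALUES (?, ?)",
            (uploaded_by, object_key),
        )
        photo_id = cur.lastrowid
        payload = json.dumps(
            {"photo_id": photo_id, "uploaded_by": uploaded_by, "object_key": object_key}
        )
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))
    return photo_id


def relay_outbox(conn, publish) -> int:
    """Send pending events in order, marking each as published after it is sent.

    `publish` is a hypothetical callable standing in for a message queue producer.
    Returns the number of events sent in this pass.
    """
    rows = conn.execute(
        "SELECT event_id, payload FROM outbox WHERE published = 0 ORDER BY event_id"
    ).fetchall()
    for event_id, payload in rows:
        publish(json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE event_id = ?", (event_id,))
    return len(rows)
```

Note that in this sketch an event is marked as published only after `publish` returns, so a crash between the two steps would cause the event to be sent again on the next pass. Downstream consumers of such a relay generally need to tolerate duplicates.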

Second, to solve the potential consistency problems that can arise in multi-region deployments, we’re taking advantage of CockroachDB’s multi-active availability model, which avoids some of the problems inherent in active-passive and active-active configurations and allows for synchronous and performant writes across regions natively.

The choice of CockroachDB here also enables an easy road to multi-region for developers, since multi-region CockroachDB databases can still be treated as a single logical database by the application. This ensures our metadata will be highly available, and also allows for “data homing” down to the row level, which can be helpful for both latency (locating data in the cloud region physically closest to the user) and regulatory compliance.

Of course, real-world metadata architectures can get significantly more complex. When designing the architecture for your own application, it may be helpful to look at public examples such as Netflix’s device management architecture, which uses CockroachDB to store metadata related to all of the different hardware devices with which Netflix apps are compatible.

About the author

Charlie Custer

Charlie is a former teacher, tech journalist, and filmmaker who’s now combined those three professions into writing and making videos about databases and application development (and occasionally messing with NLP and Python to create weird things in his spare time).
