Data Localization Compliance Strategy

For all the sound and fury GDPR generated last May, it’s really just the amuse-bouche for the data localization regulations to come. Cockroach Labs CEO Spencer Kimball details the reasons why regulations will proliferate like rabbits in the future. But it’s hard for database architects and administrators to plan for compliance against a hazy future. What can you do today to get compliant for tomorrow’s data localization legislation?

Many companies are responding to this challenge by simply abandoning regions in which they either did or hoped to do business. 74% of the Fortune 2000 companies included in a 2017 Accenture report said that they would exit, delay, or abandon entry into markets because of data localization regulations. The task of getting compliant is so difficult, they’re willing to leave money on the table. But that doesn’t have to be the case. There are a number of different compliance solutions, depending on what your database looks like now, and what changes you’re willing to make.

This blog will evaluate options for pursuing data localization compliance and provide useful guidance applicable to companies of every size, from every industry.

What is data localization?

There are a number of facets of data protection laws. Today, we’re looking at data localization specifically.

Data localization (often called ‘data sovereignty’ or ‘data residency’) refers to the laws that regulate the collection, storage, and transfer of user data. In essence, the existing data localization regulations (and those to come) require that data remain in specific locations.

Even GDPR touches upon data localization, albeit a bit indirectly. GDPR restricts the transfer of personal data outside of the European Union unless the destination country has been deemed to have data protection “adequacy”. Interestingly, the list of adequate countries does not include the United States. While companies are allowed to transfer data to countries outside that list, they can only do so with their customers’ explicit, informed consent.
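
To make that rule concrete, here is a minimal Python sketch of the transfer check as this post describes it. The function name and both country sets are assumptions made up for illustration (the sets are deliberately tiny placeholders, not the official EU membership or adequacy lists), and the consent rule is the simplified version stated above rather than legal advice.

```python
# Hypothetical sketch of the simplified transfer rule described in this post.
# Both sets are illustrative placeholders, not authoritative lists.
EU_MEMBERS = {"DE", "FR", "IE", "NL"}
ADEQUATE_DESTINATIONS = {"CH", "JP", "NZ"}


def may_transfer(destination: str, has_explicit_consent: bool) -> bool:
    """Return True if personal data may be sent to `destination`."""
    if destination in EU_MEMBERS:
        return True  # the data never leaves the EU
    if destination in ADEQUATE_DESTINATIONS:
        return True  # destination has been deemed "adequate"
    # Anywhere else, the simplified rule requires explicit, informed consent.
    return has_explicit_consent
```

Under these assumptions, may_transfer("US", False) is refused while may_transfer("US", True) is allowed, which is exactly the “I agree” button path that stricter laws are likely to close.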

But this is just the beginning of data domiciling legislation, and it’s likely to become much stricter, with boundaries drawn that can’t be crossed by a user clicking an “I agree” button on a pop-up. This draconian future is where things get tricky. How do you prepare for that kind of compliance, and where will you house data when that future arrives?

3 Common Data Localization Compliance Options

There are many different species of database architecture, each with its own nuanced capabilities and limitations. In this post, we’ll focus on two general database architectures: on-premise monolithic databases and distributed cloud vendor databases. The intention is to show what the path to compliance looks like from each of those starting points.

Option 1: Scale your monolithic database

Even the word ‘monolithic’ sounds like something from prehistoric times. The actual origin of a monolith is a ‘column formed from a single block of stone.’ It is tall, strong, and powerful. Monolithic databases were originally built in the early 1970s and they’re still valuable today, but because they require scaling up just one master node, they are difficult to scale horizontally.

If you’re reading this blog then you already know the time, cost, and complexity of replicating monolithic databases. If your business is operating on a global scale, the cost of all these replications is massive. The labor of manually sharding these databases is time-consuming, hurts morale, and prevents engineers from working on other important elements of your apps.

If you have deep pockets and the cultural cachet to attract (and replace) top engineering talent, then scaling a monolithic database is absolutely an option. It’s just not an inexpensive endeavor. If you have one database in the U.S. right now, in order to scale horizontally you’d need to replicate that structure in every country that has different regulations than your current location. Keep in mind that having a monolithic database and a backup in every region requires managing each database separately. This means more administration costs.

The shortcomings of monolithic databases have become liabilities, and somewhere Charles Darwin is licking his chops.

Option 2: Get compliant on a cloud database

There are wonderful benefits to moving your database into the cloud, and if you aren’t already there you’re likely on your way. The cloud is less expensive, requires no maintenance, and offers economies of scale, among many other benefits, all without any long-term contracts. Bountiful fruit to be sure. But there are a couple of concerning caveats to keep in mind before you leave your on-premise data centers to hop in bed with a cloud vendor database.

The first issue is monogamy. Otherwise known as ‘cloud lock-in.’ If you use Amazon’s database product, for example, you have to use Amazon’s cloud. The same holds true for Microsoft, Google, and Oracle. This limits your flexibility, particularly if you were dreaming of one globally distributed database with some data in the cloud and sensitive data remaining on-prem.

Being locked into one cloud also means that you’re locked into their coverage. Google is surprisingly less omnipresent than you would imagine. At the time of writing this blog, Azure actually covers the most earth with 54 regions worldwide. Before you hitch your wagon to one of these cloud providers make sure that they have a presence in all the regions that you plan to do business. Here is a chart that captures the current coverage of major cloud providers in countries with high GDP.

[Chart: current coverage of major cloud providers in countries with high GDP]
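
If you want to run this check against your own expansion plans, a few lines of Python will do. The sketch below is only an illustration: the provider names and region sets are hypothetical placeholders, so swap in each provider's published region list before drawing any conclusions.

```python
# Hypothetical region data for illustration only. Replace it with each
# provider's current, published region list before making a real decision.
PROVIDER_REGIONS = {
    "provider_a": {"US", "DE", "JP", "BR"},
    "provider_b": {"US", "DE", "IN"},
}


def coverage_gaps(target_markets, provider_regions=PROVIDER_REGIONS):
    """Map each provider to the sorted target markets it does not cover."""
    targets = set(target_markets)
    return {
        provider: sorted(targets - regions)
        for provider, regions in provider_regions.items()
    }
```

Calling coverage_gaps(["US", "IN", "BR"]) with the placeholder data reports one missing market for each provider. If every provider comes back with a gap for the markets you care about, that is the lock-in problem in miniature.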

The other important caveat is that the distributed database offerings of Amazon (Aurora), Microsoft (Cosmos), and Google (Spanner) do not offer geo-partitioning of data. They partition data and replicate it without giving you the control to domicile specific data in specific regions at the data layer. This is a problem for achieving compliance because the data is not technically domiciled exclusively in one country or region. What you’d have to do in these circumstances is put a set of nodes in one country and make sure they don’t talk to other nodes, which is the same configuration as a monolithic database. Or you can set up a separate cloud in each country with domiciling requirements, which means that you’ve given up on having one global database. These database products are simply not tailored for compliance with domiciling requirements.

Option 3: Play Possum

Depending on your business needs and your operational bandwidth, it’s not totally absurd to look around at the enforcement of data protection laws, shrug your shoulders, and decide to do nothing. This is a defensible short-term strategy.

The 57 million dollar fine that France bestowed upon Google did not exactly strike fear into the hearts of businesses worldwide. Obviously, nobody wants to be fined 57 million dollars. But for Google, this number is more mosquito bite than shark bite. It would have been much more impactful if the enforcement had targeted a Fortune 2000 company or a European company (or basically any company other than Google, Facebook, or Apple). It also would have really blown everyone’s hair back if they enforced the max fine which, in this case, would have been up to 4% of Google’s global revenue.

Where the possum routine unravels is in the inevitably more complex future of data localization laws. GDPR currently allows for relatively easy short-cuts to compliance. That’s likely to change. The United States will either form some kind of coherent regulations, or individual states (ahem…California) will continue to construct siloed requirements (turning the U.S. into a field of data regulation land mines). China is a huge financial opportunity, but it is forcing companies to abandon ship if they can’t domicile data there, and its regulators are just getting started.

The point is that if you play possum now, you’ll make yourself vulnerable in the future. And it might be hard to catch up. While you’re trying to put together a compliant database strategy, your competitors will eat into your market share.

Option 4: Move to a cloud-neutral database

Lucky for everyone trying to run applications in more than one geographic location, there are solutions in place that don’t require costly replications of monolithic stacks or frequent trips to the laptop for manual sharding.

There are two specific database boxes that you should be able to check while building your strategy for achieving compliance: Hybrid/Multi-Cloud Freedom and Geo-Partitioning.

The database you choose should offer hybrid and multi-cloud architecture.

There is more to this concern than just the claustrophobic sensation of operational lock-in. If you choose to go with a cloud vendor database, you become subject to whatever pricing changes occur in their corresponding clouds, as well as the decisions those clouds make about where to offer coverage. Not to mention performance, which can vary significantly from one cloud to another.

In our 2018 Cloud Report, for example, we discovered that AWS outperformed GCP on a number of benchmarked tests including throughput and latency. The clouds are competitive. The competition drives innovation. If you’re not locked into one particular cloud then you can take advantage of the improvements made by others.

Aside from start-ups, most companies will have some quantity of sensitive data that they can’t move to the public cloud, in which case they should have a hybrid solution. Instead of two disparate databases to house some data on-prem and some data in the cloud, you should have one distributed database that allows you to move data from on-prem to the cloud and vice-versa. This would also allow you to tie certain data to a particular cloud and other data elsewhere.
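
One way to picture that hybrid arrangement is as a placement policy: each class of data carries an ordered list of places it is allowed to live. The Python sketch below is a toy model under our own assumptions. The classification labels, location names, and function are hypothetical and do not correspond to any database's real configuration.

```python
# Hypothetical placement policy: data classification -> allowed locations,
# listed in order of preference. All names are made up for illustration.
PLACEMENT_POLICY = {
    "sensitive": ["on_prem"],
    "general": ["cloud_a", "cloud_b", "on_prem"],
}


def choose_location(classification, available, policy=PLACEMENT_POLICY):
    """Pick the first allowed location that is currently available."""
    for location in policy[classification]:
        if location in available:
            return location
    raise LookupError(f"no allowed location available for {classification!r} data")
```

With this policy, sensitive data only ever lands on-prem, while general data can fall back from one cloud to another if the first choice disappears or gets too expensive.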

Why not give yourself an all-you-can-eat buffet of clouds and data centers to de-risk your investments and ensure that you’ll have the flexibility to achieve compliance no matter what regulations get cooked up in the years to come?

The database you choose should offer geo-partitioning.

First, a clarification: ‘Partitioning’ and ‘Geo-Partitioning’ are two different capabilities. All the cloud databases mentioned in this blog are able to partition data. That just means they separate the data to keep it replicated and balanced across nodes. None of them are able to geo-partition data.

We’ve written extensively about geo-partitioning in the past so if you want to jump into the deep end you can do so here, and here, and here (that last link is a video worth watching). In essence, geo-partitioning allows you to control data locality at the data layer, as opposed to requiring manual schema changes and application logic.

With geo-partitioning, developers now have row-level replication control. This will allow for your application to automatically maintain compliance with data domiciling requirements. Also, by virtue of keeping data closer to the user, you reduce the travel time of data, which lowers latency.
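
To show the idea of row-level control, here is a small Python model of it. This is a sketch under stated assumptions, not CockroachDB's syntax or implementation: the region names, the country-to-region mapping, and both functions are hypothetical, and in a real geo-partitioned database the placement is handled by the database itself rather than by application code like this.

```python
from collections import defaultdict

# Hypothetical mapping from a row's country to the region allowed to hold it.
COUNTRY_TO_REGION = {
    "DE": "eu-west",
    "FR": "eu-west",
    "US": "us-east",
    "CN": "cn-north",
}


def geo_partition(rows, country_to_region=COUNTRY_TO_REGION):
    """Group rows by the region their `country` column pins them to."""
    placement = defaultdict(list)
    for row in rows:
        region = country_to_region[row["country"]]  # KeyError if no rule exists
        placement[region].append(row)
    return dict(placement)


def violations(placement, country_to_region=COUNTRY_TO_REGION):
    """Return rows stored in a region other than the one their country requires."""
    return [
        row
        for region, rows in placement.items()
        for row in rows
        if country_to_region[row["country"]] != region
    ]
```

The second function is the compliance question in code form: an auditor wants to know whether any row lives somewhere it shouldn't. When placement is driven by a column on the row, the answer stays “none” without the application having to think about it.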

Walk yourself through this simulation to see what geo-partitioning looks like with CockroachDB.

At the time of writing this blog there is not another distributed SQL database on the planet that offers geo-partitioning. This is, in large part, because CockroachDB is the only database that was architected from the ground up with this use case in mind. Our expectation is that your database will become a platform for driving compute optimization. All data will have time and location, not just for latency and compliance, but for optimization of compute.

The other database options mentioned in this article were built in different eras. Their architecture is not meant to handle the automated sharding and row-level data control that CockroachDB offers. And that’s okay. For some companies that will work perfectly fine. But if your company has a global presence or multi-national aspirations you need a database built to handle modern end-user expectations and future data localization regulations.
