Dealing with distributed database performance issues? Let’s talk CDNs.
Even though they sit at different levels of your tech stack, distributed databases and content delivery networks often share a similar goal: improving the speed and availability of the connection between your service and your users. To deliver content as quickly as possible (at least when it’s static), one of the first tools teams reach for is a content delivery network. CDNs leverage a whole stack of technologies to rapidly deliver resources to users, but one of the more impactful strategies is to simply replicate data all over the globe, so users’ requests never have to travel far. In the parlance of operations teams, this is a “multi-region deployment.”
Despite being geographically distributed, CDN replication is relatively straightforward: they simply distribute a file to more and more servers. Because the data changes infrequently (or never), life is easy.
For distributed databases, though, it’s been another story. Because managing state across a set of machines is a hard problem, distributed databases typically make unattractive trade-offs when stretched far and wide. Businesses have often been unwilling to accept those compromises (and rightfully so), leading them to shy away from multi-region deployments and write them off entirely. But not without paying a price.
While siloing data in a single region makes things easier, the strategy actually undermines the two things you need most from your database: speed and availability.
In 2020, computers are still bound by physics, and cannot outrun the speed of light. Being farther away from a service means it takes longer to communicate with it. This is the fundamental reason we use content delivery networks and stateless services wherever we can. However, even with a razor-thin time-to-first-byte, an application’s user experience can still falter if it has to communicate with a database thousands of miles away.
This problem is compounded by the fact that latencies quickly become cumulative. If your SLAs allow for a 300ms round trip between an app and a database, that’s great––but if the app needs to make multiple requests that cannot be run in parallel, it pays that 300ms latency for each request. Even if that math doesn’t dominate your application’s response times, you should account for customers who aren’t near fiber connections or who live across an ocean: that 300ms could easily be 3000ms and requests could become agonizingly slow.
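The arithmetic above is worth making concrete. Here’s a back-of-the-envelope sketch (with hypothetical request counts) showing how dependent, non-parallelizable queries compound a single round-trip latency:

```python
def total_latency_ms(round_trip_ms: float, sequential_requests: int) -> float:
    """Each dependent request pays the full round trip before the next can start."""
    return round_trip_ms * sequential_requests

# A single 300ms round trip may fit your SLA...
print(total_latency_ms(300, 1))    # 300
# ...but five dependent queries already cost 1.5 seconds,
print(total_latency_ms(300, 5))    # 1500.0 ms
# and a distant user paying 3000ms per round trip waits 15 seconds.
print(total_latency_ms(3000, 5))   # 15000 ms
```

The fix isn’t a faster network; it’s shortening the round trip itself by moving data closer to the application and its users.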
If you need a gentle reminder as to why this matters for your business, Google and Amazon both have oft-cited studies showing the financial implications of latency. If your site or service is slow, people will take their attention and wallets elsewhere.
A simple solution: deploy your data to the regions where your users are.
A distributed database’s performance isn’t measured solely in milliseconds; uptime is also a crucial factor. No matter how fast your service normally is, if it’s down, it’s worthless. To maximize the value of their services, companies and their CTOs chase down the elusive Five Nines of uptime (which implies no more than about 26 seconds of downtime per month).
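To put Five Nines in perspective, here’s a small sketch that converts an uptime percentage into a monthly downtime budget (assuming a 30-day month), confirming the 26-second figure:

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds in a 30-day month

def downtime_budget_seconds(uptime_pct: float) -> float:
    """Seconds per month a service may be down while still meeting its uptime target."""
    return SECONDS_PER_MONTH * (1 - uptime_pct / 100)

print(round(downtime_budget_seconds(99.999), 1))      # Five Nines: ~25.9 seconds/month
print(round(downtime_budget_seconds(99.9) / 60, 1))   # Three Nines: ~43.2 minutes/month
```

A budget of under half a minute per month leaves essentially no room for a regional outage, however brief.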
Achieving Five Nines requires reliable data centers with incredible networks––but what about forces of nature beyond your control? Forrester Research found that 19% of major service disruptions were caused by acts of nature that could take down a cloud host’s entire region: hurricanes, floods, winter storms, and earthquakes.
As Hurricane Sandy proved, these events can be powerful enough to cripple companies deployed in only a single region, including blogosphere titans like BuzzFeed and The Huffington Post. With their sites down, they couldn’t fulfill their mission of delivering content on the world’s latest events, and instead themselves became a collateral story.
Another facet of your application’s availability guarantees is defining its point of recovery (or Recovery Point Objective/RPO) in the case of one of these catastrophes. This is particularly crucial for your customers’ data. If your data is located in only a single region and that region goes down, you are faced with a “non-zero RPO”: you will simply lose all transactions committed after your last backup. If those are mission-critical entries, you risk losing not only revenue, but also your users’ trust.
So, as the weather gets weirder, the best way to ensure your application stays up and doesn’t lose data is to distribute it far and wide. This way, your users’ data is safe even if swaths of the globe go dark.
While latency and uptime make great headlines, there’s an ever-unfolding story that makes single-region deployments largely untenable: data regulations.
General Data Protection Regulation (GDPR), in particular, requires that businesses receive explicit consent from EU users before storing or even processing their data outside the EU. If the user declines? Their data must always (and only) reside within the EU. If you’re caught not complying with GDPR? You’ll face fines of either 4% of annual global turnover or €20 million, whichever is greater.
When you take GDPR in the context of the existing Chinese and Russian data privacy laws––which require you to keep their citizens’ data housed within their countries––there’s a clear signal that single-region deployments no longer satisfy the needs of global businesses.
To comply with increasingly complex regulations, you’re left to choose from some unattractive options:
…or you could consider an option that actually presents upside to your team:
Not everyone faces these concerns with equal dread. Those who can pay these costs (fines, downtime) often do, and avoid the headache of re-architecting an app from the ground up. Small, pre-revenue startups that are still trying to establish a user base (let alone an international one) can sometimes ignore these concerns––though, if your company succeeds, you’ve only put off handling a problem that becomes increasingly costly to solve later. Refactoring an app to use a performant distributed database can be dramatically more expensive than making choices with your company’s future in mind.
For everyone else––businesses of any size who are concerned with the experience of their users across the globe (or even across a single country)––multi-region deployments improve crucial elements of your business.
As we mentioned at the top of this post, there have been many attempts to overcome the obstacles of deploying a database to multiple regions, but most solutions make difficult-to-accept compromises.
Managed and cloud databases often tout their survivability because they run in “multiple zones.” This often leads users to believe that a cloud database that runs in multiple availability zones can also be distributed across the globe. However, this elides an important fact: those zones are all within a single region, and so deliver neither the speed nor the availability benefits of a multi-region deployment.
There are caveats to this, of course. For example, with Amazon RDS, you can create read-only replicas that cross regions, but this risks introducing anomalies because of asynchronous replication––and anomalies can cost millions of dollars in lost revenue or fines if you’re audited. In addition, this forces all writes to travel to the primary copy of your data. This means, for example, you have to choose between violating GDPR and placing your primary replica in the EU, which degrades the experience for every non-EU user.
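The anomaly risk of asynchronous replication is easy to see in a toy model (this is an illustration, not RDS code): the replica only reflects the primary’s writes after the replication log ships, so reads in between return stale data.

```python
class Primary:
    """Accepts writes and queues them for asynchronous shipment to replicas."""
    def __init__(self):
        self.data = {}
        self.log = []  # changes not yet applied on replicas

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

class Replica:
    """Read-only copy that lags behind until the log is applied."""
    def __init__(self):
        self.data = {}

    def apply(self, log):
        for key, value in log:
            self.data[key] = value

primary, eu_replica = Primary(), Replica()
primary.write("balance", 100)

# Before replication catches up, the replica serves a stale (missing) value:
print(eu_replica.data.get("balance"))  # None -> an anomaly for the EU reader

eu_replica.apply(primary.log)          # replication lag eventually closes
print(eu_replica.data.get("balance"))  # 100
```

During the lag window, the two copies disagree; an application that reads the replica and then writes through the primary can make decisions based on data that was never current.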
NoSQL was conceived as a set of principles to build high-performing distributed databases, meaning it could easily take advantage of CDN-like multi-region deployments. However, the technology achieved this by forgoing data integrity. Without consistency, NoSQL databases are a poor choice for mission-critical applications.
For example, NoSQL databases suffer from split-brain during partitions (i.e. availability events), leaving data that is impossible to reconcile. When partitions heal, you might have to make ugly decisions: which version of your customer’s data do you choose to discard? If two partitions received updates, it’s a lose-lose situation.
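A toy illustration (not modeled on any specific NoSQL system) shows why this is lose-lose: with a naive last-write-wins merge, whichever side of the partition wrote earlier has its update silently discarded.

```python
def last_write_wins(a, b):
    """Merge two (timestamp, value) versions of a record by keeping the later one."""
    return a if a[0] >= b[0] else b

# During a partition, both sides accept an update to the same shipping address:
side_a = (1001, "123 Main St")   # written at t=1001 on one side of the split
side_b = (1002, "456 Oak Ave")   # written at t=1002 on the other side

merged = last_write_wins(side_a, side_b)
print(merged)  # (1002, '456 Oak Ave') -- the t=1001 update is lost for good
```

No merge policy can recover the discarded write; the only real fix is a consistency model that prevents conflicting commits in the first place.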
Inconsistent data also jeopardizes an application’s recovery point objective (i.e. the point in time of your last backup). If your database is in a bad state when it’s backed up, you can’t be sure how much data you’ll lose during a restore.
That all being said, if an application can tolerate inconsistent data, you get more of NoSQL’s benefits when it’s distributed across regions and close to your users.
Sharded relational databases come in many shapes and suffer from as many different ailments when deployed across regions: some sacrifice replication and availability for consistency, some do the opposite. Many require complex and fragile configurations, and others require you to tie applications to their enterprise offerings (which may or may not support multi-region deployments). With all of these trade-offs, they pose headaches and risks when geographically distributed.
One option to consider for a multi-region database is CockroachDB, which uniquely meets all of the requirements for a multi-region deployment. CockroachDB lets you deploy to multiple regions while keeping data close to your users, and doesn’t lock you into a specific vendor (or the places they have data centers). It offers strong consistency and lets you control where your data lives with our geo-partitioning feature. If you’re planning a multi-region deployment, be sure to check out our multi-region deployment topology docs.
Don’t want to deploy it yourself? Use our database-as-a-service, Cockroach Cloud.
As we’ve shown above, there are lots of options available when your applications need to provide speed and availability to a global audience. When shopping for a distributed database or evaluating your options for scaling, keep the following characteristics in mind. A truly distributed database needs to:
To stand up your own multi-region CockroachDB deployment, start for free with our SQL API in the cloud.