Delivering low latency across regions is a hard problem to solve. Maintaining high availability when an application is spread out all over the world is also hard. Starburst is only a five-year-old company but they’re already handling exabytes of scale across a five-region deployment; keeping read-latencies low everywhere.
In this blog we’ll explore why they made the multi-region investment, how they architected it to hit latency/survivability goals, and why multi-region is worth the cost. You can also watch this video of Starburst’s CTO, David Phillips, explaining their architecture and how they deploy CockroachDB (there are more CockroachDB customer presentations here.):
Starburst is the fastest and most efficient SQL-based MPP query engine for data lake, data warehouse, or data mesh. They offer an enterprise and cloud solution that addresses data silos and speed of access problems for their customers. Starburst has its origins in Trino (formerly known as Presto), the open source query engine that allows users to query data across distributed data sources and aggregate the results.
Customers from all over the world use Starburst to get more value out of distributed data because the data in Starburst is fast and easy to access, no matter where it lives.
Because of their global customer base Starburst decided to take on the challenge of building an application that could span across multiple geographies, achieve massive scale, and ensure 24/7 uptime. This is why they built Starburst Galaxy on CockroachDB.
It’s important to understand that Starburst’s architecture is massively decentralized because their customers are from all over the world and their customers’ data is distributed. This kind of wide-scale global distribution creates a unique challenge: how do you build a centralized product with a distributed architecture?
Since Starburst runs Trino in multiple clouds and in multiple regions, they need their customers to be able to access their data in multiple clouds and regions. This requires elegant answers to two questions:
They also needed:
Given all these requirements, Starbursts’ options were limited. They didn’t want to use Spanner because they didn’t want to be tied to a cloud provider. They could have used Postgres, but they would have had to replicate it themselves and figure out the supporting tech for multi-region deployments.
And they didn’t want to write it from scratch. They needed a cloud-agnostic solution that solved the multi-region challenge for them, which led them to CockroachDB.
“Why would I have my engineering team try to solve the multi-region problem when there’s an engineering team out there that’s already built a solid solution? We needed to make defensive, smart technology solution choices because we are on the hook for our customer’s data.” - Ken Pickering, VP of Engineering
With customers in the United States, Europe, Asia Pacific, and South America, the danger of a slow platform was a primary concern. To solve for that danger they use mostly read-heavy tables (a luxury of choice that not every application has). Ideally, when a customer queries their data, they hit their local read node and do not traverse the world to access the data. This means that their reads need to be near-instant, while writes can be somewhat slower.
Given the nature of this data, they decided to leverage CockroachDB’s global table locality to ensure that read latency is really low. They were willing to compromise write latency in this scenario (excellent read here on Global tables if you’re curious).
Prior to going into production, Starburst worked with the Cockroach Labs team to determine the best scenario for minimizing latency. For example, they could have deployed their application across 4 AWS regions spread out among us-west-1, us-east1, eu-central, and ap-northeast-1.
However, their survival goal = region failure, meaning that if an entire region went down (and assuming the app in that region is also down) their customers would be able to access their data from a replica in a region nearby. Given this survival goal, it was best to rearrange their setup and add an additional region.
In the event that a region goes down, they have a plan:
Since Starburst went into production in November 2021, their approach to achieve their survival goals and minimize latency for their customers has been successful.
“We have to assume that at some point, some of the regions will go down. We can’t have central gravity for this storage – it’s not possible. So even if our primary control plane goes down and a failover hasn’t happened, the customers can still get access to their data.” - Ken Pickering, VP of Engineering
For more insight into Starbursts’ thoughts on multi-region, multi-cloud architecture as well as their plans to continue separating storage from compute check out this webinar.
Their VP of Engineering, Ken Pickering, shares his insight on the challenge of running systems in different cloud environments, building portable architecture, and why multi-region/multi-cloud use cases are a reality for anyone designing highly available systems for a global user base.
Thanks to services provided by AWS, GCP, and Azure it’s become relatively easy to develop applications that span …Read more
The details in this post are based on The Netflix Tech Blog post titled “Towards a Reliable Device Management Platform”. …Read more