It’s the second half of a tense semifinal in the World Cup. England, who haven’t won the tournament in half a century, are deadlocked with Croatia at 1-1, and time is ebbing away. The pressure is mounting.
Then, the screen goes black.
That’s what happened to hundreds of thousands of YouTube TV subscribers in 2018, when the service went offline in the middle of one of the most important games of the tournament. Fans were (understandably) furious.
YouTube TV’s engineers managed to get the stream back up in time for the end of the game (which Croatia ended up winning in extra time). But viewers missed quite a bit of the second half, and could easily have missed the goal that determined the fate of both teams.
The outage highlights a lesson that’s important for any live video provider: you absolutely have to get your infrastructure right, because there are no second chances.
“No one’s gonna pause a large sporting event because some cloud infrastructure provider is down,” says Mux Co-founder and Head of Technology and Architecture Adam Brown.
“You don’t get to retry with live.”
In a recent webinar, Brown walked us through how Mux is working to make high-quality video easy for developers all across the world to provide, while avoiding outages like the one that ruined some soccer fans’ semifinal experience in 2018.
Mux, which raised a US$105 million Series D round this spring, offers two products. The first, Mux Data, provides high-quality performance metrics and monitoring so that online video providers can better understand what their users are seeing.
The second, Mux Video, is an API that enables developers to provide great live or on-demand video without having to do any of the heavy lifting themselves. Essentially, all a developer needs to do is provide the video file or feed, and Mux’s API takes care of ingesting, processing, optimizing for user device and bandwidth, and streaming.
“What Stripe did for payments, we want to do for video,” is how Adam Brown puts it.
This approach makes providing great video content very easy for developers. It also means that Mux has to grapple with a lot of potentially expensive complexity behind the scenes. Dealing with video files means lots of storage, lots of compute for processing, and lots of spiky workloads as users tune in and out of livestreams. Any inefficiencies can lead to skyrocketing costs, a terrible user experience, or both.
And of course, any downtime can lead to users missing out on experiences they’ll never get back.
Luckily, Mux’s engineering team knew a database that could handle the challenging requirements of live video.
CockroachDB was immediately appealing to Mux’s engineering team because of its high availability (HA) and its support for multi-cloud and hybrid cloud deployments.
CockroachDB’s distributed nature means that databases can be configured to survive node outages and even regional outages. This was critical for Mux, which has a company-wide mandate that all services must be able to survive node outages without any kind of user impact. Meeting that mandate was easy: “We were already there with Cockroach,” Brown says.
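To make the arithmetic behind "surviving node outages" concrete, here is a minimal sketch in Go. It assumes majority-quorum replication, which is how consensus-replicated databases such as CockroachDB generally stay available: data remains reachable as long as more than half of its replicas are up. The function name `ToleratedFailures` and the replica counts are illustrative assumptions, not Mux's actual configuration.

```go
package main

import "fmt"

// ToleratedFailures returns how many replicas of a majority-quorum group
// can be lost while a strict majority of the original replicas remains.
// Hypothetical helper for illustration only.
func ToleratedFailures(replicas int) int {
	if replicas < 1 {
		return 0
	}
	// A majority is replicas/2 + 1, so the remainder may fail.
	return (replicas - 1) / 2
}

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("replicas=%d tolerates %d simultaneous failures\n", n, ToleratedFailures(n))
	}
}
```

Under this assumption, three replicas tolerate the loss of one, and five tolerate the loss of two, which is why adding replicas in more failure domains raises the level of outage a deployment can absorb.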
The Mux team knew CockroachDB was resilient – that was a big part of why they chose it – so Brown says it was a “nice surprise” when they discovered that it is also very easy to work with. “The biggest thing that we loved about Cockroach here [at Mux] was the operational simplicity it brought to an HA transactional data store.”
“We’ve had a really good time with upgrading versions, doing our rolling nodes kind of on-demand, not having to orchestrate around it or think about it,” he says.
As a result, “Cockroach has really become our default,” Brown explains. “We use it anywhere where we would have traditionally said, ‘Let’s bring in Postgres.’”
The Mux team still uses Postgres too, Brown says, but he adds: “We’ve never been really happy with the HA story from Postgres. [With Postgres] you can have mostly instantaneous failover, but then operationally it becomes complex: failing back over and managing, and then you get into regional failures and managing upgrades. It just gets really complex.”
CockroachDB provides the consistency and familiarity of Postgres while making high availability – and even zero RTO/RPO (recovery time objective and recovery point objective) – much easier.
Cockroach’s support for multi-cloud and hybrid cloud deployments is also critical for Mux. With live video, Brown says, “you really only get one shot.” So it’s not just about choosing the cloud provider with the most uptime; you also have to think about redundancy and workarounds in case something goes wrong.
“Let’s say our cloud provider goes down,” Brown says. “You want the ability to, even if it has a disruption, at least be able to get back up and running as fast as possible. The world’s not gonna wait for the live streaming provider.”
“We want to be able to sell that we’re able to survive [cloud provider outages] and really guarantee you reliability and performance,” Brown says.
Of course, cost and performance are also factors. “We’ve absolutely had more issues in some areas with one cloud provider versus another,” Brown says. “So having flexibility is a key component there.”
Today, Mux’s applications are built with Golang and gRPC microservices running on Kubernetes clusters deployed across AWS and GCP. But in the long term, Brown says, Mux will probably also have to develop its own specialized hardware to handle video transcoding more efficiently.
“If you look at YouTube, for example,” Brown says, “they are out there developing their own transcoding hardware. We’re going to have to compete with the cloud providers with our own hardware, one day. Not in the short term, but we want to be prepared for that. So flexibility across [multi-cloud and hybrid cloud] is really important.”
Although Mux is using CockroachDB in a variety of places, Brown says that its biggest role is as the database for all of Mux’s video metadata.
When a video is ingested into Mux, a CockroachDB database stores metadata such as the video resolution, file storage locations, and customer data. It also records the file’s state – for example, whether or not it has been transcoded, whether or not it’s ready for playback. This is already set up in a multi-region deployment, Brown says, and it’s soon to be multi-cloud as well.
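As a rough illustration of the kind of record described here, the following Go sketch models a video's metadata and its processing state. Everything in it is an assumption made for the example: the type `VideoAsset`, its field names, the three states, and the rule that an asset must be transcoded before it is ready. Mux's real schema is not described in this article.

```go
package main

import (
	"errors"
	"fmt"
)

// AssetState is a hypothetical lifecycle state for an ingested video.
type AssetState string

const (
	StateIngested   AssetState = "ingested"
	StateTranscoded AssetState = "transcoded"
	StateReady      AssetState = "ready"
)

// VideoAsset is a hypothetical metadata record; all field names are assumptions.
type VideoAsset struct {
	ID          string
	CustomerID  string
	Width       int
	Height      int
	StorageURLs []string
	State       AssetState
}

// ErrInvalidTransition is returned when a state change skips a step.
var ErrInvalidTransition = errors.New("invalid state transition")

// nextState lists the single allowed successor of each state.
var nextState = map[AssetState]AssetState{
	StateIngested:   StateTranscoded,
	StateTranscoded: StateReady,
}

// Advance moves the asset to the given state only if it is the allowed successor.
func (a *VideoAsset) Advance(to AssetState) error {
	if next, ok := nextState[a.State]; !ok || next != to {
		return fmt.Errorf("%w: %s -> %s", ErrInvalidTransition, a.State, to)
	}
	a.State = to
	return nil
}

// ReadyForPlayback reports whether the asset has reached its final state.
func (a *VideoAsset) ReadyForPlayback() bool {
	return a.State == StateReady
}

func main() {
	asset := &VideoAsset{
		ID:          "asset-1",
		CustomerID:  "cust-1",
		Width:       1920,
		Height:      1080,
		StorageURLs: []string{"s3://example-bucket/asset-1/source.mp4"},
		State:       StateIngested,
	}
	// Skipping the transcoding step is rejected.
	if err := asset.Advance(StateReady); err != nil {
		fmt.Println("rejected:", err)
	}
	for _, s := range []AssetState{StateTranscoded, StateReady} {
		if err := asset.Advance(s); err != nil {
			fmt.Println("unexpected error:", err)
			return
		}
	}
	fmt.Println("ready for playback:", asset.ReadyForPlayback())
}
```

In a real service, a state change like this would typically be written inside a database transaction so that the stored state never skips a step, even when several workers touch the same asset.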
Currently, Mux’s largest regions are in the US. As it spins up more European clusters, Brown says, it expects to make more use of CockroachDB’s regional partitioning. But for the time being, he says, “we’ve very successfully used the Follow-the-Workload strategy that Cockroach has.”
Follow-the-Workload is CockroachDB’s default behavior for tables in multi-region deployments that don’t have a table locality. CockroachDB already replicates data across nodes and regions so that it survives node or even region failure; Follow-the-Workload additionally moves the leaseholders that serve reads toward the region where the data is most actively requested, which keeps read latency low. “For us,” Brown says, “for this multi-region strategy where we want to be able to quickly failover to another region, having that workload automatically move to the region that is now promoted to the master for that piece of content is just really convenient.”
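The idea can be pictured with a deliberately simplified toy model in Go. This is not CockroachDB's actual algorithm, which runs inside the database and weighs more than raw request counts. The function `PreferredRegion`, the region names, and the request numbers are all hypothetical; the sketch only shows the intuition that the data's "home" follows wherever most of the traffic comes from.

```go
package main

import (
	"fmt"
	"sort"
)

// PreferredRegion returns the region with the most recent requests.
// Ties are broken alphabetically so the result is deterministic.
// It returns "" when no regions are given. Toy model only.
func PreferredRegion(requests map[string]int) string {
	regions := make([]string, 0, len(requests))
	for r := range requests {
		regions = append(regions, r)
	}
	sort.Strings(regions)

	best, bestCount := "", -1
	for _, r := range regions {
		if requests[r] > bestCount {
			best, bestCount = r, requests[r]
		}
	}
	return best
}

func main() {
	// Hypothetical request counts for one piece of content.
	before := map[string]int{"us-east": 900, "us-west": 100}
	fmt.Println("before failover:", PreferredRegion(before))

	// After traffic is redirected, the busiest region changes.
	after := map[string]int{"us-east": 0, "us-west": 1000}
	fmt.Println("after failover:", PreferredRegion(after))
}
```

The point of the sketch is that nobody has to reassign the content by hand: once traffic shifts to the newly promoted region, the preferred location shifts with it.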
“I guess the simplest way to say it is that it has just been really hands-off,” he says. “We kinda turned it on, and it has just worked really well for us.”
Mux is actually using CockroachDB across a variety of use-cases, and that’s a trend that’s set to continue. “We’re moving a lot of our other projects that are still on Postgres to Cockroach,” Brown says. “For example, our customer API level, access keys, and customer organization data is still in Postgres today. But we plan to move all of that to Cockroach as well.”
From his perspective, Brown says, there aren’t many reasons not to use CockroachDB over Postgres, considering that CockroachDB Core is free and open-source. “We do use the enterprise version of Cockroach for some of our deployments,” he says. “But we actually don’t for some others. Getting off the ground with the open-source version is really great from a developer perspective.”
“We have developer environments that come up in Kubernetes and all of that is orchestrated very cleanly,” he says. “Postgres is pretty good there too, but once you get into the full deployment, just adding more nodes with CockroachDB is so simple. The operational simplicity is there. If Cockroach fits your needs, you should try it, because it makes it really easy to go from one region to two to multiple.”
One thing Brown says Mux did right was build its video product as multi-region from day one. “From past experience, [we knew] it’s very hard to go from one region to two regions,” he says. “Going from two to three gets a whole lot easier. We knew very early that success for this product looked like global scale. Really thinking about a multi-region strategy from day one was very successful for us.”
In other words, when you know that success for your application is going to require scale, why not build for scale from the beginning? “CockroachDB made it really easy for us to build that into the product from day one,” Brown says.