How Stake eliminated the risk of downtime for global high frequency trading with CockroachDB
4 hour migration
3 distributed regions
It’s no secret that the U.S. equity markets are the largest in the world and continue to be massive wealth-generating engines, representing over 40% of the $108 trillion global market cap. But if you aren’t based in the U.S., how do you get access to these markets?
Typically you need to have connections, a broker, and money (a lot of money), which renders these markets inaccessible for many. Stake was founded to give Australians access to NYSE and NASDAQ stocks without a brokerage fee. Since their founding in 2017, they have expanded their global presence to Brazil, New Zealand, and the United Kingdom.
Today, they give the 450K+ people using their online platform access to over 8,000 stocks in the U.S. and Australia. Stake has the ability to easily scale to ANY market across the globe because they are building with CockroachDB.
As an online trading company, it’s crucial that Stake is able to provide its customers with the right data at the right time so they can make informed decisions with their money. It’s also worth noting that about 80% of their customers are not your average day traders: they are aggressive and they are advanced. For example, they will place trades at the last minute that need to happen immediately before the market opens the next day. If they experience any lag time or performance issues whatsoever, Stake will definitely hear about it.
Remember the GameStop short squeeze in January 2021? The event was triggered by a lot of online controversy around the company, which resulted in major financial consequences for certain hedge funds and a large loss for short sellers. It also greatly impacted online brokerage services — including Stake.
At the time, Stake was running Amazon EC2 and RDS for PostgreSQL. They were seeing massive amounts of traffic to their platform and vertically scaled the main Postgres database running in RDS a couple of times, until it hit its absolute maximum and gave up. Over the next 72 hours straight, Stake’s Head of DevOps Adrian Hannelly and his team were glued to their computers, frantically refactoring code to try to make the platform run faster because they could no longer scale.
They had to build memory stores in Redis in about 2 hours and quickly bring them online. They had to change the authentication system within 30 minutes. All of these types of changes typically take around 6 months to do, but the Stake team was doing it in a matter of hours.
Even so, a couple of times Stake’s platform went completely offline because the team just couldn’t keep up. In the media frenzy following the GameStop controversy, Stake was accused of not allowing people to trade, which simply wasn’t true. They certainly wanted to be online because they were also making a lot of money along with the traders! The platform simply couldn’t handle the volume.
Once the dust settled, Stake did a post mortem on what happened, why it happened, and how to avoid repeating the problem in the future. They found that the database was the cornerstone of all their issues. They knew they could write embedded code, but there’s only so much they could do to make queries faster and eventually they would just run out of headroom.
Compounding this problem was that the company had aggressive expansion plans. They realized the situation would only get worse as they grew their business, resulting in more frequent outages. The Stake team began researching solutions that could address their top priorities:
At first they considered just building the pieces themselves and syncing the separate systems together. But they ultimately decided against this, because they weren’t in the database business. This wasn’t a product they would ever sell and not anything that their customers would ever care about. It made sense to find a vendor that could do this for them — and their search led to CockroachDB.
“We kept vertically scaling our Postgres databases until they would hit their max and just give up. We spent 72 hours straight cleaning up the mess. No sleep or anything. Just hours and hours of refactoring code so it would run faster because we couldn't add more databases behind this thing. It just wouldn't work.” - Adrian Hannelly, Stake, Head of DevOps
CockroachDB delivered all the features Stake was after, with a managed service offering which would reduce operations for their relatively small team. Most importantly, CockroachDB was Postgres-compatible, meaning a lot of SQL and libraries would not have to be rewritten. The migration would basically be a straightforward lift and shift. Ultimately, Stake reports that they only had to do a 5% rewrite to make sure they had the right indexes in place.
This is how it happened: First, they took the app down and dumped out the Postgres database. Then they imported the Postgres dump straight into CockroachDB from the command line. Since CockroachDB is a distributed database that uses the Raft consensus algorithm (requiring at least 3 nodes) to ensure consistency, data is spread across nodes, so they needed to make sure the right indexes were in place. Without an index to narrow down where the requested rows live, the database would do a full table scan (on massive amounts of data), which would take a long time and add latency.
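The effect of a missing index can be seen in miniature with any SQL engine. The sketch below is an illustration only: it uses Python's built-in SQLite module as a stand-in (not CockroachDB), and the `orders` table, its columns, and the index name are hypothetical. It asks the query planner how it would run the same lookup before and after the index exists.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the planner's description of how it would execute `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The fourth column of each plan row holds the human-readable detail text.
    return " | ".join(row[3] for row in rows)


# In-memory SQLite database used purely as a stand-in for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL,"
    " symbol TEXT NOT NULL,"
    " quantity INTEGER NOT NULL)"
)

lookup = "SELECT * FROM orders WHERE customer_id = ?"

# Without an index on customer_id the planner has to scan the whole table.
plan_before = query_plan(conn, lookup, (42,))

# With the index in place the planner can search it instead.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, lookup, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

CockroachDB has its own `EXPLAIN` statement for the same kind of check, which is a natural thing to run against each important query during a migration rehearsal.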
The team also decided to move their smaller deployments first, then finish with Australia as their biggest. When they were running Postgres, they had three separate instances running in three different regions. At the time, they thought it was easier to keep everything separate to comply with GDPR and PII data laws in South America and Australia. This approach does simplify things, but at the cost of running in silos and with complex data-syncing requirements.
CockroachDB, on the other hand, lets you extend your application across multiple regions while still functioning as a single logical database. Once Stake had all three regions up and running on CockroachDB, they were able to merge their data and get a single, global view of all of it. Moving to CockroachDB also solved Stake’s compliance challenge by allowing data to be pinned to a particular region. For example, if a set of data identifies you as a person (address, social security number), it will be locked to that particular region. Other, non-identifying data such as logins or passwords can roam because moving it doesn’t breach data privacy laws.
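As a rough sketch of what region pinning can look like in practice, the helper below builds the kind of multi-region SQL statements CockroachDB provides for this (`SET PRIMARY REGION`, `ADD REGION`, and table localities). The database, table, and region names are hypothetical and are not taken from Stake's schema. Region names must match the regions the cluster's nodes actually run in, and strict data domiciling may need further configuration beyond what is shown here.

```python
def multi_region_statements(database, primary_region, other_regions,
                            pinned_tables, global_tables):
    """Build CockroachDB multi-region DDL as a list of SQL strings.

    Identifiers are interpolated without escaping, so this is for
    illustration with trusted, hard-coded names only.
    """
    stmts = [f'ALTER DATABASE {database} SET PRIMARY REGION "{primary_region}";']
    for region in other_regions:
        stmts.append(f'ALTER DATABASE {database} ADD REGION "{region}";')
    for table in pinned_tables:
        # Each row is homed in the region recorded for that row.
        stmts.append(f"ALTER TABLE {table} SET LOCALITY REGIONAL BY ROW;")
    for table in global_tables:
        # Data that may roam and should be fast to read from every region.
        stmts.append(f"ALTER TABLE {table} SET LOCALITY GLOBAL;")
    return stmts


# Hypothetical names, chosen only to mirror the example in the text.
statements = multi_region_statements(
    database="trading",
    primary_region="ap-southeast-2",
    other_regions=["eu-west-2", "sa-east-1"],
    pinned_tables=["customer_profiles"],
    global_tables=["auth_tokens"],
)
for stmt in statements:
    print(stmt)
```

Generating the statements rather than executing them keeps the sketch runnable without a cluster; in a real deployment they would be run through a SQL client against the database.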
The Stake team reports that they were able to move their production application on one Sunday afternoon and it only took them about 4 hours. Adrian’s biggest piece of advice: Do a few migration rehearsals beforehand to make it a smooth process on the day. More specifically, walk through every single function you have 3-4 times and compare its behavior to what it was before. Look at what the SQL is doing, then tune, rinse, and repeat. That way, the live migration will just be three easy steps: export, import, and create index.
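One way to make the "compare it to where it was before" step repeatable is to capture the results of key queries on the old database and on the rehearsal database, then diff them. The helper below is a minimal sketch of that idea, not Stake's tooling; the function name, query names, and sample rows are all hypothetical.

```python
from collections import Counter


def compare_rehearsal(before, after):
    """Compare query results captured before and after a rehearsal migration.

    `before` and `after` map a query name to the list of row tuples it
    returned. Returns a dict of query name -> description for each query
    whose results differ.
    """
    mismatches = {}
    for name in sorted(set(before) | set(after)):
        if name not in before or name not in after:
            mismatches[name] = "missing on one side"
            continue
        # Counter makes the comparison order-insensitive but duplicate-aware.
        old_rows, new_rows = Counter(before[name]), Counter(after[name])
        if old_rows != new_rows:
            only_old = sum((old_rows - new_rows).values())
            only_new = sum((new_rows - old_rows).values())
            mismatches[name] = (
                f"{only_old} rows only before, {only_new} rows only after"
            )
    return mismatches


# Hypothetical captured results from the old and the rehearsal database.
before = {"balances": [(1, 100), (2, 50)], "open_orders": [(7, "AAPL")]}
after = {"balances": [(2, 50), (1, 100)], "open_orders": [(7, "AAPL"), (8, "TSLA")]}

report = compare_rehearsal(before, after)
print(report)
```

An empty report after each rehearsal is a reasonable signal that the export, import, and index steps preserved what the application sees.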
“The ability to control where data resides is truly one of the unsung features of CockroachDB. It is so complicated with other systems – CockroachDB gives you flexibility and complete control so you can adhere to all sorts of global data regulations.” - Adrian Hannelly, Stake, Head of DevOps
When Stake originally started building their product, they wrote the application in Java and they wrote it fast! As with many startups, they wanted to go from concept to MVP very quickly. So they took what they knew and ran with it — ultimately creating a rigid monolithic application that became very hard to modify and created unnecessary technical debt.
Once Stake’s MVP found solid product-market fit, it was time to grow their engineering teams. It was also time to move to a more modern approach for building applications via microservices and Kubernetes, and coding in Go. The shift allowed the teams to become more efficient and also enabled new engineers to get up to speed more quickly.
Today, with the addition of CockroachDB, Stake’s platform rests upon a top-to-bottom cloud native technology stack. They are now in the process of adopting a multi-cloud strategy, which will allow them to achieve cloud portability (the ability to move apps and data from one cloud computing environment to another with minimal disruption).
Right now, 80% of their current load is in the AWS regions and they are slowly transitioning data to GCP. Because CockroachDB natively supports multi-region and multi-cloud deployments, this process is fairly straightforward and will also greatly increase operational resilience.
Stake is using CockroachDB to support two critical applications: their authentication system and their financial ledger.
Since they are a financial services company, every call in the system has an authentication token; CockroachDB supports the core system where all the tokens are stored. For example, let’s say one of Stake’s customers logs in via the mobile app in Sydney, and then hops on a flight to Sweden. Because of CockroachDB’s distributed nature, the authentication tokens are able to sync around the world. So when this user opens the app in Sweden, it functions exactly the same without any latency — a fast and seamless experience that is extremely important to Stake’s customers as they track the market and their investments.
Stake also uses CockroachDB as a backend for their own ledger tracking the money and buy/sell orders that flow in and out. This is a massively complex system that requires high availability and data correctness. CockroachDB also lets them scale quickly and selectively when a particular region is experiencing spikes in traffic. For example, when there’s a lot of activity in London but not in Australia, they can add nodes in London alone, and CockroachDB will automatically distribute the workload among the nodes.
Stake’s key business goal is to go fully global and not limit their reach to certain regions. If they get interest from users in any country, they can stand up their business in that geographic region. Now that they have a top-to-bottom distributed system, global expansion will be significantly easier.
The team also feels confident in their ability to avoid a GameStop-type disaster in the future. Stake’s system is extremely resilient and spread out across multiple regions and clouds. They are looking forward to adding crypto trading to the platform within the next few months.
If you want to build a high frequency trading platform and work with the coolest tech (CockroachDB!), Stake is hiring (remote) engineers across the globe: https://hellostake.com/au/careers