Come work on CockroachDB in Sydney, Australia!

G’day! I’m Oliver, a Member of Technical Staff here at Cockroach Labs. After spending the better part of 5 years in the United States, I decided to come back home to Sydney, Australia. With my homecoming, I’m happy to announce that Cockroach Labs is hiring people to work with us from Sydney, Australia!

If you’re curious about me, my journey, and why I’m at Cockroach, read on. Of course, if the news of an opening sounds good, you can jump straight to the job openings we have in Sydney.

My journey before Cockroach Labs

Born and raised in Sydney, I graduated with a degree in Computer Science from the University of New South Wales. During my university years, I did an internship at Google in Sydney, where I worked on tooling for detecting network hardware failures on the NetSoft team. I also worked at Facebook in Menlo Park, California, where I built tooling for diagnosing traffic infrastructure issues behind internet.org.

After graduating, I went to work at Dropbox in San Francisco, where I was on the filesystem team. The filesystem team was responsible for the abstraction that handled the metadata associated with syncing your files. This data was stored on thousands of MySQL shards managed in-house, altogether containing trillions of rows of data.

One of my main projects involved moving the filesystem abstraction out into its own service. This enabled desirable traits, such as centralised rate limiting, and it ensured that the SQL queries hitting the database were controlled and well behaved. It was a long-running, multi-month project that involved consolidating calls from multiple services into one. Of course, considerable effort went into ensuring the new service served traffic with no downtime and at latencies comparable to talking to the database directly from our monolithic server. The service today serves over a million requests per second with high availability.

To ensure we had a well-performing service, the APIs it provided needed to serve traffic at reasonable latencies, which in turn meant having a performant query behind each API endpoint we offered. While designing our APIs, we looked at various callsites and found unoptimised queries that could devolve into large table scans - some of which had caused significant downtime. Tracking these down during an outage was terrible: scrambling to find the offending query and then blocking it from execution was quite stressful.
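To make "devolving into a table scan" concrete: in MySQL, looking at a query's plan with EXPLAIN shows whether it can use an index. The table and column names below are purely illustrative, not Dropbox's actual schema.

```sql
-- Illustrative only: a filter on columns with no supporting index.
EXPLAIN
SELECT file_id, size
FROM files
WHERE namespace_id = 42 AND is_deleted = 0;
-- A plan row with `type: ALL` and `key: NULL` means MySQL will read every row
-- in the table - exactly the kind of query that hurts under load.
```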

To ensure our queries were performant, we turned to a few techniques:

  • Pagination of queries - We had some calls which could request and scan millions of rows. By ensuring each SQL query was constrained to read a maximum number of rows on each call, we were able to reduce a lot of load on the database.
  • Adding indexes to support faster lookups - Some of our common queries turned out to not be backed by an index, meaning they scanned excess rows and took longer than they should have. Where it made sense, new indexes were added to improve performance.
  • Performing offline work when able - Some queries required heavy computation on the live path, such as file searching. If search results didn’t have to be totally consistent with live state, we could migrate file search queries to utilise a separate search infrastructure that could process results offline, and store these results in a database that was friendlier to search type queries. That way, we could handle file search queries on the live path in an optimised fashion. 
  • Denormalising certain attributes to speed up access times for commonly answered queries - We recognised that certain read queries could be sped up significantly if we did some extra calculations and storage at write time. For example, one query we commonly had to answer was “what is the total size of all the files in your Dropbox”. Doing a `SELECT sum(size) …` over your entire Dropbox would require reading all the rows on your Dropbox account, a potentially large table scan! But if we added or subtracted from a separate denormalised value on a separate table every time a file was added or deleted (e.g. `UPDATE denormalised_table SET size = size + delta …`), this query could be answered quickly from a single row read.
  • Of course, some queries just needed to be fixed to utilise indexing, whether by forcing indexes or rewriting them! (A rough SQL sketch of a few of these techniques follows this list.)
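Here is a rough SQL sketch of what a few of these techniques look like in practice. The table and column names (files, denormalised_table, account_id, file_id) are hypothetical stand-ins, not Dropbox's real schema.

```sql
-- Keyset pagination: each call reads a bounded number of rows, resuming from
-- the last file_id returned on the previous page (hypothetical schema).
SELECT file_id, size
FROM files
WHERE account_id = 42
  AND file_id > 100000        -- cursor handed back by the previous page
ORDER BY file_id
LIMIT 1000;

-- Back the lookup with an index so it never devolves into a full table scan.
CREATE INDEX files_by_account ON files (account_id, file_id);

-- Denormalise at write time: keep a running per-account total...
UPDATE denormalised_table SET size = size + 4096 WHERE account_id = 42;

-- ...so "what is the total size of all the files in your Dropbox?" becomes a
-- single-row read instead of a SUM over every file the account owns.
SELECT size FROM denormalised_table WHERE account_id = 42;
```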

Another challenge we faced was data that didn’t follow invariants we assumed to be true. This is especially painful when data is denormalised incorrectly, since it then serves incorrect results. In the size denormalisation case mentioned above, some users had an inflated denormalised Dropbox size, meaning they couldn’t upload any files as they seemingly exceeded their quota - even when their Dropbox was completely empty! Broken invariants like these led to a broken user experience, culminating in support tickets or customers simply walking away (most people just leave a broken product without filing a ticket!). Unfortunately, there were lots of broken invariants hiding, which we couldn’t find without performing expensive SQL scans and joins across the database!

To detect broken invariants, we built a verification system that regularly scanned every row in our database to ensure our data was always consistent. The system read these rows in batches and from replica databases to minimise availability impact. We found that we had millions of inconsistencies in our data, and resolving them and driving the broken invariants down to zero took a lot of labour. However, the reward of building such a system and ensuring zero inconsistencies was significant, as going forward we were able to make changes and add new features with far more confidence.
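As a rough illustration of what one batch of such a check might look like (again with hypothetical table names, and in practice run against a replica): recompute the aggregate for a bounded range of accounts and flag any rows where the denormalised value has drifted.

```sql
-- Hypothetical sketch of one verification batch: compare the denormalised
-- per-account size against a freshly recomputed sum for a bounded key range.
SELECT d.account_id,
       d.size                         AS denormalised_size,
       COALESCE(f.recomputed_size, 0) AS recomputed_size
FROM denormalised_table AS d
LEFT JOIN (
    SELECT account_id, SUM(size) AS recomputed_size
    FROM files
    WHERE account_id BETWEEN 1000 AND 1999   -- batch boundaries
    GROUP BY account_id
) AS f ON f.account_id = d.account_id
WHERE d.account_id BETWEEN 1000 AND 1999
  AND d.size <> COALESCE(f.recomputed_size, 0);
```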

Another effort I was part of involved the migration of billions of rows from our global single-shard MySQL database to a different database called Edgestore, an in-house graph-based database built on top of sharded MySQL with its own API. This was necessary to support the rising user base and to avoid the dreaded single point of failure. However, it was not straightforward - but more on that later!

As you can see, a few projects - mostly built on top of large-scale database systems. Did that have any bearing on why I joined Cockroach? Well…

How did you like working with large scale database systems?

My time at Dropbox was mostly spent working with in-house database systems that had to keep working at scale with high availability. Having been involved in the projects above, I saw plenty of technical and operational challenges:

  • Seemingly simple tasks such as creating a new index or adding a new column were tricky and surprisingly lengthy, especially if we wanted no observable customer impact (availability hits, performance changes or incorrect results) while performing these operations. They needed to be orchestrated in a complex order. In our setup, each shard had a leader with two followers serving as backups. To action any schema change, we needed to apply the change on each follower before promoting a follower to leader. As each promotion resulted in a small availability hit, promotions had to be done one at a time. Since shards completed the schema change at different times, performance or correctness could suffer in between, as some shards had indexes or columns that others didn’t. As a result, this was a slow, heavily babysat shard-by-shard process: schema changes ended up taking several weeks and multiple engineers to pull off.
  • Having data sharded into separate MySQL databases meant we had to give up some powerful features built into the database to enforce invariants. For example, we couldn’t have foreign keys, as MySQL was not able to enforce validation across different shards. This meant we missed out on a cheap way to avoid the inconsistencies we found beforehand, and we were only able to find and enforce these invariants through the large effort of building a verification system for every relation. Another example is giving up “ON DELETE CASCADE”, which would have made tasks such as deleting all data related to a user much simpler (see the sketch after this list). Instead, we had to invest in systems to vacuum and purge user-related entries.
  • There was tremendous difficulty in committing to multiple shards without transactionality. Before the advent of cross-shard transactions on Edgestore (which came after some migrations were completed), many teams found that working in a world where commits could land on one shard but not on another was tricky. This could have gnarly effects for the end user, where some data might appear to be missing for short periods of time. Engineers needed to write large-scale consistency scripts and auto-fixers to correct any inconsistencies in this new model, making for more complex code to manage this state across the stack.
  • Introducing a new database API also had huge costs. In the case of Edgestore, teams needed to migrate from SQL to a custom API with a restricted KV-style interface - variations of Get, Set and List. This required all database statements to be completely re-learnt and re-written (which got even more tiring if there were complex joins unsupported by the new API). Combined with changes in transactional guarantees, these migrations meant lots of work and long periods of validation, with each relation taking at least several weeks to migrate.
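To make the foreign-key point above concrete (using hypothetical tables, not Dropbox's actual schema): on a single database, the constraint below both enforces the invariant and cleans up dependent rows on deletion. Once users and files live on different shards, no single MySQL instance can validate the reference, so the constraint and its cascade have to be given up.

```sql
-- Hypothetical single-database schema: the constraint guarantees every file
-- points at a real user, and deleting a user automatically deletes their files.
CREATE TABLE users (
    user_id BIGINT PRIMARY KEY
);

CREATE TABLE files (
    file_id BIGINT PRIMARY KEY,
    user_id BIGINT NOT NULL,
    size    BIGINT NOT NULL,
    FOREIGN KEY (user_id) REFERENCES users (user_id) ON DELETE CASCADE
);

-- Purging a user is then a single statement:
DELETE FROM users WHERE user_id = 42;

-- With users and files on different shards, neither the foreign key nor the
-- cascade can be enforced, so a separate system has to find and purge orphans.
```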

In the case of Edgestore, the migration was a multi-year, full-organisation effort. Work was needed to create a database system from scratch and keep it up and running at scale, compounded by the need to migrate every relation. Maintaining our in-house database systems was also an operational burden, requiring dedicated engineers to perfect these systems as they matured.

Though hard and in some cases arduous, I found the database work I was doing enjoyable as it was the kind of large-scale infrastructure problems I was interested in solving. I felt fortunate to have worked with many talented engineers at Dropbox who managed to pull off such complex technical projects to keep us operational at scale.

What made you decide to join Cockroach Labs?

Almost every company needs a database to house their data. Successful companies need databases that grow with their success. But not every company can afford to spend the amount of resources that companies such as Dropbox, Facebook or Google do on databases so they can survive massive growth.

When I saw that CockroachDB had the power to abstract these problems away while still using the PostgreSQL syntax developers know and love, I was immediately sold. For me, it was the database that grew with you. If I could contribute to a product that took these pain points away, I would be helping others focus more on their mission of shipping their own product instead of worrying and reasoning about complex database-related issues at scale. In that sense, I felt I would be a part of every product that would be shipped and powered by CockroachDB.

I applied straight away. Walking into the interviews, I already thought the product sold itself. I was even more impressed when I talked to everyone working at Cockroach. I was excited by the upcoming projects, the vision of the company and the people I talked to. Not long after that, I signed!

How do you like working at Cockroach Labs?

Working on CockroachDB has been an incredible experience. It’s been just over a year and I feel like I’ve been involved in so much - some highlights include dealing with the mindblow-yness of time, adding spatial features and indexing, and simplifying our multi-region user experience in the upcoming release.

While there are many aspects that I enjoy at Cockroach, here are a few big ones (in no particular order) that keep me going:

  • The Product. As I mentioned before, it makes me proud to be working on such a powerful product that lets others innovate instead of spending time on their metadata storage solution. 
  • The Tech. Databases are so massively cross-field and interesting. Sometimes you think you’ve seen it all - but I’ve been proven wrong time and again as I’ve seen some truly pioneering stuff whilst working here.
  • Being Open Source. In particular, the community has made some outstanding contributions and it’s scary to think how much further behind we’d be without them. Though I am worried that my wife likes to check up on me on GitHub!
  • The People. There is some amazing talent here at Cockroach and that knowledge is shared around well. Everyone has been inviting and open with their time - it’s been easy to jump on quick calls to debug or solve issues with one another. Furthermore, the people team have been great at making sure we are all still connected and thriving as an organisation, with social events and initiatives to keep us feeling together, even as we work remotely.

What brings you back to Sydney?

When the virus-that-must-not-be-named came along, my wife and I decided that it was time to move home. The US was a great adventure and scratched our traveller itch, but coming back home to the familiar suburbia of Sydney (and the wonderful fresh sea breezes) was always our long term plan. Of course, coming back was complicated, as I still wanted to be involved with Cockroach Labs but there was no Aussie presence yet.

Fortunately, the team at Cockroach Labs was interested in the talent that could be tapped in Australia, and I was given the go-ahead to come back and spin up a new office. We are already growing rapidly, with a Series E round under our belt that puts us in a great position to keep growing aggressively.

Come join us down here in Sydney!

I’m happy to be home, but I’d be even happier if you decided to join us down here in Sydney! We are currently looking for Site Reliability Engineers and Software Engineers to join our team and help build out the start of a new shiny office in the land down under.

You can see the current positions on our careers page - there will be more to come in the future. If you’re interested and are already in Australia, don’t hesitate to apply or reach out to me directly.
