Blog

Engineering

Engineering

Raft is so fetch: The Raft Consensus Algorithm explained through "Mean Girls"

Raft is a consensus algorithm used in distributed systems to ensure that data is replicated safely and consistently. That sentence alone can be confusing. Hopefully the analogy in this post can help people understand how it works. In honor of National Mean Girls Day (“on October 3rd he asked me what day it was”), I present the Raft Consensus Algorithm as explained through the movie Mean Girls. (For a great, more technical overview of Raft, we recommend The Secret Lives of Data.)

Mikael Austin

Mar 13, 2024

Engineering

SIGMOD 2020: Cockroach Labs publishes research paper on CockroachDB

Over the past few months, a team of our engineers, technical writers, product managers, and sales engineers codified the research and learnings of CockroachDB. We are now contributing this knowledge back into the very system from which we have benefited, with the hope of further advancing distributed systems research and design. The research paper, "CockroachDB: The Resilient Geo-Distributed SQL Database", is a labor of love that we are honored to have published by SIGMOD, the Association for Computing Machinery's (ACM) Special Interest Group on Management of Data, which specializes in large-scale data management problems.

Jessica Edwards

June 10, 2020

Engineering

How online primary key changes work in CockroachDB

As of our 20.1 release, CockroachDB supports online primary key changes. This means that it is now possible to change the primary key of a table while performing read and write operations on the table. Online primary key changes make it easier to transition an application from being single-region to multi-region, or to iterate on the schema of an application with no downtime. Let’s dive into the technical challenges behind this feature, and our approach to solving them. As part of the deep dive, let’s review how data is stored in CockroachDB, and how online schema changes work in CockroachDB.

Rohan Yadav

May 21, 2020

Engineering

Tutorial: How to build a low-latency Flask app that runs across multiple regions

If your company has a global customer base, you’ll likely be building an application for users in different areas across the world. Deploying the application in just a single region can make high latencies a serious problem for users located far from the application’s deployment region. Latency can dramatically affect user experience, and it can also lead to more serious problems with data integrity, like transaction contention. As a result, limiting latency is among the top priorities for global application developers. In this blog post, we walk you through a low-latency multi-region application that we built and deployed for MovR, a fictional vehicle-sharing company with a growing user base. The MovR application is a Flask web application, connected to a CockroachDB cluster with SQLAlchemy. We deployed the application in multiple regions across the US and Europe, using Google Kubernetes Engine, with some additional Google Cloud Services for load balancing, ingress, and domain hosting. For the database, we used a multi-region CockroachDB Dedicated deployment on GCP. The application source code is available on GitHub, and an end-to-end tutorial for developing and deploying the application is available on our documentation website.

Eric Harmeling

Apr 2, 2020

Engineering

How to run chaos tests in a multi-cloud environment

This year, as every year, Black Friday and Cyber Monday stressed e-commerce systems to their breaking points. Major companies like H&M, Nordstrom Rack, and other retailers experienced the kinds of costly outages that keep SREs up at night. Multi-cloud infrastructure is sometimes offered as a panacea to these kinds of outages. But multi-cloud deployments are not a band-aid. In fact, they often introduce new complexities into the system that need to be sniffed out. But sniffing out bugs in multi-cloud environments is, by nature, complicated. Ana Medina, a chaos engineer from Gremlin, spoke at ESCAPE/19 about how to do it, including a detailed list of the kinds of errors to search for and checklists of questions to ask.

Dan Kelly

Dec 9, 2019

Engineering

Availability and region failure: Joint consensus in CockroachDB

At Cockroach Labs, we write quite a bit about consensus algorithms. They are a critical component of CockroachDB and we rely on them in the lower layers of our transactional, scalable, distributed key-value store. In fact, large clusters can contain tens of thousands of consensus groups because in CockroachDB, every Range (similar to a shard) is an independent consensus group. Under the hood, we run a large number of instances of Raft (a consensus algorithm), which has come with interesting engineering challenges. This post dives into one that we’ve tackled recently: adding support for atomic replication changes (“Joint Quorums”) to etcd/raft and using them in CockroachDB to improve resilience against region failures.

Tobias Grieger

Nov 26, 2019

Engineering

Parallel Commits: An atomic commit protocol for globally distributed transactions

Distributed ACID transactions form the beating heart of CockroachDB. They allow users to manipulate any and all of their data transactionally, no matter where it physically resides. Distributed transactions are so important to CockroachDB’s goal to “Make Data Easy” that we spend a lot of time thinking about how to make them as fast as possible. Specifically, CockroachDB specializes in globally distributed deployments, so we put a lot of effort into optimizing CockroachDB’s transaction protocol for clusters with high inter-node latencies.

Nathan VanBenschoten

Nov 7, 2019

Engineering

SQLsmith: Randomized SQL testing in CockroachDB

Randomized testing is a way for programmers to automate the discovery of interesting test cases that would be difficult or overly time-consuming to come up with by hand. CockroachDB uses randomized testing in many parts of its code. I previously wrote about generating random, valid SQL. Since then we’ve added an improved SQL generator to our suite called SQLsmith, inspired by a C compiler tester called Csmith. It improves on the previous tool by generating type- and column-aware SQL that usually passes semantic checking and tests the execution logic of the database. In just a few months, it has found over 40 new bugs that the previous tool was unable to uncover. Here I’ll discuss the evolution of our randomized SQL testing, how the new SQLsmith tool works, and some thoughts on the future of targeted randomized testing.

Matt Jibson

June 27, 2019

Engineering

High availability without giving up consistency

If you’re reading this, you’re surely familiar with the arguments for high availability: services are only useful when they’re online. Unavailable services not only lose money, but also damage your credibility in customers’ eyes. This could lead to immeasurable costs to your company in the future. Given that CockroachDB got its name because of its ability to survive failures, we thought we would cover some architectural considerations when building high availability services on top of Cockroach.

Sean Loiselle

Aug 23, 2018

Engineering

Kubernetes: The state of stateful apps

Over the past year, Kubernetes, also known as K8s, has become a dominant topic of conversation in the infrastructure world. Given its pedigree of literally working at Google-scale, it makes sense that people want to bring that kind of power to their DevOps stories; container orchestration turns many tedious and complex tasks into something as simple as a declarative config file.

Sean Loiselle

May 1, 2018