Join the Cockroach Labs founders for an unscripted conversation about the dirty details of building 1.0 and achieving consensus across three co-founders. What do they wish had made it into the 1.0 release? What are they most excited about in this production-ready release? And what happens when they disagree with each other?
Peter: Hi, I’m Peter Mattis and welcome to our first-ever Cockroach Labs podcast. Today I’m going to have an unscripted chat with my co-founders about CockroachDB 1.0, the production-ready release of Cockroach. For anyone new to us, Cockroach Labs is the company building CockroachDB, an open source, cloud-native SQL database. We’ve got in front of us a stack of questions about the release and about the team building it that we’ll be cherry-picking our way through, unscripted-style. Let’s get started with everyone introducing themselves and telling our listeners which inventor they would like to meet, living or dead.
Spencer: Hi. My name is Spencer Kimball. I’m the CEO and I’m also a co-founder at Cockroach Labs. If I had to choose an inventor, well, obviously Edison is a great one. I would say Nikola Tesla, just out of admiration for his ability to think amazing thoughts even though he was very misunderstood in his lifetime.
Peter: Hi. My name’s Peter Mattis. I’m a co-founder at Cockroach Labs and the VP of Engineering. For an intro question, which inventor, living or dead, would you most like to meet? I would choose Thomas Edison. I really believe that phrase, the quote he has about invention is 1% inspiration and 99% perspiration. Having the idea is only a very small part of invention. Actually changing, transforming that into a reality is where all the work is.
Ben: My name is Ben Darnell. I’m a co-founder and also the CTO at Cockroach Labs, and Spencer and Peter took the obvious inventor choices. I’m going to go with an obscure one. I’m not sure how to pronounce his name exactly, but it’s Vannevar Bush. He worked in the ’40s and ’50s and developed a lot of proposals for things that presage the modern internet and hypertext and that sort of thing.
Peter: All right. How did we all meet? I met Spencer back in 1993 at Berkeley. He was actually my brother’s roommate in college, and I remember, I think, coming out to visit them or something and seeing his roommate, my brother’s roommate, Spencer, and I had seen him walk around in bare feet. That was my first memory of meeting Spencer, walking around in bare feet. Later on we actually had a number of classes, computer science classes together. The friendship blossomed from there.
Spencer: I’m pretty sure, since we were at Berkeley, I wasn’t in bare feet. I must’ve been in socks and Birkenstocks or something equally dorky. I remember my roommate James, who wasn’t a computer science major, he was mechanical engineering, and he was talking about what a great programmer his little brother Pete was. I was like, “Oh, yeah. Whatever. I’m sure he’s not that great,” and then when I met him, I was pretty blown away. Pete and I, actually the very first project we decided to work on was the GIMP, so that was back in ’93, ’94. I think we worked on it until ’97.
Ben: Yeah, so I met these guys a little bit later in 2002. I had just graduated college and started my first job at Google, and I got assigned to the same team as Spencer who was starting at about the same time, and we spent the next couple years working on a number of infrastructure projects at Google while Peter was also working there, a few cubicles over, working on Gmail.
Peter: All right. What does 1.0 mean for CockroachDB beta users?
Ben: The biggest difference for people who are already using our beta releases is that with 1.0, we’re committing to a lot of production readiness features including things like zero downtime rolling upgrades, backup and restore, and that sort of thing. The difference between 1.0 and beta is really about being ready for real production use cases.
Peter: There’s also, we’re committing to a release cycle with 1.0. We’re still going to be producing unstable releases every week or every two weeks, but we’re moving to an every-six-month release cycle where we release supported, stable releases. They get quite a bit more testing than those biweekly unstable releases get.
Peter: All right. What are you excited for in releasing 1.0?
Ben: I think the biggest thing is just that this product has been in development for so long. It’s been nearly three years now, since the first commit, and it’s been developed in isolation to a certain degree, which is strange because we’ve been so public about our development process, but this is where the rubber really meets the road and we get to see it in real world usage in starting to solve real problems.
Spencer: I’m really excited to see how Cockroach is used, and this is also, I think there’s another question in here which is about what’s terrifying us about releasing 1.0. Maybe these two things are two sides of the same coin for me, but the, Cockroach has a very aspirational technological intent. It’s inspired by what Google’s done with their database technology, and it’s meant to be both scalable and also very survivable, so highly available and there’s a lot of complexity in there, and it will be amazing to see it fulfill that potential, but also somewhat terrifying to discover how many unforeseen circumstances and use cases that the technology’s applied to. To be candid, I suspect many of those are going to run into trouble early on and that’s really what the 1.0 release to the 1.1 release is chiefly concerned with, which is really the ruggedization of the product. Solving the problems that come up as people use Cockroach for many use cases beyond the things that we originally tested it for.
Peter: For myself, I’m most excited about hearing success stories, people using Cockroach in production. I really want to hear that story about someone deploying a Cockroach cluster, realizing they’re reaching capacity and easily adding a handful of more nodes and having the system just automatically rebalance. On the terrifying side, it is, you put something out there and it will have problems. We’ve done a huge amount of effort to try to get ahead of those problems, and yet they’re bound to happen. The terrifying aspect is, what are those problems going to be that people report, and hopefully there are no disasters.
Ben: Yeah. The terrifying part of this product is that as a highly available, survivable transactional database, we’re making very strong claims about the consistency and durability of your data. We’re asking businesses to trust us with their most critical data. That means that the stakes are very high and we think we’ve got something that’s ready to handle those stakes.
Peter: All right, fun question, which technology do you wish had never been invented?
Spencer: That’s a tough one. This is Spencer here. I’m usually such a techno optimist that it’s hard for me to imagine a technology that I wouldn’t want to see on the world stage despite the fact that technology obviously has a double edged sword sort of effect. If there was one that I think we could pretty much do without, it would probably have to be nuclear, atomic bombs.
Ben: I’m thinking a little smaller here. I really hate talking on the phone, so I’m going to go with the voice telephone.
Spencer: All right, let’s move on. This is really about the kinds of engineering communications that go on inside any company where someone proposes a design or submits a PR for review, and people begin to argue about it. The term of art here is bikeshedding, so I’m curious what you guys think have been some of our fiercest bikeshedding debates.
Peter: I’m not sure about the fierceness of the debate, but we have gone back and forth about how to do logging and structured logging multiple times, and I feel that we still haven’t arrived at something that is completely satisfactory to all parties. By far, that has been the area of the code that has seen the most churn.
Ben: Yeah. Another related area is error handling. Go doesn’t have as clearly established conventions for error handling as other languages, especially when you need to deal with structure in the errors and be able to get information back out of them. We’ve had a series of changes that have, in some cases, literally gone back and forth and reversed each other in the way that we handle errors in a way that maximizes our ability to extract structure from them later on when we need to.
Spencer: Yeah, and somewhat related to that, we’ve had quite an odyssey over the two plus years of developing Cockroach around how we are able to stop a process in an orderly fashion, and Go itself has contributed to this debate with the addition of something called a context. We created our own mechanism called a stopper, and it has this sort of two phase shutdown process, and even that is very difficult to make work. I think that’s a sort of unresolved and continuing conversation.
Ben: All right. Here’s another fun question. Which nonexistent technology do you wish existed today?
Spencer: If I had to choose just one, what would be my favorite? I would have to say it would be great if we had brain uploading technology.
Peter: I knew Spencer was going to say that because he talks about that all the time. I’d actually like almost the inverse of that. I would like to be able to download knowledge quickly into my brain a la The Matrix and be able to get kung fu skills in a day.
Ben: Yeah. I would have to go with the Star Trek transporter.
Spencer: Great. Let’s move on to another founder question. This one actually looks pretty interesting. When have you disagreed with a founder who was right?
Ben: I don’t know. I only disagree with you guys when you’re wrong.
Spencer: Actually, I have the opposite feeling, which is I often disagree and I’m often the one that is wrong. I’ve actually learned over the years to censor my own immediate responses to things as much as I can and to dig a little bit deeper before disagreeing with these two guys because they’re often right.
Peter: Yeah. I can’t remember a specific case, but very frequently happens that Spencer comes in. He’ll have an idea to do something, and my initial reaction will be like, “No, we can’t do that. It’s too hard,” but I also try to temper that reaction and think about it for a while, and sometimes we come up with a way to actually accomplish that.
Spencer: Okay. Well, that was fun. Let’s get back to a question about our upcoming 1.0 release. All right. If you could fit one extra feature or capability into 1.0, what would it be? Maybe we should also think on the other side of this question. If there’s one thing you could take out of the release, what would it be?
Ben: My biggest disappointment in terms of things that didn’t make it into 1.0 is the situation with handling certificates and security. I think there’s, our security story is pretty strong as long as you’re using TLS certificates properly, but we haven’t provided the tools to make that as easy as I would like. It’s still a fairly arcane process to generate your certificates and distribute and manage them in a secure way, and I was hoping we’d be able to make that process smoother for 1.0, but it looks like that’s going to have to wait for the next release.
Spencer: I would’ve loved to get some performance enhancements that I have in mind into 1.0, in particular, the speed at which we can drop and truncate tables. Right now we do that in a very naïve way that maintains all of the past states so that you can still query historically, and that’s difficult to do. There are clever ways to make that go very quickly, and they didn’t make it into 1.0, which we’ll have to correct in 1.1.
Peter: For myself, it is the troubleshooting tools. We have a number of troubleshooting tools built into CockroachDB, but there’s some standard ones that you expect to see in a SQL database. Listing running queries, canceling running queries, and then just some basic tools that apply more to CockroachDB than to other systems. The most frequent cause of having a problem that users encounter when starting up a new CockroachDB cluster is networking issues between their nodes that is, in some ways, outside of our purview. We can’t really fix the problems, but we can provide a tool to much more easily diagnose what is going on in those situations.
Spencer: Did you want to cover the thing that you’re unhappy about making it into 1.0?
Peter: I’m not sure if I’m unhappy, but we, over the past year there’s been an enormous amount of effort put into stabilizing the core key value transactional store, and while that was happening, we kept on adding features into SQL. I’m not really sure I’m unhappy about this, but there might have been, we might’ve put too much into the SQL side and maybe should’ve just diverted additional resources to stabilizing and improving the performance of the transactional key value store, but then this always encounters the problem with just adding more engineers onto a given part of the code base. There wasn’t really capacity to put more people onto the core of the system.
Spencer: Too many cooks in the kitchen?
Spencer: Yeah. It’s hard to answer that. There’s always a tension in terms of what makes it in and what doesn’t. I think we have a fairly good set of things in there. I think post-op we’ll be able to say, “Woo, maybe we shouldn’t have put that thing in there because there may be some area of the system that ends up causing a lot of problems for people in the 1.0 release,” but we’re crossing our fingers there. Let’s move on to a fun question. Do we have any fun questions?
Ben: If you were developing a project using CockroachDB, what technologies are you most likely to use with it? What would that stack look like?
Spencer: I don’t know if that’s a fun question. Well, one of the big reasons that I started working on CockroachDB in the first place was due to the last startup that Peter, Ben and I were all part of called Viewfinder, and we were building a very, what was meant to be a very large scale back end for a private photo sharing system. We built a pretty great one. It was based on DynamoDB so we didn’t have transactions and we had to work around that in fairly clumsy ways, and ultimately when we imagined what the system would need to be if it became something as successful as, say, Snapchat, we realized that DynamoDB would have lots of problems that we’d have to work around. This is a, I think this would be a wonderful example of an application that could use something like CockroachDB successfully.
Ben: Yeah. I have a side project where I’m the maintainer of the Tornado web framework for Python, and so that would definitely be the central part of my stack. As for what I’d put around it, I’m not sure right now because Tornado is an asynchronous framework, and most of the database frameworks in Python are synchronous, and so I don’t know. I may end up having to write that part of the stack myself. This always gets me with projects like this because I’m much more interested in building out infrastructure and frameworks and libraries than actually building the final application.
Spencer: Yeah, and it actually does remind me, there’s a long standing desire I’ve had since the Viewfinder days to build my own ORM. I think that I might get sidetracked before I actually ever built any kind of project using CockroachDB and I’d end up just building a, what I think is a better ORM.
Ben: Didn’t we already do that 15 years ago?
Spencer: Yeah, and that wasn’t exactly, in my opinion, a great ORM.
Ben: I don’t know. I still like the first version of it. The second version of it wasn’t as good.
Spencer: I wonder if that still exists at Google.
Ben: Yeah. I don’t think that’s ever dying.
Peter: I got a kind of fun question for you guys here, but it’s a founder question. What would you do if you weren’t an engineer?
Spencer: Well, I think I might like to take up science fiction writing.
Peter: You could always do that as a post career, hobby.
Spencer: Yeah. That’s something that may be in the cards for me. I certainly enjoy reading it, and sometimes I’m frustrated with the plot lines lacking any kind of likely, or fidelity with the likely future, put it that way. How about you, Pete?
Peter: Yeah. I don’t know. I have no idea what I would do if I wasn’t an engineer. It just seemed I was completely drawn, especially to computer science. Little bit of trivia, I actually entered college as a mechanical engineer thinking I would follow in my father and my brother’s footsteps, and I quickly determined that computer science was much more appropriate for me. Computer science was easy, mechanical engineering was super hard. I praise all mechanical engineers out there who can do the calculations and be good mechanical engineers. That wasn’t in the cards for me, but being a computer engineer seemed like, I’ve just been drawn to computers my whole life. I was definitely born at the right time.
Ben: Yeah, I’m similar to Peter. I was drawn to computer science very early on and even my main hobbies these days are also programming, so I don’t know. I don’t really have an alternative.
Peter: We’d be useless.
Spencer: You guys would be sitting at home, staring at the wall probably. It’s a good thing computer science was open to you as a career path. All right. I’m going to read this question because I can’t answer it, and this is about me and my dog named Carl, who is also a Cockroach Labs pseudo-employee. He doesn’t get paid here but he comes into work often.
Ben: He gets paid in snacks.
Spencer: He does get paid in snacks and beef jerky. The question is, who snores louder, Spencer or Carl?
Ben: I think the whole office can attest to how loudly Carl snores. I’m not sure if many of us know how loudly Spencer snores.
Peter: Yeah, well, I actually had a chance to room with Spencer at a recent company ski trip, and yeah. It’s a toss up there. Spencer was sawing logs there. The room was shaking like a train was going through it, which is interesting because I was also roommates with him back in college. Not just, we didn’t have separate bedrooms, we actually shared a room. I don’t remember any snoring back then. This is some kind of recent development.
Spencer: Carl, by the way, is a Boston terrier, so they have some trouble breathing cleanly. Let’s go on to a 1.0 question. Ben, you want to pick one?
Ben: Yeah, so we’ve been talking about CockroachDB and how it provides multi-active availability, so what is multi-active availability and why does it need its own new term?
Peter: Multi-active availability is a setup of a distributed system such that multiple nodes and multiple data centers are active and can process requests at the same time. The reason there needs to be a new term for this is that existing terms are often conflated in confusing ways. For example, there are standard master-slave or primary-secondary systems, but those terms don’t really cover this setup. Then just saying active-active also diminishes what we’re actually accomplishing here.
Spencer: Ben, you want to add to that?
Ben: Yeah, I think the reason we’re using a new term, multi-active availability, as opposed to the more traditional active-active availability is twofold. One is that active-active implies two replicas and CockroachDB uses at least three. The other is that most deployments that use the active-active term today are inconsistently replicated. Each replica is serving traffic, but the differences between them are reconciled asynchronously. In CockroachDB, with its consensus based replication, everything stays consistent between all of the different active replicas.
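The "at least three" point follows from majority-quorum arithmetic, which is how Raft-style consensus decides that a write is committed. The short Go sketch below shows only that arithmetic; the `quorum` and `tolerated` function names are made up for the example and are not CockroachDB code.

```go
package main

import "fmt"

// quorum returns the smallest majority of n replicas.
func quorum(n int) int { return n/2 + 1 }

// tolerated returns how many replicas can fail while a majority remains.
func tolerated(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{2, 3, 5} {
		fmt.Printf("replicas=%d quorum=%d tolerated failures=%d\n",
			n, quorum(n), tolerated(n))
	}
}
```

With two replicas the majority is both of them, so losing either one stops progress. Three is the smallest group that can lose a replica and still agree, and five can lose two.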
Spencer: Yeah. We also wanted it to be a, indicative of an evolution. Active-active is certainly one sort of predecessor to this idea of multi-active, and high availability is the term that you generally hear in the industry when you’re talking about a system that provides better SLAs in terms of up time. We wanted to combine those two, multi-active because it’s better than, it requires more than just two and availability is part of this new term because it’s an evolution of high availability.
Peter: All right. What was a surprising challenge in releasing 1.0?
Spencer: Well, we had challenges on a number of fronts, but the one that was most surprising?
Peter: I will actually kick off the answer for this. To a certain degree we know about these technologies out there. RocksDB, which we’re using as the underlying local key value data store on Cockroach nodes. There’s Raft and there’s implementations of Raft, and to some degree it seems like, “Ah, I can take Raft, I can take RocksDB, I can put some glue between and I’ll have some kind of distributed, consistent key value store.” I think the surprise is, and something that we should talk more about, is how hard it is to go from having these component technologies and putting it into an entire working system.
Ben: Yeah. I built the first version of Raft that we used, which covered about 80% of the Raft paper, in a week. Then we’ve spent the last two years turning that from a basic Raft implementation into a full-fledged transactional key value store. That’s a lot harder than it looks on paper. I’d also say that a big surprise in the road to 1.0 is just how much breadth you have to cover in the SQL standard to be usable with existing libraries and frameworks. We thought that we would get the core part of the SQL standard that everybody uses and then we’d be able to support Hibernate and Active Record and all of these kinds of frameworks, but it turns out that each of these frameworks goes pretty deep in terms of how much of the SQL supported by each database it actually makes use of. In order to be compatible with the Postgres dialect of SQL, you have to really implement a very large portion of it exactly as Postgres does.
Peter: Yeah. The interesting bit for me there is that the SQL these ORMs generate for the application is reasonably standard and not too esoteric, but at startup time all these ORMs query the database and essentially do reflection to find out the tables, the columns and whatnot. There is the SQL standard’s information schema for providing that information, but with them querying every little detail there, it feels like every tiny corner of what Postgres provides, some ORM utilizes.
Spencer: There’s also a very long uphill slog in terms of getting us to asymptotically approach a decent level of stability in the system, and this is something that we’ve written numerous blog posts about in the course of the last year. This was mostly, I’d say it’s probably the largest single driving effort within the core part of CockroachDB. It must’ve spanned three quarters, four quarters, probably, where we were battling the idiosyncrasies of the different public cloud providers. They have all kinds of problems with their clocks. They have very different strange layers in their storage devices that we found in some. We filed a number of different bugs for them, some of which have been fixed, some of which haven’t, and so we’ve had to do lots of different workarounds. There’s a huge amount of work to be done around the rate limiting on our side so that we don’t run afoul of various limits that are imposed on the processes. That’s been a long running battle, and I think this supports a lot of what Pete was saying, when talking about how much work it is actually to go from the conceptual idea of a distributed database to the actual practical reality of something that can run in lots of different environments.
Ben: If you could pick a company to embody the ideal CockroachDB 1.0 use case, which would you choose and what would the project look like?
Peter: I don’t have a particular company or project in mind, just some attributes of what I would look for. You’d want a project that has a moderate scale. Scale isn’t of primary importance, but you wouldn’t want to choose something whose expected scale is always going to fit on a single machine. You want a project that might start out running on a single machine, but whose expected future scale gives you a reasonable concern that you would run into some issues there. Also, the project should have some tolerance for risk. This is a 1.0 version of a product. It’s a pretty new storage system to bring into any company. It should be approached with a modicum of caution, so there should be some risk tolerance there.
Ben: Yeah, I think to Peter’s point about scale, I don’t think it’s a bad idea to use Cockroach even if you don’t expect to scale up. I think that the consistent replication of CockroachDB can be valuable even if you’re only scaling to a single machine in each replica. I think that we’re aiming to be a better database solution even for, at all levels of scale, even from small to large, compared to existing databases, but the ideal case is definitely something that has the potential to grow and need the scalability.
Peter: Well, that’s the last of our questions. Thank you for listening to our first-ever Cockroach Labs podcast. For any developers out there, take our 1.0 for a spin. And if you’re in San Francisco or New York City, come to our meetups! Ben will be giving an under-the-hood look at everything that went into making CockroachDB production ready. He will be in SF tonight, May 11, and back in NYC on Wednesday, May 17. You can RSVP on the CockroachDB user group pages on meetup.com. Thanks again everyone. Until next time.