Episode 5
Building reliable systems with DoorDash's
Matt Ranney
DoorDash
This week we’re joined by Matt Ranney, Principal Engineer at DoorDash, who emphasizes the importance of finding productivity angles in smaller teams with more automation, and explains how recognizing that unit test coverage isn’t the only valuable metric can lead to building more efficient infrastructure systems. Matt also discusses the pros and cons of Kubernetes and microservices, and describes how DoorDash approaches failures and downtime in its microservices. He stresses that, for much of what DoorDash does, partial failure is still complete failure, and that taking steps to prevent and mitigate these failures can greatly improve overall system reliability.
Tim Veil:
Well, welcome to another episode of Big Ideas in App Architecture. I am super excited today to welcome Matt Ranney on the show. Matt is a, what should I say, principal engineer? What’s the official title, principal engineer at DoorDash?
Matt Ranney:
That is my current title, yes.
Tim Veil:
That is your current title. So I’d love to get this thing kicked off by just learning a little bit more about you. I like to start episodes this way just because I think everybody has such a unique and interesting story to tell. And I had the opportunity not only to meet with you beforehand but kind of take a look through your LinkedIn profile, and you have done some really interesting and exciting things. So maybe we just start with what you’re up to now, but maybe we’ll spend a little time getting to know you, about where you were before being at DoorDash.
Matt Ranney:
Well, way back when, I worked in the ISP business, like doing internet routing, back when the internet was much, much smaller, and really got exposed to a bunch of cool stuff about the way things really work behind the scenes. And that kind of shaped the way that I went about the rest of the things that I worked on is seeing the internet become a real thing, and working behind the scenes on it was pretty cool. So I did a bunch of related things for a while, mostly trying to measure or visualize or do analytics on helping people understand what their networks were doing. So did that for a while and then had an opportunity to work on a pretty interesting application networking technology, which was making voice over IP work in Iraq. And so I know that’s kind of out of left field a bit, but that’s just the way life works sometimes is I got that opportunity. And it turned out, it’s a super interesting problem. It’s really, really hard to… We’re using these satellite internet, like geostationary satellite internet, high latency. Everyone says, “Oh, you can’t do voice on that,” but it turns out you can, it just means there will be quite a lot of latency. Everyone said you couldn’t do it and I was like, “I bet you can do it.” And it turns out, you can. And so I did that for a while. And in the process of working on that problem, the co-founder of this company that I had done this work for, we got together and we thought we would try to do a sort of consumer version of this now that mobile phones were getting more powerful. And so I started a company with him called Voxer, which was kind of a continuation of that same idea, which was like-
Tim Veil:
And when was this? So this was in the early 2000s, was it?
Matt Ranney:
Yeah, yeah, about then. I’m bad with dates unfortunately. I always have to look it up.
Tim Veil:
Yeah, 2007-ish, I think, if I look at your profile?
Matt Ranney:
Yeah.
Tim Veil:
Okay, that makes sense.
Matt Ranney:
Anyway, we started before the first iPhone came out, and so we were trying to make this work on pre-iPhone devices and that was not very easy. Pre-iPhone devices were not so good. But anyway, that was a very interesting thing to work on, trying to make live voice work on, by modern standards, very low-powered devices, very slow and unreliable mobile data, but we did. We made it work as well as you could make it work and got sort of popular, but that was… I worked on that for quite a few years. It took a long time before we actually had something that was useful. But yeah, that was a really cool thing that I worked on. But after a while, that company neither succeeded nor failed, and living in the SF Bay Area is very expensive, and so needed to move on and go do something else. And so I went to go work at Uber. And there, I worked on the backend infra and scaling things. An interesting thing that I had run into working at Voxer was we had not a lot of money, but we had a lot of users, so we were really good at giving stuff away for free, which is the business model of lots of startups, just give stuff away for free. We were really good at giving stuff away for free.
Tim Veil:
Yeah, that’s a talent.
Matt Ranney:
Yeah. I mean, some people, the better you are at it, the more free stuff people expect to always be free. But anyway, there was an interesting kind of dynamic that we had, which is we didn’t have very many employees. Places like Uber or DoorDash or whatever, thousands of engineers. You go to your Facebooks and your Googles and you’re going to get tens of thousands of engineers. We had seven. And so, in order to support millions of concurrent users with seven engineers, we had to make some kind of clever architectural choices. And amazingly, that system, as far as I know, is still working, even though I don’t think anyone’s really done much to it in years and years. But yeah, I learned a lot about how you can do some big scale stuff with not a big engineering team. Then I went to Uber, where we had a big engineering team, or certainly we built one right before my eyes while I was there, and worked on some similar problems, how can we make this thing reliable and that sort of thing. Then I got an opportunity while I was there when Uber started to build a self-driving car program. I had actually worked on a self-driving car for the DARPA challenge with some other friends in the Bay Area, and we made one out of an old car we bought on Craigslist, and borrowed some early LiDARs, and made a-
Tim Veil:
I see a consistent theme here, making a lot with very little seems to be kind of flowing through the story here.
Matt Ranney:
Yeah, yeah, yeah, yeah. Well, anyway, so then Uber had started this program, and I just kept persistently asking until eventually they were like, “I guess you don’t have a degree in robotics, but you do seem to know a little bit, and, fine.” So I started working on self-driving, which is yet another super interesting different problem. And there, I was working on building a simulator to validate the autonomy software. In case you don’t know it, these cars are very, very expensive, the prototype versions of them are. They don’t have very many of them and they cost a lot. So if you want to test your software, you probably don’t want to test it on a vehicle. Even on a closed course or whatever, it’s just way too slow, right? You want to test it in CI, in some kind of non-car environment.
Tim Veil:
Sure.
Matt Ranney:
So anyway, we built this simulator that was like a video game that the cars would drive around in. Yep, so did that for a while, and that autonomy problem is a real tough one. And then I had an opportunity, then, to go work with some of my former Uber colleagues who had gone to DoorDash. And so I’ve been there for a couple years now. And I am working on similar kind of backend things and just, in general, how can we scale an engineering team, like not have everyone be blocked all the time basically? What kind of abstractions or interfaces or systems can we build so that we can harness the full power of this mighty engineering team?
Tim Veil:
Yeah, so there’s so much I want to unpack there. So first of all, thank you for explaining your history, and it is certainly a fascinating one. I think one of the things I’d love to explore a little bit, because it’s top of mind for us right now, is just what you ended on, which is this idea of properly scaling engineering teams. Because I think one of the things that you called out, and I don’t want to put you on the spot here, because I’m sure your current employer may end up listening, but I think one of the things that certainly I have felt recently is an appreciation, maybe even a longing, for times when things were smaller, there were fewer people, you had more control. And you’ve now seen it really from, I think, both extremes, where you’re starting a company, you get to have all the control, up to working with, obviously, these very large, global companies, where there’s now hundreds or thousands. I mean, can we talk a little bit just about, A, I think I’d love to understand a little bit about what DoorDash’s philosophy is in general, but maybe more specifically, what’s your philosophy about scaling teams? And then, I guess, not wanting to put you on the spot, maybe the question is not which do you prefer, but what are the pros and cons maybe of those really small, tight-knit teams, where you have to do everything, you have to go the extra mile, versus these large teams? I know there was a lot there, but I just think there’s so much we could chat about on your experience in building these teams.
Matt Ranney:
Let me just say something real quick about the small team environment that we had at Voxer. There is a power to the constraint in that, if you truly only ever have seven people, you just can’t do everything. Not only can you not do everything you want to do, you just can’t even think about, you can’t even say, “Boy, it sure would be nice.” You don’t even have time for that. We’re just like, “Listen, what is something that I know we can do? And I know it will just be automated.” And I think that, curiously, as teams get larger, you end up having lots more manual steps, just because, I don’t know, it sort of seems like that’s responsible or you can’t think of any other way to do it. But when there’s only seven people, it’s like, “No, this 100% has to be automated. There’s no way we’re ever looking up any of this.” If users have problems like… I don’t know. Unless we can see the aggregate user behavior change, we’re never going to do anything about it. And obviously, if they don’t want to pay us any money, we’re giving stuff away for free, right? So that’s the deal. At a place like Uber or DoorDash, where you’re moving money around, you can’t just let customers have a bad day, right? You have to actually help them all. So it’s not entirely a fair comparison, but I think that the interesting thing about, though, the extreme, like taking it all the way to like, “Oh, seven people, wow,” you start by thinking, “How will this be automated?” And I definitely know that at most of the other places that I have been in, they start by saying, “Well, let’s just get some docs together and write a bunch of docs. Well, we can probably just do all this in a spreadsheet,” like these manual processes. It just seems easy because you have so many people.
Tim Veil:
I wonder, too, if there’s something in there about… I don’t want to call it productivity because that could be kind of a dangerous concept, but I do feel like when there are only seven people, when you have very limited resources, and you know you can’t go back to finance or whomever and say, “Look, I need another person to do this. I need another person to do that,” that you’re almost forced into making better decisions, whether it’s architecturally or for whatever reasons. And when those constraints are removed, it seems very easy to put everything off, maybe I won’t do it quite as well as I know I could or I won’t think it through quite as well as I could, because I can always just get somebody else or I can add a resource or capacity to overcome challenges. Do you think there’s a productivity angle, where you’re kind of working harder and more focused in these smaller teams than maybe in larger teams?
Matt Ranney:
Yeah, I mean, I think there’s certainly something there. I mean, I think there’s definitely something in terms of the more people that you have to coordinate with, just because there are more people and you need to tell them what you’re doing, that’s just inherently inefficient, right? You can’t tell everybody everything. There’s no way that that can ever work. Yeah, I mean, so I guess I should be really, really clear that, just because we pulled this off with seven people, I bet you we upset a lot of our users because we just couldn’t fix problems. And I also know that the number of product features that we had was much smaller than like a DoorDash, right? DoorDash has vast complexity in all of the different ways in which it can work and all the different humans that are involved as part of the way that the system works. It’s just way, way, way more complicated. That said, in a big organization, I think it can be tempting to sort of ignore or overlook this kind of automation-first idea and do things that seem, in a really weird way, more correct. People will say, “Oh, yes, it’s more correct if we adopt this design pattern that I read about,” or do something in the name of correctness because we have the engineering resources to do it right. And I think, curiously, a lot of times, those so-called correct solutions end up being just more expensive and harder for everyone to work with. They end up having in many ways the opposite effect of scaling and flexibility. I mean, it is a weird paradox that you can fall into.
Tim Veil:
Out of curiosity, do you have a technology or a pattern that comes to mind when you’re saying this that you could share? Because something came to mind to me, but I’m not sure we’re talking about the same thing, and I’d love to hear what your thought was.
Matt Ranney:
Yeah. Sure, sure, sure. I mean, I’m meaning it in a fairly general way. But just as two examples of just how general this is, here’s one, which is, a lot of folks will say, “Oh, it’s super important that we have this exact number of test coverage.” They’ll be like, “Oh, this is best practice, we got to get to 80…” I don’t know, some number, right?
Tim Veil:
Some number, right.
Matt Ranney:
“… Some number of unit test coverage.” But somehow, even with that, we still ship bugs. We still ship bugs. The thing, it still crashes, but how is that possible? Well, it’s possible because unit tests are very, very narrowly focused, and in a distributed system, which nearly everyone is building these days, that ends up actually not being as useful as integration or functional tests or whatever you want to call putting all the pieces together and running it, like running the whole thing end to end. I mean, it’s somewhat counterintuitive in a way, because people have been told, “Unit test coverage is the way professional software engineers do things.” And even now, I am afraid of what people are going to say when they hear me say that. Am I implying that unit tests are bad? No, I’m not saying they’re bad, but this kind of strict adherence to like, “We can’t ship this without these [inaudible 00:18:51].”
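To make that point concrete, here is a minimal Python sketch. The two "services" and the cents-versus-dollars mismatch are invented for illustration and are not taken from DoorDash's systems. Each function is fully covered by its own unit test, yet the bug only appears when the pieces are run together.

```python
def price_service(item):
    """Hypothetical service: returns the price of an item in cents."""
    return {"burrito": 950}[item]


def format_total(amount_dollars):
    """Hypothetical service: formats a dollar amount for display."""
    return f"${amount_dollars:.2f}"


def checkout_total(item):
    # Integration bug: cents are passed to a function that expects dollars.
    return format_total(price_service(item))


def unit_tests_pass():
    # Each unit behaves exactly as documented, so coverage looks perfect.
    return price_service("burrito") == 950 and format_total(9.5) == "$9.50"


def end_to_end_test_passes():
    # Only a test that exercises the whole path notices the wrong total.
    return checkout_total("burrito") == "$9.50"
```

Here `unit_tests_pass()` succeeds while `end_to_end_test_passes()` does not, which is the gap that a coverage number alone cannot show.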
Tim Veil:
I think it’s very true. That’s actually a really good one. It’s not the one I was thinking of, but I think it’s a really good one. And it is emblematic of, I think, something we see a lot in this field, and I’ve seen this in a number of places in a number of ways, where we get very focused on certain metrics and, “I am going to meet this metric, come hell or high water,” without maybe stopping to reconsider, is that exactly relevant any longer given other things that are happening in my stack or other things that are happening in the world? It’s definitely something that kind of resonates with me. It’s like, “Oh, yeah, yeah, I have to have 80% test coverage.” Well, okay, but are we maybe in danger of optimizing for the wrong metric? Maybe it’s not the amount of test coverage, maybe it’s the number of P1s or P2s or whatever other end result to the customer is. And I think, oftentimes, at least in software engineering application architecture, I feel like sometimes we get stuck in optimizing toward metrics that… Yes, I agree wholeheartedly with unit testing, it’s a wonderful thing, but you can’t run around and say, “Mission accomplished,” just because you hit a certain number of test coverage because, you’re right, I mean, bugs will still happen.
Matt Ranney:
Yep, yep, you’re still going to ship bugs. Yeah, well, here’s another example. People have a lot of strong opinions about what kind of database you should use and how you should use it, perhaps relevant to your professional interests. Here’s an example. “Well, to do what you’re suggesting, we will have to give up on strong consistency.” “Isn’t that bad? I’ve heard it was bad.” “We’re moving money around. You can’t have eventual consistency if you’re moving money around, right?” And the funny part is that’s literally how the financial network works, none of that is strong consistency. But anyway, on the same topic, like, “Oh, there’s a way that you should design your schema, you should never de-normalize things, etc.” And it’s like, “Sometimes. Maybe that’s true sometimes,” but I think people get… For those same reasons, while trying to be good professional software engineers, they will end up making a much harder system to work on and maintain and operate than if they were just like, “Well, why are we doing this?”
Tim Veil:
I’m curious what your thoughts are on two other things that have come to mind. I think we’ve talked a little bit about them in previous podcasts, and they’re front and center, at least in my mind, with some of the work we’ve been doing recently. Obviously, I want to talk more about database stuff, but two other technologies or concepts I think sometimes we get too religious about is, one, and you and I may not agree on this, is-
Matt Ranney:
Uh-oh, controversy.
Tim Veil:
Well, no, it’s just the pendulum swung so hard to microservices, like everything, everything, everything all the time. So that’s one, and then the other one, which, and again, I’d love to hear more about what you guys are doing relative to both these things, the other thing in technology, I think, sometimes, at least in my work out in the field working with all sorts of organizations, is people kind of like, “I’ve got to adopt Kubernetes. I have to.” “But, why?” “You know what? I just have to.” And so I feel like, at least in the work we’re seeing, there’s certainly the database aspect, but it’s, “Man, we’re migrating everything we have to microservices,” and, “We’re migrating everything we have to Kubernetes.” And sometimes, my experience has been, I’m curious to yours, those technologies also come with a ton of complexity that sometimes isn’t always obvious.
Matt Ranney:
I have a sort of high-level observation about… Is it about both? Well, it’s definitely about one of those things. Maybe it’s about both, we’ll see. Well, I’m going to say it anyway, which is something like these kind of infrastructure choices, like, “Oh, you should use Kubernetes,” or whatever, the fact that so many people know what Kubernetes even is is kind of surprising to me. Because I feel like at this point it should be so low in the stack that people interact with, I don’t know why we would expose most people to Kubernetes. I mean, it’s super fiddly and low-level. It’s got a million different options and it’s like looking into the engine room of your system. And, yes, someone’s got to go in there. Yeah, of course, they do, obviously, but not everybody. So I don’t really care one way or the other if people use Kubernetes, I just don’t think we should be exposing those low-level interfaces to most people writing services to run a business.
Tim Veil:
I totally agree with you. It’s another example in my mind of those things that people get very caught up in. It’s like, “I must meet this metric,” or, “I must adopt this technology,” but it’s like, “Oh, well, wait a minute. I might be able to get you to the same end goal, or what goal you really want, without necessarily worrying about things that, A, are archaic or, B, not necessarily at the level of detail you need to be concerning yourself with.”
Matt Ranney:
And for what it’s worth, Kubernetes, fine. I got nothing against it. I just think that it is a natural progression; we should all be trying to automate ourselves out of a job and move up the stack. We used to fiddle around in data centers, and I’d spend all this time in these deeply air-conditioned rooms racking and stacking and being very proud of very tidy cable management. It has been a long time since I’ve been in a data center, and Kubernetes is like that, I think. You didn’t used to send a lot of people to the data center. It used to be like one or two people that even had the access card or whatever, but I think that applies to this as well. I think that should just be abstracted away. But I want to talk about the other topic you mentioned, which is microservices, and why do people do it, and is it good or not. So I’ve given quite a few talks about this topic, so I have any number of opinions, could go on and on, but I think the really interesting thing to get really clear about before you do it is why. So most people start out with some kind of a monolith. Whether it’s a Rails or Django or whatever, they’ve got something that got their service going, and now it’s getting hard so it’s cool to say like, “Oh, we should do this as services.” But to do it without knowing exactly why you want that… Because I think people just sort of assume that it’s all good. Because they see the bad. They’re like, “Oh, this is getting really hard to work on. We’re all stepping on each other.” And yeah, you probably are. It’s hard without really good build and release tooling, merge queues, clever analysis about the way the modules are laid out to know whether one change can affect another change and stuff like that. You can still do it, people do do it, but it doesn’t happen without work.
So I totally get that people run into friction with a monolith and go, “Hey, microservices, here, let’s go.” And it’s super fun at first, but the main thing that I wish everyone would fully internalize is that you’re adding a new dimension of partial failure, and those are very hard to test for. I think if we would just get a handle on that, the rest of the problems would sort of go away. If we had a really good strategy for partial failure or just understanding, “Oh, wait, you mean sometimes this won’t work? Hmm.” I don’t know. Is that good? That might be bad. It’s like, “Well, what are you supposed to do?” So Netflix has talked a lot about like, “Oh, we got fallbacks and we can do a degraded experience.” And that’s cool, I think, in many cases, depending on what the product is and what the service is. And there’s obvious things you could do for fallbacks, but what are you supposed to do at DoorDash when you use the checkout and it’s like you got a degraded experience? We didn’t quite check out, you didn’t quite get your food, but we’ll show you a picture of it or something and you can imagine what it would be like. We’ll give you someone else’s food, right? Obviously not, right? We actually have to give you your food. So it doesn’t always make sense. And just reasoning about that, like partial failure or degraded modes of operating, and how to validate that it’s doing the correct thing in these partial failure cases, that is the hard part.
Tim Veil:
Yeah, I think that it’s very true, not only for microservices but, I think, large distributed systems in general. I mean, they are wonderfully complex and solve some really interesting and important things, but I think my thinking on all of it is you just have to be aware that there is this hidden complexity, whether it’s for exactly the reasons you described or other things. And sometimes these panaceas, I think, that we create for ourselves don’t always turn out that way. Like you said, maybe the early period, the honeymoon period, looks great; but the farther you get into this, when things start to go bump in the night, or bump in the day as I remember Sean saying once on a webinar, a former colleague of yours, that’s when things get really scary. And the more complex the underlying system, the harder it is, I think, to figure out exactly what went wrong.
Matt Ranney:
Oh, there’s one more thing I wanted to say about that, which is that I think that it is possible to do the kind of testing that we need to do for microservice architectures, but for whatever reason, people don’t do it or, if they do do it, they don’t talk about it. So I’m pretty sure it’s the first one. I’m pretty sure that they just wait for it to fail. And people talk about like, “Oh, chaos testing. Oh, let’s just break stuff and see what happens.” But breaking stuff to see what happens tells you what… Maybe you understand why it happened, maybe. But without understanding what it should do when that thing breaks, I don’t even think it ends up being that useful. So what I think we need to do is build out fault injection. So you run these end-to-end tests with precisely injected faults, and the test harness knows that the fault has been injected, and it knows what the correct behavior is when this partial failure condition happens. And I don’t know of any way to specify that. How do you write a test that says, “By the way, if A calls B calls C, if C fails, A’s test can know that, when C fails, this is the right thing to do”? So I think other people have come at it from different angles. We are working on a similar system to that, which is very promising so far. I think it’s going to be very cool. I think we actually did do a blog post about it already on the DoorDash blog, this framework called Filibuster.
Tim Veil:
Hmm. Well, tell us about it.
Matt Ranney:
You might want to check that out.
Tim Veil:
Or, tell us what you can about it, yeah.
Matt Ranney:
Yeah, I mean, we wrote about it, but, basically, it tries to address that problem. It gives you a way to write your tests that has hooks that say, “We injected these faults,” so when you’re handling errors you can say, “Oh, okay. Yeah, this is actually what we should have done.” But basically, it intercepts the RPCs with some clever magic, like the same way that OpenTelemetry and other kind of similar systems work, where you could kind of transparently interpose on your network calls and break them, but in a precise way. So anyway, it’s still pretty early days for us on that, but I think it’s going to be really good. I think something like that, if we got that sort of standardized and well accepted, I think that fixes most of the problems.
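The idea of a test that knows which fault was injected, and what the caller should do about it, can be sketched in a few lines of Python. This is not Filibuster's API. The `FaultInjector` class, the `RpcError` exception, and the three toy services are hypothetical names chosen only to illustrate the "A calls B calls C" case described above.

```python
from dataclasses import dataclass, field


class RpcError(Exception):
    """Stands in for a failed remote call (hypothetical)."""


@dataclass
class FaultInjector:
    # Names of callees whose calls should fail in this test run.
    faults: set = field(default_factory=set)
    # Faults that actually fired, so the test can assert on them.
    fired: list = field(default_factory=list)

    def call(self, name, fn, *args):
        """Interpose on a call: fail it precisely if a fault is configured."""
        if name in self.faults:
            self.fired.append(name)
            raise RpcError(name)
        return fn(*args)


def service_c(item):
    return {"item": item, "price": 10}


def service_b(rpc, item):
    # B simply forwards to C and lets errors propagate.
    return rpc.call("C", service_c, item)


def service_a(rpc, item):
    # A defines the intended behavior under partial failure.
    try:
        quote = rpc.call("B", service_b, rpc, item)
        return {"status": "ok", "price": quote["price"]}
    except RpcError:
        return {"status": "unavailable", "price": None}


def run_case(faults):
    """Run A end to end with the given faults; return A's result and fired faults."""
    rpc = FaultInjector(faults=set(faults))
    return service_a(rpc, "burrito"), rpc.fired
```

A test can then call `run_case(["C"])` and assert both that the fault on C fired and that A returned the "unavailable" response, rather than only observing that something broke.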
Tim Veil:
Yeah, I can see that being enormously useful; certainly, I think for us and what we do. I mean, understanding failures and being able to inject those would be enormously powerful. I think where I was going to go, and I think it’s actually a really, really good segue, which is, I think, underpinning your comments about testing is, when things go wrong, there is an impact. I mean, you were kind of joking about orders not being fulfilled. What is the DoorDash thinking on the impact of these kinds of failures? And what does it mean to DoorDash if a system goes down, microservices go down, a database goes down? How do y’all think about that? How are you working to resolve that? What’s the impact of those things? I just think that’s such a fascinating company because, as we’ve seen over the last couple years with Covid and the like, DoorDash’s popularity, I think, has gone through the roof with tons and tons of usage. I mean, how do you guys think about it? What is the impact when things don’t work?
Matt Ranney:
Yeah, I mean, well, like I said, it’s really hard. Partial failure is complete failure for a lot of stuff that we do. And we have a lot of services and a lot of them have to be up. Now, we have been able to carve out some kind of domains or phases of the ordering flow so that they’re allowed to fail independently. We’ve now gotten to the point where we have multiple deployments that are sharded geographically. So if one of them breaks, it doesn’t always cause a global outage. So we’re doing some bulkheading by the markets and also by the phases of the order. So the browsing and then the interacting with the merchant and then the interacting with the dasher, those three things can kind of fail independently of each other. And so maybe no new orders come in for a little bit, but all the ones that came in will keep getting fulfilled, or the other way around, maybe we can accept them briefly while we fix the problem with the fulfillment side. But yeah, I mean, it’s been a huge project. That’s basically what I’ve been working on for most of the time since I started. I’ve only somewhat recently been shifting to like, “How can we get more…” I mean, how we get more efficiency with this large team is somewhat more of a recent project, but mostly I’ve been working on how can we make our system more reliable? And those were the two things that we did: partitioning the flows so that they’re somewhat isolated, and bulkheading by market.
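As a rough illustration of bulkheading by market and by order phase, the Python sketch below models each (market, phase) pair as its own isolated cell. The class names, the market names "east" and "west", and the phase labels are assumptions made for the example and do not describe DoorDash's actual deployment layout.

```python
from dataclasses import dataclass

# The three phases mentioned above: browsing, merchant, dasher.
PHASES = ("browse", "merchant", "dasher")


@dataclass
class Cell:
    """One isolated deployment for a single market and phase (hypothetical)."""
    market: str
    phase: str
    healthy: bool = True

    def handle(self, order_id):
        if not self.healthy:
            raise RuntimeError(f"{self.market}/{self.phase} is down")
        return f"{self.phase} ok for {order_id} in {self.market}"


class Bulkheads:
    def __init__(self, markets):
        # Every (market, phase) pair gets its own cell, so failures stay local.
        self.cells = {(m, p): Cell(m, p) for m in markets for p in PHASES}

    def handle(self, market, phase, order_id):
        return self.cells[(market, phase)].handle(order_id)


def demo():
    system = Bulkheads(["east", "west"])
    # Break browsing in one market only.
    system.cells[("east", "browse")].healthy = False
    results = {}
    for key in [("east", "browse"), ("east", "dasher"), ("west", "browse")]:
        try:
            results[key] = system.handle(*key, order_id="o1")
        except RuntimeError as exc:
            results[key] = f"failed: {exc}"
    return results
```

In `demo()`, browsing in the "east" market fails, while fulfillment in that market and browsing in "west" keep working, which mirrors the "no new orders for a bit, but existing ones keep getting fulfilled" behavior described above.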
Tim Veil:
Now, I think we’ve talked about this publicly, I mean, you guys are users of Cockroach, correct?
Matt Ranney:
Oh, yeah. Oh, yeah.
Tim Veil:
Can you tell us just a little bit about where it fits in, or how it fits in, or what y’all are using it for? And obviously, don’t share things you’re not comfortable sharing, but I’m just curious to understand what drew you to the technology and how it’s being used to some extent.
Matt Ranney:
Sure, sure. So we have a lot of services. It’s hard to give you the exact number because what we don’t have is a crisp definition of what constitutes a service. And so, anyway, I mean, let’s just say it’s definitely hundreds. It’s many hundreds of what you might call services, and most of them have some amount of state that they need to maintain, and a lot of them use Cockroach to maintain that state. And it’s for all… Yeah, I mean, we have hundreds of Cockroach clusters.
Tim Veil:
That’s awesome. We’ve been talking a lot about technology. One of the things I have been interested in really this entire time is what… And I know people who are just listening won’t be able to appreciate this, but what on earth is happening behind you? I think when you and I first met, I thought maybe this was a background, but it is not a background. It’s your real background.
Matt Ranney:
Yep. Yeah.
Tim Veil:
Tell us a little bit about what clearly interests you outside of keeping DoorDash up and around.
Matt Ranney:
Yeah, for sure. So this is a building out behind my garage in Pittsburgh, Pennsylvania, where I live, and it started as a music studio. And then Covid happened, and then it turned into a music studio/office. And so my band plays here. I have a bunch of different instruments and I will screw around with making electronic music or whatever, but, unfortunately for my recreational interests, mostly what I do in here is work. But I do like to make music when I get to.
Tim Veil:
Now, I see a guitar up there. Are you a guitar player? I mean, are there multiple instruments that you play or is it-
Matt Ranney:
Yeah, so there’s an electric guitar over there, and there’s a bass guitar over there, and then… I don’t know if the auto zoom is going to support, but I’ve got some pad controllers and keyboard. There’s a bunch of fun stuff in here.
Tim Veil:
Yeah, for those of you who can’t see, it’s by far the best background I think I’ve ever seen on a Zoom meeting in the last couple years. It seems like it’d be the place to be.
Matt Ranney:
Yep. Yeah, it is a fun place for sure.
Tim Veil:
So maybe as we kind of wrap up here, one of the things I’ve enjoyed listening to or hearing from various folks is kind of things that they’re excited about coming into this upcoming year. Obviously, we’ve spent the last couple years in a world of challenges and change, but, I don’t know, maybe it’s because it’s Spring, maybe it’s because our new fiscal year is starting, I feel like there’s just some optimism. What are some things you’re excited about as you look forward to this upcoming year, whether it’s personally or things you’re doing at DoorDash or with technology? I’m just curious what you’re kind of excited about looking forward.
Matt Ranney:
Yeah, yeah. Well, let’s see. I guess a couple of things. I’m pretty excited that we are now finally working on embracing this kind of change-based data access, so we’re leaning into CDC really hard. And the amount of the… Let’s see, how do I put this? We do way more reads than writes, like way, way more, because we have hundreds of services and they’re all kind of percolating their data around. And I am excited about being able to do fewer trips through the call graph. And so that’s why we are doing a lot with CDC, and I think it’s going to… It’s still pretty early days, but that stuff is going to mature soon and I think that’s going to be very cool. But the new project that I’m working on is, I think, also pretty exciting. It has nothing to do with data storage. Kind of, I guess not really, but it sort of does. Which is something I’ve been wanting to build for a long, long time, and it is a way that we can… What I’ve been calling it, composable event processors. So we have this model where there are these workflows that are important parts of the way the product works, but lots of different people want to make changes to that one thing so the microservices model doesn’t work, because everybody needs to get into this. They want to change what happens when you click the checkout button. It’s like, “That’s a pretty important button,” and a lot of people have different stuff that they want to do in there. So we’re working on building a kind of plug-in architecture for these important flows that will allow different teams to safely make changes to how these core flows work. And the safely is the really interesting part. It’s still early days on that. I hope when you ask me about this again in six months or a year I will have great stories to tell you about how great it is. But anyway, that’s what I’m excited about is it’s a really hard problem.
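Editor's note: Matt does not describe how the plug-in architecture is built, so the following is only a minimal sketch of the general idea, assuming a core checkout flow that runs team-owned processors in order. Every name here (`CheckoutFlow`, `register`, `run`, `order_id`, `total_cents`) is hypothetical and is not DoorDash's design. The "safely" part is approximated by giving each plug-in a copy of the order, rejecting results that alter a protected field, and isolating exceptions so one team's plug-in cannot take down the flow.

```python
import copy
from typing import Callable

# A processor takes an order dict and returns a (possibly modified) order dict.
Processor = Callable[[dict], dict]


class CheckoutFlow:
    """Hypothetical core flow that composes plug-ins owned by different teams."""

    def __init__(self) -> None:
        self._processors: list[tuple[str, Processor]] = []

    def register(self, name: str, processor: Processor) -> None:
        self._processors.append((name, processor))

    def run(self, order: dict) -> tuple[dict, list[tuple[str, str]]]:
        errors: list[tuple[str, str]] = []
        for name, processor in self._processors:
            try:
                # Each plug-in works on a copy, so a failure leaves no partial edits.
                candidate = processor(copy.deepcopy(order))
                # Assumed safety rule: plug-ins may not change the order's identity.
                if not isinstance(candidate, dict) or candidate.get("order_id") != order["order_id"]:
                    raise ValueError("plug-in returned an invalid order")
            except Exception as exc:
                # Record the failure and continue with the last good order.
                errors.append((name, str(exc)))
                continue
            order = candidate
        return order, errors


def add_service_fee(order: dict) -> dict:
    order["total_cents"] += 199
    return order


def broken_plugin(order: dict) -> dict:
    order["total_cents"] = 0  # this edit is discarded because the plug-in fails
    raise RuntimeError("boom")


flow = CheckoutFlow()
flow.register("service_fee", add_service_fee)
flow.register("broken", broken_plugin)
original = {"order_id": "o1", "total_cents": 1000}
result, errors = flow.run(original)
```

In this sketch the broken plug-in's failure is recorded in `errors` while the service fee from the healthy plug-in still applies. A real system would need much stronger guarantees (timeouts, ordering rules, rollout controls), which is presumably where the hard part lies.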
Tim Veil:
The first thing you mentioned with CDC, are you referring to change data capture, kind of the Cockroach thing, or just in more of a general concept?
Matt Ranney:
Yeah. I mean, certainly with all of our Cockroach we are, yes, but we wanted the same from our other storage systems as well.
Tim Veil:
Yeah. Oh, that makes sense. Yeah, change feed, just in general, has been kind of a really interesting thing, just this ability to emit changes as things are happening in the database is certainly pretty interesting and, I know, an area we’ve been making tons and tons of investment and change in.
Matt Ranney:
Yeah, I think that it is incredibly powerful. And that is one of those things that I think when I first heard about it I didn’t quite get like, “Why? What is the big deal with that?”
Tim Veil:
Yeah, what is this thing?
Matt Ranney:
I was like, “Well, yeah, okay, fine, you can make your data warehouse get updated more efficiently.” But building and letting other people build projections of your data, I think, is the really interesting thing.
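Editor's note: as a rough illustration of what "building projections of your data" from a change stream can mean, here is a minimal sketch. It assumes each change event carries a row key and the row's new state (or nothing for a delete). The `ChangeEvent` shape, the `status` and `store_id` fields, and the `OpenOrdersByStore` class are all hypothetical and do not reflect any particular database's changefeed format. The consumer keeps its own read-optimized view up to date instead of calling back through the owning service for every read.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical change record: `after` is None when the row was deleted."""
    key: str
    after: Optional[dict]


class OpenOrdersByStore:
    """A projection: count of open orders per store, maintained from change events."""

    def __init__(self) -> None:
        self._rows: dict[str, dict] = {}   # latest known state of each row
        self.counts: dict[str, int] = {}   # store_id -> number of open orders

    def apply(self, event: ChangeEvent) -> None:
        # Undo the contribution of the previous version of this row, if any.
        old = self._rows.pop(event.key, None)
        if old is not None and old["status"] == "open":
            self._bump(old["store_id"], -1)
        # Add the contribution of the new version, unless the row was deleted.
        if event.after is not None:
            self._rows[event.key] = event.after
            if event.after["status"] == "open":
                self._bump(event.after["store_id"], +1)

    def _bump(self, store_id: str, delta: int) -> None:
        count = self.counts.get(store_id, 0) + delta
        if count:
            self.counts[store_id] = count
        else:
            self.counts.pop(store_id, None)


projection = OpenOrdersByStore()
projection.apply(ChangeEvent("o1", {"store_id": "s1", "status": "open"}))
projection.apply(ChangeEvent("o2", {"store_id": "s1", "status": "open"}))
after_inserts = dict(projection.counts)
projection.apply(ChangeEvent("o1", {"store_id": "s1", "status": "delivered"}))
after_update = dict(projection.counts)
projection.apply(ChangeEvent("o2", None))
after_delete = dict(projection.counts)
```

Reads against `projection.counts` never touch the source table, which is the appeal for a workload that does far more reads than writes.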
Tim Veil:
It’s been fascinating, at least certainly from my perspective, just talking about and working with a bunch of external customers. It’s starting to be kind of the underpinning of a lot of really neat things. And I think as a result, we’re certainly, like I said, investing a lot of time and energy in just making sure that it can meet those needs from a security, stability, reliability, and, maybe more importantly, observability perspective. A final thought, I don’t want to keep you much more, but again, you and I talked about it before, obviously, my background is a bunch of books. I don’t have musical instruments like you, so I have books. Anything you’re reading, that you like to read, that you’d recommend to people about technology or otherwise?
Matt Ranney:
I have to say that when I sit down to read, it is often… Well, I would say it’s almost always not about technology, but I do like to watch a lot of… I watch a lot of conference videos, talks people have given at other conferences, and stuff like that. Yeah, I mean, that’s kind of the main way that I learn about stuff that’s going on is watching people’s conferences.
Tim Veil:
Just on that thought, just verify for me, I mean, do you find, as I do, it’s difficult to stay on top of all the changes that are happening out there with technology? I mean, it seems like every day there’s a new solution to a new problem. Especially in your role, do you find it somewhat overwhelming to stay on top of all the happenings?
Matt Ranney:
Yeah, I mean, I guess I feel like I used to, but I’ve sort of made peace with it because-
Tim Veil:
I think it’s important that you do that, yeah.
Matt Ranney:
Yeah. Well, I think, working at a bigger company, you basically have to. Because while some things, if you all agree that you’re going to go do, you can do amazing things with vast engineering resources, but if you ever want to change your mind, it is actually kind of hard. And so you just have to be okay saying, “Well, this is what we’re going to be doing for the next year or two. And, I mean, I don’t know, maybe there’s some cooler way to do it, but we’re not going to do it because it’s too hard to switch at this point. Maybe we’ll think about it again in a couple years.” I mean, it sounds maybe sad if you’re coming from a startup world, but, in a way, it’s freeing, because you’re just like, “This is what we’re doing. We’re doing this now.”
Tim Veil:
No, listen, I think that’s the reality of the world today, and I think it’s certainly true on our end being a storage provider, an OLTP database. I mean, those are not probably changes people want to be making a whole heck of a lot of times in their career, and so it is both freeing and sometimes challenging to be in that space. Matt, I really, really enjoyed this chat with you, getting to know you, getting to know your background and history. You’ve done some amazing things. We could have probably spent hours, and we may have you back to talk more, but we certainly could have spent even longer today talking about all the interesting work that you’ve done. So really, really appreciate your time here today. Thank you very much for being on the episode.
Matt Ranney:
Great. You’re welcome. Happy to do it. Yeah, thanks for having me.
Big Ideas in App Architecture
A podcast for architects and engineers who are building modern, data-intensive applications and systems. In each weekly episode, an innovator joins host Tim Veil to share useful insights from their experiences building reliable, scalable, maintainable systems.
Tim Veil
Host, Big Ideas in App Architecture
Cockroach Labs