Engineering resilient systems: Rescuing old treasures and unleashing modern capabilities
Author, Engineering Leader, Systems Geek
Are legacy systems just outdated systems? The answer is, it’s complicated…
In this episode, we’re joined by Marianne Bellotti, author of “Kill It With Fire”. Marianne has built data infrastructure for the United Nations and tackled some of the oldest and most complicated computer systems in the world as part of the United States Digital Service.
Join us as we discuss:
I think people tend to assume that technology advances linearly, like we get progressively more advanced as we move on, and when you actually kind of look at the trends in technology, what you start to realize is that it’s cycles. It is moving to greater capacity, it’s moving to greater speed, more data. There are advancements that look linear, but the actual paradigms that we use tend to be cyclical.
What is up, everyone, and thanks for tuning in. In today’s episode of The Big Ideas in App Architecture podcast, we speak to Marianne Bellotti, a software engineer and author of the book Kill It With Fire. In this episode, Marianne and I talk about large system rescue and approaches executives and teams should take when working on these complex modernization projects. Pump up that volume and get ready for an insightful conversation with Marianne Bellotti.
Welcome to the podcast, Marianne. How are you doing today?
I’m doing great. How are you?
I’m doing great. I was so excited to have you on the podcast today. Even though we’ve had a pre-conversation, it’s not every day that you get to talk to somebody who’s written a book, right? You’re one of the first guests I’m speaking to who has actually written an entire book. For the audience and everyone listening, Marianne has written this book called Kill It With Fire, and as somebody who works in the space of technology, you really don’t talk about killing systems. Marianne, as we begin, let us know what you’re doing nowadays. I know I already talked about you being an author, but you’ve been in tech for a while, so let everyone know who you are, what you’re doing right now, and kind of do an introduction for the people.
Sure. My name is Marianne Bellotti. I’ve been in tech for … I mean, a good solid 20 years, but I started programming when I was 13 years old. If you count from when I actually first started working with computers and programming computers, it’s actually … I don’t want to date how old I am, but it’s been much, much longer than that. I don’t always count that, because I like to joke that I spent the first 10 years of my professional career trying as desperately as possible not to become a software engineer. I wanted to do international development, I wanted to travel, I wanted to go out and see the world. At the time, that really wasn’t what computer people did. That ended up being hugely beneficial to me, because it built out this set of skills around how organizations work, and this notion that not every place was the place that I lived in and the community that I grew up in, that other people had different ways of organizing things.
When I finally did come back into technology, around probably 2010, and became a professional software engineer, I was coming with a very different perspective on how to think about systems. I worked basically as a data engineer somewhere at the private sector, public sector kind of intersection. I did a lot of work for organizations like the UN, some government work, and then some startups in the New York area. As I started to grow my career, I realized that the work that I really loved doing was the system rescue type of work. Whereas other engineers were afraid to be on call, or didn’t want to be in the incident room and would panic, “Please, nothing go wrong while I am near the system,” I was kind of like, “Yeah, let’s go. This is great.”
I mean, you have to be a very strange sort of personality to enjoy that type of work, and I found that I did enjoy it quite a lot, and as it went on, I found that I actually seemed to be somewhat good at it. That has sort of been my specialization since then: I tend to work on a lot of legacy systems, a lot of old computer systems.
I look at you in three specific functions, or ways you’re operating right now. You’ve worked on system rescue, helping people who are on these old systems and want to move to modern systems. You also have a function as an author. And you also help companies with training, helping their folks learn how they need to look at these systems and evolve and plan these things out, with the experience that you’ve had. What was really interesting for me was also the fact that you call yourself a software engineer and a relapsed anthropologist.
Yeah. I mean, when I started doing the legacy modernization work, everybody thought about that as being a completely different type of technical work. It was discussed as if modernizing systems was an entirely different process from building them. One of the first things I noticed when I got into the weeds on it is that the things that worked when you were modernizing systems also worked really well when you were building new systems. I think that basically comes from that background in anthropology. When I went to college, I was kind of bored with the idea of being a computer science major, so I was an anthro major, and that is how I got to go gallivanting around the world for the first 10 years of my career, as a person who’d studied anthropology. I find that it’s at the foundation of how I think; I always go back to that perspective.
Honestly, computer science is not unfriendly to that perspective at all. I think we kind of whitewashed that narrative out of our history. People know Conway’s Law; very few people have read Conway’s actual essay that developed that law. You’ll listen to people talk about Conway’s Law, and they’ll talk about it as if, in order to get the correct architecture on a project, you should reorg. I’m like, “That’s not all he’s saying.” Yes, if you wanted to give it a one-sentence summary, “people build systems that look like their org charts” is an accurate one-sentence summary of Conway’s Law, but when you read his actual essay, what he’s talking about is not where the boxes are on the PowerPoint org chart. He’s talking about communication pathways, he’s talking about information silos, he’s talking about how people actually relate to one another, how they communicate with one another, and how they identify who is in their ingroup and who is in their outgroup.
He’s not talking about arbitrarily moving things around Confluence and now you’re fixed. It’s really interesting the degree to which anthropology and the study of organizational science heavily influenced our thinking about computer systems at the very beginning, and yet how committed we are to ignoring that fact. I found a nice little happy niche in this area. To other people, it sometimes seems odd that I have this specialization and yet I work primarily in a technical field, where we’re either writing code or supervising people who are writing code, but I think for me it’s the best; it fits like a glove. It’s so much fun.
Part of what you were saying is your primary focus, or your philosophy in how you’re working right now: how culture influences the implementation and the development of software. I’ve never come across people who are also looking at that aspect. When you were talking about Conway’s Law, you touched on some of the things that really fascinated me when I saw your profile and when we were having this conversation. Let’s dive a little bit deeper into this rescue operation that you’re doing for large-scale systems. Can you expand on that a little more? Of course, we don’t have to go through everything … talk about some of the highlights of these rescue operations.
I would say that a lot of times the first question, the most insightful question, is: why is this a legacy system in the first place? Because people act like, “Oh, well, it’s legacy because it’s old,” but really it’s legacy because it’s not maintained. There are plenty of systems that have been running for decades that people do not identify as legacy systems and do not complain about, because there’s a lot of effort put into operating and maintaining them. They don’t feel like they’re old systems because we change them on a regular basis, we keep them up to date, we keep them fresh. Inevitably, the stuff that we identify as legacy is stuff that’s been neglected. That’s a situation that happened for a very specific reason. There’s something about this system that either doesn’t provide enough value or whose value the organization doesn’t understand.
One of the first things I tell new engineers in this world is that your natural instinct is going to be to sort of belittle and put the system down, because all you’re going to see is all the things that are wrong with it, all the things that are strange and foreign and just seem like not the way you would do things today. But the thing you have to remember about legacy systems is that legacy systems, at the end of the day, are successful systems. People use them, they provide some sort of value, because if they didn’t, we’d have the easiest modernization path in the world: we’d just turn the thing off. If we can’t turn the thing off, it’s because it’s providing some sort of value, but that value is also not being well communicated to the organization itself, because they haven’t actually been investing in it and maintaining it.
That’s really how I structure my engagements: what is the value that this thing is bringing to the table, and why aren’t we communicating that? When I plan out what we do and what order we do it in, I always take a tack that for a lot of people comes off as really counterintuitive. My first instinct is to find the biggest, most complicated part of the problem and to tackle that, because you have the most buy-in and the most momentum in the beginning. Over time, your executive stakeholders, particularly non-technical ones, are going to look at your modernization effort and go, “This is still going on? I need funding over here for this thing, and there’s funding over here. Let’s cut the funding over here.”
When do you want to do the biggest, hardest, most complicated part of the project? Do you want to do it when you have little funding, when you’re fighting for staffing, when you have no attention from executive stakeholders, or do you want to do it in the beginning, when you have the most resources you ever will have? Then on top of that, when you’re successful tackling that tough, meaty problem, everything else becomes easier at that point, because you have the momentum, you have the proof of concept, and the snowball is just rolling down the hill. A lot of times what I’ll see happen with modernization projects is they want to migrate to a new platform, and so they start with something small and simple to prove it out, and then they get a little bit bigger and a little bit bigger.
Well, what happens when you get to the more complicated use case and the platform you want to use doesn’t work? You’ve now migrated half of the system onto this other thing and you can’t migrate the other half, so you have this Frankenstein system; you’ve literally made things worse. My first instinct is always to go for … let’s put things in order of their impact and their complexity, and then see if we can find something on the higher side of that scale where we feel like we could do this successfully.
Great. No, when you were saying that, there were two comments that you made that I really enjoyed, or relate with. One is the comment that when new people come in to look at an old system, they look at it, as you were saying, belittling it, because they know something new and they have this latest tech going on and they’re like, “Well, this thing is so old, we need to move it.” But they forget, and I’m glad that you brought it up, the point that a legacy system has been running successfully for quite some while. That’s something, I would say, from an empathetic point of view, from the social idea of being developers and working in an ecosystem of developers or with a team: we have to realize that those systems work well.
The question is, and I kind of bring that up myself when I’m looking at migrating old stuff: how important is it for these old things to look new? You’re talking about modernizing; how important is it? What are you losing? As a company, I mean, I go back to looking at objectives: if the company doesn’t do this, what is the loss? The business objectives are equally important for me. I really liked what you were saying there. The second aspect I liked is, and I recently had somebody else on the podcast who actually did this major migration across their company and did some legacy migration.
I liked what he said: they went after the worst, most complex system first, because they felt, as to your point, if they could do that one and that one worked, that means anything and everything would work, because you’ve dealt with the worst closet in your room first. I appreciate that you brought those points up as a seasoned rescuer of old legacy systems. The question I wanted to ask you was: when you recommend that to somebody who is thinking of going for the easiest problem to solve, how do they take it, and how do you help them change their mind towards looking at the complex system as the priority?
I sometimes get misclassified as a security expert, because so often that’s the lever that I pull to get people comfortable with these things. When we’re doing kind of a system overview, I am looking at security impacts and security vulnerabilities, because it is just the most effective way to get non-technical parts of the organization aligned and responding. The idea that they might get hacked, that they might be vulnerable to ransomware, that something awful in the Hollywood blockbuster style of things might happen to them, helps motivate and align everybody. But I also want to pull out something that you said before. I think the risk-taking is balanced by the resistance to fetishizing new technology, which is definitely a thing that actually happens. I don’t mind systems that are written in COBOL, I don’t give a damn. Your system is entirely in COBOL? Excellent, great, wonderful. I care about your ability to maintain it.
I have this concept that I talk about with people all the time. I’ll use the term modernization because it helps people figure out what I’m talking about, but my preference actually isn’t to modernize systems necessarily. I say restorative operational excellence. I don’t care how old the technology is, I care that you have people that know how to run it, that there’s more than one guy that knows what the thing does, that you feel comfortable operating and maintaining it. It’s that balancing act of understanding why it’s actually necessary to make a change in a particular way, and then being willing to take the risks.
I have this thing that I do with my engineering groups when we’re on a legacy system. It’s kind of like a game. One year I was on this system that had been built in the mid-’90s, and it was data infrastructure, and I actually thought it was quite remarkable, but everybody on my team thought, “Oh God, this is an awful, awful mess. Just terrible.” It was doing multi-master replication, but again, it was built at a time when multi-master replication wasn’t actually a feature of SQL databases. Then it was doing what we would do with S3 today. Again, S3 would be the perfect thing to use, but it didn’t exist. I had a group of young engineers who were very much like, “This system sucks,” blah, blah, blah, “The engineers who built it must be idiots,” blah, blah, blah, and so finally one day I was like, “All right, let’s whiteboard this out. These are the features and the requirements of the system today. This is what it does. How would you build it? But you’ve got to build it in 1993.”
So they’re like, “Yeah, no problem. Easy.” They get up to the whiteboard, they start drawing it out. I’m like, “Okay, this wasn’t invented until 2007. It’s gone. This wasn’t around until 1999, and it’s 1993, so sorry, you’re a couple years away.” As we kept going through it over and over again, and they kept redoing the architecture, eventually there was a point where they stopped and realized that they were looking at the architecture that was built in 1993. They had essentially done the exact same thing and used the exact same solutions, and that was an incredible exercise, because again, like you said, it triggered empathy. Suddenly they realized the people they were talking to were not idiots; they actually built something really amazing, because they built it before any of the tooling existed to do it the way we would do it today. It helped them understand and value the system, and therefore come into those conversations with a much better attitude, rather than just looking and being like, “Ew, I don’t want to touch this.”
I have so much respect for people who have built these systems. We go into conversations, like we have at Cockroach Labs or other companies that I’ve worked with, where we have essentially helped companies scale with the massive amount of data that they’re generating. Sometimes old systems don’t work, but whenever we go into conversations, I at least have observed that these people who have been doing this for 20, 30 years really bring a perspective that maybe sometimes even we are not thinking about from a business point of view, because they have so much expertise. I feel like it’s really awesome that you are advocating for bringing that idea back into how we do migrations and scenarios like that.
One question I wanted to ask was related to what you were saying. When we are talking about these large systems or legacy systems, what I have personally experienced is folks [inaudible 00:19:02] IBM mainframe or Db2 and things like that. Without going into details of accounts where you worked, what are these large systems that you’re talking about? Where were folks at, and what were the platforms that you moved them to that you felt were the right choice for them, technology-wise?
I see a lot of hybrid mainframe cloud type stuff, and what I mean by that is that the original system is still there, sometimes still on the original hardware, which always blows my mind whenever I show up. There’s literally a machine from the ’80s running in the basement somewhere.
And you’re real happy [inaudible 00:19:49].
I’ve seen that happen.
That’s a different story. They’ll have the original system running, and then over time there’s a desire by the business side of the organization (when I say business, I mean primarily the business mission; they aren’t necessarily for-profits). The business side of the organization at some point, probably early 2000s into mid 2000s, kind of goes, “Oh, there’s this thing called the internet. We should be getting data there. We should get connected to the internet.” So they build these layers of middleware of various stages of complexity, usually in Java, to sit on top of the mainframe and interact between the mainframe and the public internet. Sometimes when we come in, they’ve already done a couple of cycles of modernization and they’ve moved that Java middleware into the cloud, and now they’re trying to figure out how to get the mainframe code into the cloud, which is a thing you can do.
I think if you pay IBM lots of money, they’ll help you do it. It’s a thing. But sometimes not, right? A lot of times you see these outages on mainframe systems and COBOL gets thrown around. The last one that I think really got a lot of attention was the state of New Jersey: when they were doing their COVID benefits processing, they had a major outage, and people tend to gravitate towards the mainframe and COBOL part of that story. I had a bunch of friends who called me, and I was not in the loop on the New Jersey thing, although I had a lot of former colleagues who were. So I said the same thing I always say in these sorts of situations: it’s almost never the COBOL or the mainframe that actually triggers an outage like this.
Every single time I’ve been thrown into an outage like this, it’s the Java, 100%. Something in that middle layer goes haywire and triggers a huge, colossal outage. When your mainframe hardware fails, mainframes just pick back up wherever they left off. They’re very tolerant. Hardware does fail and they do have outages, but it’s like, “Okay, we swap the hardware out, we turn the mainframe back on, away you go.” Same thing with a COBOL problem: generally you just restart the job and away you go. When you see these massive outages, it’s almost always in the middleware. So that becomes the question of, “Well, what do we do with all of this? If we’re moving it all into the cloud …” I’m not a huge fan of moving COBOL into the cloud. I feel like that’s a little counterintuitive.
My preference would be, if our decision is to move everything into the cloud because we have decided that our primary interaction point with our customers is going to be through the internet, then that’s a situation where we look at rewriting the application as a whole. But even then, what I tend to do is peel a service off and then integrate it back in, peel a service off and then integrate it back in. These are the kinds of challenges that we have, and I think what’s lovely about these kinds of projects is that there are no silver bullets. The technique I used in one that was incredibly successful could be the absolute wrong thing to do in the next one. It’s not always about the technology. Again, it’s a lot about the organization and what they’re willing to invest in.
If you have an organization that isn’t willing to invest in their people, for whatever reason, then saying to them, “You need to rewrite it in Python and retrain your entire tech staff” is going to be an awful solution, because they’re just not there with you. They’re not willing to retrain their people, so they’re going to end up with a thing they don’t know how to run in the end.
I mean, what you said is interesting, because I think that aspect of retraining, and how much effort is required from the point of view of training folks, is very critical. What I have observed, or one of my thoughts around this migration story, is that back in the day, when it came to running legacy systems, say mainframes, the people who needed mission-critical fault tolerance, those fundamental aspects, stayed on IBM mainframes and systems like that. But in today’s day and age, pretty much everybody needs some sort of mission criticality, where the application has to stay up, because if something goes down, somebody will go on Twitter and say, “Hey, this app doesn’t work.”
That affects the customer experience, affects the business, affects the revenue, affects everything. Gone are the days where you could take a chance with, say, a less scalable system or a system that would go down. Today it’s a table-stakes requirement to have a system that is fault-tolerant, that has disaster recovery, that has really, really tight RTO and RPO. Those are the experiences that folks are trying to provide. In recent years, have you observed folks just moving towards the cloud, or moving to an architecture that is generally fault-tolerant for everything, or are they still trying to use the same functions from before?
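Those availability conversations come down to simple arithmetic. A minimal sketch of how an uptime target translates into a yearly downtime budget; the targets below are illustrative, not figures from the conversation:

```python
# Translate an availability target ("nines") into a downtime budget,
# the arithmetic behind tight RTO/RPO conversations.
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_budget_minutes(availability: float) -> float:
    """Minutes of allowed downtime per year for a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR

for nines in (0.99, 0.999, 0.9999):
    print(f"{nines:.2%} uptime allows {downtime_budget_minutes(nines):.1f} min/year down")
```

The jump from two nines to four nines shrinks the budget from roughly 87 hours a year to under an hour, which is why "always on" is a real engineering investment rather than a checkbox.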
Well, when you say generally fault-tolerant, I want to say I haven’t had many people who want to migrate onto Erlang, which I think is the most fault-tolerant. A little tiny shout-out to the Elixir community there. There’s still a really heavy push to move to the cloud, which I’m not always convinced is the right thing, and I say that to clients. For me, the question of moving to the cloud is: where does your data originate, and when you’re done processing it, where are you going to store it long-term? Because if your data originates on the public internet, then sure, let’s move everything to the cloud, because you keep it all in the cloud. But if your data originates off the cloud, and then ultimately you’re going to use the cloud to do your processing and then it’s going to move off the cloud again, you are going to get killed with egress charges, just absolutely murdered.
I think a lot of people just take it on face value that the cloud will save them money. I’m like, “No.” Amazon is very, very good at hiding their expenses. Every modern-day software engineer knows this pain of the Amazon bill that’s the length of your forearm, that has things you never imagined you were paying for. It can be a very good solution. I think it’s still a really good fit for probably 90% of use cases, but it’s not a perfect fit for everybody. It’s this question about what are your business processes, what are your values, where do you actually want to invest in general?
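The egress trap described here is easy to estimate up front. A back-of-the-envelope sketch; the per-GB rate is an assumed placeholder for illustration, since real data-transfer-out pricing is tiered and varies by provider and destination:

```python
# Rough estimate of monthly cloud egress (data-transfer-out) charges.
# The default rate is an illustrative assumption, not a quoted price.
def monthly_egress_cost(gb_out_per_day: float, usd_per_gb: float = 0.09) -> float:
    """Estimate a month (30 days) of data-transfer-out charges in USD."""
    return gb_out_per_day * 30 * usd_per_gb

# A pipeline that pulls 500 GB/day back off the cloud after processing:
print(f"${monthly_egress_cost(500):,.2f}/month")
```

Running a number like this for a pipeline whose data has to leave the cloud again after processing is often enough to surface the "absolutely murdered" scenario before the first bill does.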
I have a complicated relationship with this idea of fault tolerance, because I always think about how things are heard by people who don’t have any experience with technology. I think when we talk about fault tolerance, we get into this habit of thinking about system failure as being, one, a preventable thing and, two, a bad thing. I don’t think it’s either. First, I think all technology fails eventually at some point. There is no technology that operates perfectly all the time. I think it’s good for people to sort of understand that and really internalize it and not be scared of it. We’re in the middle of doing a modernization project right now, and one of the executive stakeholders said to me, “Well, okay, this plan is fine as long as it doesn’t go down,” and I’m like, “It’s going to go down, man.”
It’s on a proprietary platform that I had never heard of, one so niche they actually have to issue special laptops to the people who need to access the system, because that’s the only way it will run. I was like, “We are going to attempt to move it onto a commercial cloud provider, and in theory we think we have all the dependencies figured out, we think we have all the environmental factors figured out,” but everybody who’s been through a first deploy of anything, even a brand new piece of technology, knows the first time you turn that crank, what comes out is not necessarily nice sausage. For me it was just managing that expectation: if your thought is, “Oh, fault-tolerant means that we’re just going to turn on the new deploy and turn off the old deploy and everything will be perfect,” I’ve never seen that happen. In my entire career in software, I’ve never seen that happen. So it’s sort of [inaudible 00:28:54].
I completely agree. I mean, when was the last time we did not see an outage on an Amazon region in [inaudible 00:29:00]? It happens so frequently, and it happens in Google too. My thought process is that we are in the process of building the best possible fault-tolerant system that is available. However, shit happens, and it can affect … what engineers and architects have to do is build a system that is most likely on all the time, or has a scenario where things are always available. That’s what we are chasing. I mean, especially with the folks we come across who are trying to work on mission-critical applications, we help them with our architecture, because we have a peerless architecture and we write and replicate data across multiple regions and, if you want to, across multiple clouds.
But again, the challenge is that not everybody is aware of the costs associated with the cloud. As you were saying, nobody talks about the amount of money somebody has to spend on data transfer or egress costs between AWS and Google Cloud. That’s something nobody thinks about. Well, people think about it, but they really don’t know until they get that first bill.
People who have gotten a long-ass bill think about it, no one else does.
They think about it when the bill comes, and then they’ll ask the architect, “Were you not aware of this?” When you’re trying to do the first deployment, things can generally go down, but of course you’ve put so much work into it, you know how to fix it.
I think what’s important about managing these kinds of expectations is that my concern is never “will the thing go down or won’t the thing go down.” My concern is: is the thing going down going to mean that people get cold feet and cancel the whole project, or cripple the whole project? Because then you end up worse off; you have kind of half-finished work, which is just more technical debt, like sprinkles of technical debt on top of your sundae of technical debt. I had a customer a couple months ago who wanted to do a load test, and I thought, “Great, let’s do a load test on the system. That’s excellent.” They’re like, “Yes, we’re going to test up to 100 million concurrent users,” and I had to be like, “Okay, that’s a third of the population of the United States. That’s more people than watch the Super Bowl.”
Sure, if they’re going to err on a side, err on the side of being over-prepared rather than unprepared. That’s fine. But I was concerned that when their infrastructure did not scale to 100 million concurrent users, they would read too much into that as a signal. I was just pushing back on, “Realistically speaking, how many people actually use this service at one time? Does a third of the US population use the service at one time, or is it maybe only a million concurrent users at one time?” I think my concern with these sorts of things is never really about: is it going to fail? It’s more about: what is the significance of that failure, and will we lose some of the buy-in that we need if it fails?
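One way to ground a load-test target like that is to estimate realistic peak concurrency from observed traffic. A rough sketch using Little’s law (average concurrency = arrival rate × time in system); the traffic numbers and peak factor here are illustrative assumptions, not figures from the conversation:

```python
# Sanity-check a load-test target: estimate peak concurrent users
# from daily session counts via Little's law (L = lambda * W).
def est_concurrent_users(daily_sessions: float,
                         avg_session_seconds: float,
                         peak_factor: float = 3.0) -> float:
    """Average concurrency scaled by an assumed peak-to-average factor."""
    arrival_rate = daily_sessions / 86_400  # sessions per second
    return arrival_rate * avg_session_seconds * peak_factor

# e.g. 2M sessions/day, 5-minute sessions, 3x peak multiplier:
print(round(est_concurrent_users(2_000_000, 300)))
```

Even two million sessions a day with five-minute sessions and a 3x peak multiplier lands around twenty thousand concurrent users, orders of magnitude below a Super Bowl-sized target.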
When the answer to that is yes, then my emphasis becomes almost [inaudible 00:32:27]. I sort of become a spin doctor. I want to be on the side of those executives and sort of hold their hand and go, “Look, it’s going to fail. It’s fine. We expected it to fail. Everything’s great. Don’t worry. Isn’t this wonderful? It failed.”
I think the key, to your point, is to level set with the folks you are trying to engage with. I’ve been in situations where we have to run performance tests and to your point, not everybody is running Black Friday level workloads every day. Not everybody is running the Super Bowl level workloads every day. However, you do have to build a system that can scale, of course, and what happens there is that folks say, “Well, I want a system like that”, and then we say, “Well, that requires this much amount of investment in terms of hardware and building”, and then that becomes a bottleneck. There is this negotiation where we come together on what is the ideal solution that is optimized for what you really need for running your business and for whatever objectives and goals you have.
I think this kind of experience is repeated across multiple places. Every team right now that is working at a different tech company is having this sort of conversation about infrastructure and investment and all these things with respect to the business objective. I like what you said. I mean, I think it’s true to where it is. Anyway, I wanted to ask you something; I know we are moving very quickly with time here. About your book: you wrote this book Kill It With Fire, and initially when I read the title, I thought maybe this was a fictional book or something, because it felt like a Mockingbird, Hunger Games kind of title. Tell me a little bit about why you wrote a book, what this book is all about, and what led you to this.
The title, I admit, is a little bit of a bait and switch because it’s very aggressive. Not only is it called Kill It with Fire, but on the cover is a dumpster fire, and that was actually a requirement I gave my publisher when we negotiated my deal. I was like, “I want there to be a dumpster fire on the cover of this book,” and they were like, “Okay.” And yet inside the book, I am a huge fan of legacy systems, the way people are fans of Antiques Roadshow. I think they’re fascinating. I love old computer systems. I’m not turned off by them at all, and I largely have this celebratory attitude that, again, legacy systems are successful systems. The book is broadly about how to nurture such systems and restore them to a certain level of operations, which is the exact opposite of what the cover suggests.
I think most people who go into technology today envision themselves working for a hot startup, building everything from scratch. A lot of times the narrative around what it’s like to be a software engineer assumes you’re building something from scratch. I’ll say to professional software engineers, “When did you ever build … how often does that really happen in your career?” Normally you start a new job and there are established code bases you have to figure out; that’s much more common than building things from scratch. All of the skills that you need when you’re modernizing a system, when you’re working with a really complex legacy system, are also really valuable when you’re working on newer systems. I wanted the tone of the book, and the book itself, to present in a way that was compelling to people who don’t think they’ll ever be modernizing systems, so that they could absorb the knowledge.
Many folks who are in the process of what you just said experience a dumpster fire; that’s a common experience, or a common term we use. “Oh, what are you up to today?” “I’m just trying to deal with the dumpster fire.” It’s relatable, and I think that’s what I liked about the title and everything. I’m obviously going to read the book, because I ordered it on Amazon, and when it comes I’ll give you my two cents on it. Although I have not worked much on old systems like COBOL, I do talk to a lot of people who are in the process of migration, so hopefully it’ll help me. Thanks for writing that; it’s hopeful for people like us. That’s cool. I remember, Marianne, when you and I were talking, we were talking about this whole idea of how systems have changed.
I was telling you, “Hey, look at this. We had SQL to NoSQL, and now we have distributed SQL. We went from infrastructure as it was to VMs and now Kubernetes,” and the pattern that I wanted to talk to you about was this return of the monolith, where we went from monolithic systems, broke them into agile systems and lambda functions and microservices, even nano-services, and now we’re going back to the monolith. What’s your perspective on what’s happening with that part of tech right now in software engineering?
Well, two things in general, and I describe this at length in the book: I think people tend to assume that technology advances linearly, that we get progressively more advanced as we move on, and when you actually look at the trends in technology, what you start to realize is that it’s cycles. It is moving to greater capacity, it’s moving to greater speed, more data. There are advancements that look linear, but the actual paradigms that we use tend to be cyclical. One of the stories I tell is about what kicked off that train of thought for me: I was working on a mainframe system, and one of the curmudgeonly engineers very dismissively said to me, “They want us to migrate. In the ’90s, they wanted us to get rid of our thin clients talking to a mainframe and migrate everything to fat clients on desktop applications, and now they want us to migrate to thin clients on the cloud. This is fundamentally the same thing. We used to be timesharing on mainframes; now we’re timesharing on Amazon’s giant compute environment.”
I realized that he had a point. The implementation had changed, and a lot of the details of the protocols and things like that had changed, but fundamentally the paradigm of timesharing on a mainframe versus buying time on a compute cluster is the same. You can see those patterns everywhere, where we’re rotating through these cycles, and in the book I go into a lot of detail on the science behind why that is and what forces push us to shift. I suspect that we’re going to see another shift within the next 10 to 20 years, out of these cloud software-as-a-service environments into something that follows a private data center model but doesn’t look like a private data center.
Personally, if I were putting a bet on it, I would look at what we’ve been doing with mesh networks and local-first software, because in addition to all of the economic reasons why I think we’ll be pushed there, I think there’s a huge desire for privacy and data sovereignty, and a lot of discomfort with all of our data just sitting in a giant pool owned by some corporation somewhere. I think we will see those cycles again. One of the things that’s already happening is that optical computing is pushing us back into analog machines, which I find fascinating. I wrote a whole blog post about this on my Medium; I’ll give you the link so you can send it to your listeners if they want to look it up. Because of the structure of how light works, we’re able to build better processors for optical computers if they follow an analog pattern versus a digital pattern. So that’s fascinating.
I wanted to ask you, because you were talking about this: what’s the tech trend you’ve seen in the last, say, six to nine months, or maybe the last year or two, that you’re really excited about for the future?
I’m really, really into formal methods. A large part of what I do is about how people reason about systems. I do the organizational part, but then once you get to the engineers, you’re dealing with a system where, if there is documentation, who knows when it was last updated. A lot of times we don’t have any tests. For me, on a new system, the first thing I want to see is the tests. Ideally the unit tests, but if we don’t have unit tests, I’ll take regression tests, I’ll take integration tests, I’ll take manual test cases, I will take whatever I can get, because for me this is the purest, most accurate form of documentation: that description of, we did this, and the system should do this and not this. A lot of times on the older systems none of that exists, and people are flying blind. They’re afraid to make changes because they have absolutely no idea if the changes are going to affect other processes they’re not aware of, and maybe the system won’t even crash, just make mistakes.
I’m very interested in how people reason about systems, how they build mental models of systems, how they transfer that knowledge to other people. I think there are a lot of exciting things going on in this space around using formal modeling to help people find bugs in systems, deduce the behavior of systems, and manage complexity. What has historically blocked that set of technologies from really taking off is that the models are really, really difficult to write. Then once they’re written, you have fundamentally the same problem you have with documentation, which is that they’re out of date almost immediately, and then they no longer reflect what the system looks like.
I’ve seen both new methods and new approaches to how we apply those models. One example: the rise of chaos engineering and that kind of complexity testing has created a scenario where perhaps the purpose of the model isn’t implementation. Perhaps the purpose of the model is to do baseline hypothesis testing, so that you come into a large-scale failure test with a really good idea of how the system is supposed to behave and exactly how you might be able to trigger failure, instead of just turning things off and hoping something interesting happens. We’re seeing those kinds of patterns, but what I think is going to be really interesting is that a lot of this formal logic is arguably the first generation of AI, and now you have the second generation of AI, based on statistical models, with its power in generating and translating content.
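The kind of bug-finding she describes can be illustrated with a toy exhaustive state-space search, in the spirit of a model checker. This is a minimal sketch, not any particular formal-methods tool: it explores every interleaving of a deliberately broken two-process lock, where each process checks the other’s flag before setting its own, and finds the interleaving that violates mutual exclusion.

```python
# Toy exhaustive search over a buggy two-process lock.
# Each process: check the other's flag (pc 0 -> 1), then set its own
# flag and enter the critical section (pc 1 -> 2), then exit (pc 2 -> 0).
# Checking before setting is the bug a model checker would catch.

def step(state, proc):
    pc = list(state[:2])
    flag = list(state[2:])
    other = 1 - proc
    if pc[proc] == 0 and flag[other] == 0:
        pc[proc] = 1                      # passed the (stale) check
    elif pc[proc] == 1:
        flag[proc] = 1                    # set flag, enter critical section
        pc[proc] = 2
    elif pc[proc] == 2:
        flag[proc] = 0                    # leave, clear flag
        pc[proc] = 0
    else:
        return None                       # blocked: no transition
    return (pc[0], pc[1], flag[0], flag[1])

def find_mutex_violation():
    """Explore every reachable interleaving; return a state where both
    processes are in the critical section, or None if the invariant holds."""
    seen, frontier = set(), [(0, 0, 0, 0)]
    while frontier:
        s = frontier.pop()
        if s in seen:
            continue
        seen.add(s)
        if s[0] == 2 and s[1] == 2:
            return s                      # mutual exclusion violated
        for p in (0, 1):
            nxt = step(s, p)
            if nxt is not None:
                frontier.append(nxt)
    return None
```

Running `find_mutex_violation()` turns up the interleaving where both processes pass the check before either sets its flag: exactly the class of race that is hard to hit in ordinary testing but trivial for exhaustive exploration of a small model.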
I’ve had multiple people come up to me and say, “Can we use the language model to write the specification, or to write the logic model?” I find that idea really, really interesting, because I think the answer will be kind of, but not really. I think the answer will be that the language model can write the first draft of what the specification should be, and then you have to refine it with the details. But even if you only get that far, it would be game-changing for a lot of technical orgs to be able to create these kinds of models and run them as simulations of how their systems behave. I get really excited about that sort of stuff.
I also feel like that’s a really good … I mean, what you mentioned is a great use of how AI can help companies and enterprises. It’ll be interesting to see how that shapes up. Already we can see patterns where AI produces really great boilerplate code, and again, it requires a good engineer’s creativity to turn that into masterful code. I’m fascinated by that aspect as well. As we get toward the close of the podcast: recently a few people hit me up from Kenya, and some people reached out from Australia. The podcast is going across continents now, beyond North America, and it’s interesting that a lot of the people who reached out are people who are building systems and asking, “Hey, how can people work on complex systems and problems and solve those things?”
I’m really excited that they get to listen to somebody like you in the next episode and learn how to deal with that. But as we wrap up, and shout out to Warren from Kenya who reached out to me, what is your advice for software engineers today who are trying to build systems? What are the two or three things you think they should do on a day-to-day basis to make sure they’re building great systems?
Well, I would say don’t go out and deliberately build complex systems. I know for a lot of people, saying you built a system of a certain size or scale is something you want to put on your resume. But if you have a choice between building a simple system and a complex system, definitely build the simple system. Then I would say that my biggest architectural north star is always the capacity of the engineers and the engineering team. We were talking about the restoration of the monolith’s reputation and the trend away from services and microservices and things like that. I’ve always told people that the architecture of your system is determined by how many six-person on-call rotations you can run. It’s six people because, in order to have a nice, healthy on-call rotation, I want there to be a fallback person, so that it isn’t the case that one person misses a page and everything’s a disaster.
If you have a six-person rotation, you can run a primary and a secondary, and you can run a one-week on-call where people are basically on call once a month. That’s usually pretty sustainable for people; it’s not too stressful. So that’s my peg for how those things should be structured, and it becomes a really great forcing function for talking about exactly how many services we can maintain without disaster. A lot of times what happens is people are like, “Everything has to scale, everything has to scale, there should be nothing that can’t scale,” and it’s like, well, that’s just not realistic. By talking about the capacity of your people and what kind of work-life balance you want the people who are going to be running this to have, you get a good, solid reference point to start to plan your architecture around; it’s based on something you know. That’s always been my principle for how to design systems. I think about how many people I have and how I best want to use their time.
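Her six-person rule of thumb can be sketched as a simple schedule generator. This is a hypothetical illustration: the names and the specific primary/secondary layout are my assumptions, not a prescription from the episode.

```python
def build_rotation(engineers, weeks):
    """One way to lay out a primary/secondary on-call rotation:
    this week's secondary becomes next week's primary, so everyone
    gets a shadowing week before carrying the pager alone."""
    n = len(engineers)
    return [(engineers[w % n], engineers[(w + 1) % n]) for w in range(weeks)]

team = ["ana", "ben", "cai", "dee", "eli", "fay"]  # hypothetical names
schedule = build_rotation(team, 12)
# With six people, each engineer is primary one week in six and holds
# some on-call role two weeks out of every six -- roughly the "on call
# once a month" cadence described above.
```

The useful part is the arithmetic, not the code: once you fix the shift length and the number of roles per week, team size directly determines how often each person carries the pager, which in turn bounds how many independent services that team can realistically own.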
I think one thing I really like about what you’re saying is the realness with which you approach building systems. You have to consider who’s going to manage and operate it, and those are things we don’t think about when we’re building these systems right now. I’m glad you brought it up, and it’s awesome that we have people like you looking at these fundamental ideas, bringing them into the industry, and helping leaders understand this aspect as well. Thank you for doing that.
Awesome. I know we’re up on time, so for everyone listening, Marianne has this book called Kill It with Fire that we were just talking about. It’s available on Amazon; I’ll link it when I release the episode. She can also be reached through LinkedIn and Bellotti.tech, that’s her page, and you also have a Medium blog. What is that called?
It’s Medium, and my handle on most social networking platforms is Bellmar, B-E-L-L-M-A-R. It’s medium.com/bellmar. It may be a subdomain now, they may have changed it, but it’s linked on all my other things.
It’s been awesome to have you. Everyone listening, go follow Marianne, and I hope you’ve enjoyed listening to us geek out about big systems here. If you had joined at the beginning of this call, before we recorded, we also had Marianne’s cat on the podcast. I’m pretty sure he’s lurking around somewhere there.
He was. He was sitting right over here, and then he made this maneuver like he wanted to sit in my lap and cuddle, and I was like, “Get away.” If you saw me gesticulating in the earlier call, that was me nudging my cat off, but now he’s decided that this is boring because I’m just talking about work, and he’s off some other …
Well, for whatever it’s worth, I thank you and the cat for being so generous with your time on the podcast and I really enjoyed our conversation, Marianne, and I wish you the best with the book and with everything that you’re doing. Thank you so much.
Thank you so much for having me and I really look forward to your listeners kind of jumping into the conversation.
Big Ideas in App Architecture
A podcast for architects and engineers who are building modern, data-intensive applications and systems. In each weekly episode, an innovator joins host David Joy to share useful insights from their experiences building reliable, scalable, maintainable systems.
Host, Big Ideas in App Architecture