Observability & Statelessness with TripleLift’s Chief Architect

Dan Goldin

Chief Architect at TripleLift

Database architecture can be a vital part of enhancing and maintaining customer success! In this episode, we explore several crucial aspects of backend engineering, including stateless systems architected for resilience and how to do observability well. Our guest, Dan Goldin, Chief Architect at TripleLift, shares his expertise, drawing from his years of experience at startups and in the ad-technology industry. For those interested in customer experience, Dan discusses the power of data and how to leverage it to optimize app interactions. For those interested in application architecture, he talks about building resilience into different systems at TripleLift. Join us as we discuss:

  • The value of investing in “clean code”
  • How TripleLift gradually developed a strong data engineering function
  • The challenge and importance of using observability to enhance customer experience 
  • Building stateless systems for resilience
  • The cost of outages and architecting to prevent them

Tim Veil:

All right everybody. Welcome to another episode of Big Ideas in App Architecture. I’m your host Tim Veil, and I’m really excited today to be joined by Dan Goldin, the chief architect of a company called TripleLift. Dan, welcome to the show. I’d love to get started by hearing a little bit about you, your background, and certainly what you’re doing today at TripleLift. Tell everybody what TripleLift is and how you got there, I think, is a great way to start just to get things kicked off.

Dan Goldin:

Sounds good. First of all, thank you for having me. I met you before. I listened to a lot of the previous episodes and I hope I can do at least as well and be as interesting to the-

Tim Veil:

No pressure.

Dan Goldin:

… other listeners. So no pressure. So once again I’m Dan. I had an interesting, I guess, story. So my dad is a self-taught computer programmer. He was a chemist and he got super into computers when I was a kid. So we immigrated from the Soviet Union in ‘89 and I think as soon as we could afford it, he bought a computer, the green text on the black monitor, pre EGA, I don’t know, pre it’s like EGA/CGA.

Tim Veil:

Pre everything.

Dan Goldin:

Along those lines, pre everything. And so I always got exposed to it, played the games, got internet around middle school and I actually think that’s a really good time because I saw the before era and then the post eras. Dealing with dial-up, taking forever for it just to load. In college you suddenly get exposed to fast bandwidth, that whole jazz. So in college I really had a similar mindset. For me it was, hey, my dad could teach himself code. I’ve always liked computers but let me focus on something else. So I studied a bit of math econ, used that to get a few roles after college in, it was called quantitative engineer. But these days it’s data science. So I joke, I was doing data science before it was called data science. I worked at a couple of companies ranging from a company that did a lot of data analysis for pharmaceutical companies. So that’s when I had to learn to use the tape drives and loading data and extracting them, running all sorts of analytics. Did a little stint of the quantitative finance world in the 2008 era where … Maybe we’re in this era right now, hopefully not.

Tim Veil:

Yes. And I do want to talk about that by the way. At some point I want to hear a little bit more about your experience there, because yeah, you’re right. I think we’re entering uncharted territories or uncertain times. So we’ll definitely come back to that.

Dan Goldin:

I was a bit naive then, but for me it was interesting because the usual day-to-day suddenly got interrupted and you’re like, hey, I’m suddenly doing these new interesting things. Obviously now in hindsight, I’m like it’s terrible for all the people affected. But first off I’m like, hey, I got to do some new interesting things. At that point I also realized finance wasn’t for me. I joined this company in New York called Yodel, about 200 people, doing pretty similar things. Quantitative engineering. Got exposed a little bit to advertising and I spent a year there as a product manager near the end, wanting to get a bit more awareness of the business. Had the usual, I think, engineer trap of becoming very solution minded and not problem focused. So I would go out and talk to people, understand their problem and then go out and do whatever I wanted to do anyway. So it didn’t really work out too well for me. Learned I enjoy the engineering side a lot more.

Tim Veil:

Engineers never do that, Dan. That really shocks me that people in engineering would build what they want.

Dan Goldin:

It’s just so obvious. They don’t really know. Customers don’t know what they want.

Tim Veil:

They don’t know, only you know, Dan.

Dan Goldin:

Only I know.

Tim Veil:

Only we know as engineers. That’s funny.

Dan Goldin:

We’re all channeling our Steve Jobs. We all think, we’re delusional. So I ended up doing a startup after that and that’s when I really learned a lot of the, I would say real broad, full stack software engineering. It was a company called Glossy. We ended up rebranding to Pressy. But if you think about the era, it was around 2011, 2012 and around then it felt like there’s a new social media network being launched every day. Our product was let’s sort of let people connect all their social media accounts, ingest everything they’re posting, and sort of de-duplicate it. So you would normally post an image on Facebook and you post the same image on Twitter and Instagram. Google Plus was the thing back then and we’d say, hey, this is actually the same image. De-dupe it and then display them all on a webpage that would auto update as you post across these social channels and really make this living, breathing version of yourself.

For people familiar, it was kind of like an about.me was big at that time and Flipboard, which was that magazine style layout for your social media feed. So we tried to do that. We got a lot of interest from people who signed up, “Hey, this is awesome, this is great.” But no one really ended up using it, which sort of makes sense in hindsight. You connect, you get a cool experience and you’re like, well what the hell do I do with this thing? We ended up doing a little bit of a pivot selling to a few small colleges, high schools. And they actually had a real problem or a real use case. They would skin it, add their own logo, design it in their school colors and then position to their alumni parents. We ended up having a bit of a founder conversation. I think in hindsight we were all too sort of immature for it and we’re like, well business, we don’t want to do business, we want to do a consumer. So we ended up selling it to a pretty small advertising agency.

Another joke I say is for some businesses it’s new life money, new generation money, new house money; for me it was like new skateboard money. So almost all the value was in learning. Because I come from more of a quant background, and then it was jQuery and AWS and dealing with web servers. So I think that was, for me, a very formative experience and that’s when I ended up consulting for a bit, realized I really like being part of a team. And then I joined TripleLift where I’ve been almost 10 years now and I could dig into that a bit, but I think I interrupted you.

Tim Veil:

No, no, I was just going to point out that I … because I do want to go back to your experience as a founder and going through that process. Because you’re right, starting a business can have all sorts of outcomes. But I do think, and I’m glad you pointed it out, that one of the most important lessons, I think, learned is just the experience of doing it regardless of the financial outcome. Just going through the process of starting something, going out on your own, having to deal with all the sorts of things that I think you’re protected from when you’re an employee of a company, having to sort all that stuff.

So it would be fun to go back and talk about some of those lessons not just from technology but just really the personal side of learning about what it takes to start a business and go from there. But before we go back in time, I want to hear a little bit more about TripleLift because I have to admit I’ve never had much exposure to the ad tech space. And so I’ve spent some time since you and I last talked doing a little bit of research about what TripleLift does. Obviously you and I talked a little bit about it, but it’s some amazing tech and you all are doing some really neat stuff. But just if you can spend a moment on what it is, what markets you’re serving, what you all are building. Because it was really, really interesting when I dug into it a bit.

Dan Goldin:

Yeah, this is the hardest question to answer because you’re like, “What do you do?” Now, I’ll give a preface as to why. I assume the audience is technical, and generally a lot of people say, “Ooh, advertising, stay away from it.” But there’s a positive side: it does power the open internet right now. Imagine, especially if you’re not wealthy, you can’t pay for a Wall Street Journal subscription. A lot of the content you get online is free and it is paid for through advertising. Not to mention a lot of small businesses: advertising supports their products and they’re the ones that have been affected pretty heavily by Apple’s sort of anti-tracking work.

So where we sit, going back to this: if you think of a typical ad, the way to see it is you go to a website or an app or whatever the case is, it loads the content and there’s a little bit of JavaScript that makes a request to an ad exchange. That exchange then takes that request, extracts information like who is the user, what is the geography, what is the time of the day, what browser are they using? Fans it out to dozens of different, you call them DSPs, demand side platforms, and then they internally have all these proprietary rules and bidding logic for how do I choose the ad to show and how much do I want to pay for it.

So there’s some companies, and people joke around [inaudible 00:08:13], that tend to do retargeting. I see a pair of shoes and it follows me around the internet. So their bread and butter is coming up with these data science models. I should show the same shoes to someone three hours after they’ve seen the first ad and then again the next week and then they’ll sort of drop out. So companies specialize in that. Some companies are more focused on brand advertising. Like, I am Coca-Cola. You’re not going to go and click on the ad for Coke and buy Coca-Cola online. It’s more that their job is to make it top of mind.

So there’s different companies that sort of specialize in these different approaches and they work with advertising agencies and brands in order to get dollars loaded in and work with them to figure out the target. So a typical campaign for Coca-Cola might be, I want to run this ad across the United States. That’s relatively simple one. You might have other ones which are, well I’m a Toyota dealership and I know trucks are more popular in, let’s say Texas, I’m going to run ads for trucks in Texas. I’m going to run, I don’t know, Prius ads in New York.

It’s a stereotypical example, but generally, this all happens in real time. So you land on the webpage, it sends out a request, and every time you see an ad, that was an auction run. And so you can think about all the compute that goes into it. And one of the more interesting things is the performance here. We typically try to get an ad to respond and render within 300 milliseconds. So we get a request, we fan it out, they have to run their evaluation, they send back a response, we apply various business rules.

So for example, a common one could be a specific publisher may say, “Hey, I have an exclusive with Toyota, I don’t want to show any other car ads.” So we could only allow Toyota in that case, block for it, and then pass it on to be rendered, displayed. And then we also track was it in view, was there engagement? And then that goes back and feeds into some of the data science optimization models. So we’re doing about 200 billion auctions a day. So it’s pretty massive globally.
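The flow Dan outlines here (fan the request out, wait for bids within a latency budget, apply a publisher rule, pick a winner) can be sketched roughly as follows. This is only an illustration under stated assumptions, not TripleLift's implementation: the DSP names, the `ask_dsp` stand-in for a network call, the 0.2 second demo deadline, and the "highest bid wins" rule are all hypothetical simplifications.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Bid:
    dsp: str          # hypothetical demand side platform name
    advertiser: str   # brand behind the ad
    cpm: float        # price offered per thousand impressions, in dollars


async def ask_dsp(dsp: str, advertiser: str, cpm: float, delay_s: float) -> Bid:
    """Stand-in for a network call to a DSP; delay_s simulates its latency."""
    await asyncio.sleep(delay_s)
    return Bid(dsp, advertiser, cpm)


async def run_auction(calls, deadline_s: float,
                      allowed_advertisers: Optional[Set[str]] = None) -> Optional[Bid]:
    # Fan out: start every DSP call concurrently.
    tasks = [asyncio.ensure_future(call) for call in calls]
    # Wait until the deadline; anything still pending is too slow to count.
    done, pending = await asyncio.wait(tasks, timeout=deadline_s)
    for task in pending:
        task.cancel()
    bids = [task.result() for task in done if task.exception() is None]
    # Business rule: e.g. a publisher with an exclusive only allows one advertiser.
    if allowed_advertisers is not None:
        bids = [b for b in bids if b.advertiser in allowed_advertisers]
    # Assumed simplification: the highest remaining bid wins.
    return max(bids, key=lambda b: b.cpm, default=None)


def demo() -> Optional[Bid]:
    calls = [
        ask_dsp("dsp_a", "Toyota", 2.10, 0.01),
        ask_dsp("dsp_b", "OtherCars", 3.50, 0.01),  # removed by the exclusivity rule
        ask_dsp("dsp_c", "Toyota", 9.99, 1.0),      # misses the deadline
    ]
    return asyncio.run(run_auction(calls, deadline_s=0.2,
                                   allowed_advertisers={"Toyota"}))
```

In this sketch `demo()` returns the `dsp_a` bid: the higher `dsp_b` bid is filtered out by the publisher rule and the `dsp_c` bid arrives after the deadline.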

Tim Veil:

Wow.

Dan Goldin:

Yeah, so it’s been this constant investment over the past 10 years. When we started, and this I think gives a little bit of a sense of the evolution of the product as well. So when we started we were doing what the industry calls native advertising. And the pitch there was you go to Facebook and you have these ads that match the look of the other content. And yet you go on the internet, and whether you’re the New York Times or a recipe blogger, you’ll see the same sort of ads, which doesn’t really make sense. A lot of these publishers spent an inordinate amount of money and time to make their content look great. Why do their ads look so terrible and commoditized?

So for us, the initial product was, let’s work with these publishers to create unique templates. You may want an ad that’s square, that’s not typically what you see in the industry. So we would work with them to create these unique templates and work on, let’s say some computer vision magic in order to crop and resize images to fit within the desired layout, inherit their typography and their style, and this idea of deconstructing the ad components from the way the ad itself was rendered.

So that was our first product. We’ve achieved success there and over the years it’s been, are there other formats we could run? Let’s start doing video. There’s also even the transaction ideas. It used to be very, you can imagine, handheld. We needed to have a salesperson go to a brand or an agency and fight for every dollar. Now a lot of our work has been why don’t we integrate with these DSPs directly and then have their sales team sort of push TripleLift in order to make money?

So that’s generally the big story, and going forward, a lot of the work for us is how do we do more on video? So we have a pretty big connected television team and we acquired a company last year that specializes in privacy. So how do we take advantage of these privacy changes in order to create better and better products? I think it’s impressive, but when you think of a journey, it’s not like we were doing 200 billion auctions a day to start. It was like you start very small and build a marketplace and you keep growing.

There’s some funny stories. One of them someone’s like, “I’m willing to pay $2 to show a thousand of these ads.” So typically everything’s measured in terms of thousands because each ad is worth fractions of a penny. So we charged them $3 and they’re like, “Hey, how come we got charged more?” I’m like, “Oh you were part of our beta.” Lo and behold, everyone got charged more because we had a bug. So it’s like that’s sort of where we started.

Tim Veil:

Yeah. First of all, I think it’s so fascinating to talk to folks who have been part of these kind of lengthy journeys of companies. Because you do, you get to see the evolution, obviously of the business and those outcomes, but of the technology from very early to now. To me that sounds like an incredibly impressive number. I don’t know how that measures in the grand scheme of ad tech, but that number of impressions, or I can’t remember the exact term you used, but to me that’s super interesting. So obviously you just shared one story about a bug or a mistake that happened, but can you talk just from an architectural perspective? As you said, I think you’ve been there 13 years, right, I think roughly?

Dan Goldin:

No, almost 10.

Tim Veil:

Okay. So quite some time. Without giving away, obviously, all the important details or company secrets, what was the architecture, if you will, of the technology that delivered this then versus maybe kind of share a little bit of some of the evolution of where you guys are now?

Dan Goldin:

So when I joined the company was about 10 people, and at that point it was half engineering and half commercial folk, marketing folk. And so when you start a marketplace, especially in advertising, it’s pretty hard to get immediate scale. So our first version, our first product when I joined, it was built on top of another company. And I don’t want to necessarily name names, but you could think of a company that provides roughly a way for you to manage your own … Think of it, we were in essence acting as a publisher, a website owner, and we were receiving these requests and then sending them out through this network. But we had to do a lot of hacks in order to get this idea of native to work.

It was something designed for banner ads, but we had control of both the website that it would be rendered on and the ads themselves. We used a variety of JSONP hacks to be able to … Normally it wouldn’t work, because typically what an ad has is HTML. What we did instead is we passed basically a callback, we passed back sort of a JSON call and we knew that the JavaScript code that would be executed was already on the publisher page. So in that case we were sort of passing these [inaudible 00:14:50].
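For readers unfamiliar with JSONP: instead of returning HTML, the server returns a small script that calls a function already defined on the page, passing the data as a JSON argument. A minimal server-side sketch in Python, assuming a hypothetical callback name `renderNativeAd` and made-up ad fields (the transcript does not say what the real ones were):

```python
import json
import re

# Only allow plain dotted JavaScript identifiers as callback names, so the
# response cannot be turned into arbitrary script by a crafted parameter.
_CALLBACK_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*(\.[A-Za-z_$][A-Za-z0-9_$]*)*")


def jsonp_response(callback: str, ad: dict) -> str:
    """Wrap the ad data in a call to a function that already lives on the page."""
    if not _CALLBACK_RE.fullmatch(callback):
        raise ValueError("unsafe callback name")
    return f"{callback}({json.dumps(ad)});"


# Hypothetical ad payload: the page's own code decides how to render it.
body = jsonp_response(
    "renderNativeAd",
    {"title": "Example", "image": "https://example.com/a.png"},
)
```

The point of the pattern is that the ad slot carries data rather than markup, so the page's existing script controls the final look of the ad.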

Tim Veil:

But that makes sense. You got 10 people, you’re trying to get something going, leverage the tooling, the system, the ecosystem that’s already out there. You didn’t say this term, but almost kind of like, hey, we’ve got to get a minimally viable product out the door. We’re not going to build for year 10 on day one. So that makes sense to me.

Dan Goldin:

Yeah, exactly. And at some point once there’s enough traction and volume, you realize that, very much like we love talking about tech debt, the decisions you’ve made just make it very hard to innovate and build forward. One mental model I really like is, if you think about physics, there’s kinetic energy and potential energy. So it’s like you’ve got to invest in clean code, so then you build up that potential energy. You have a product feature or enhancement you want to do, you release very quickly through kinetic energy, but then you’re back at the bottom of the mountain and you invest again in cleaning up the code base or, in our case, the architecture a bit. So in our case, we spent quite a lot. We knew we were going to move away from it at a certain point and that’s sort of what we were doing, I want to say between 2014-ish to around 2016. And I remember that day pretty clearly because it was on my birthday that we actually did this sort of swap, but generally what we did is we built our own exchange. So if you think about a network diagram, it’s like our code: we would run the JavaScript and then it makes a request to their server and then their server does all the magic, and we’ve sort of loaded all of our content, all of our assets, all our rules into their system.

In that case we just did a swap and said, hey, let’s release this new JavaScript that’s going to live on these publishers’ pages and make a request to our system instead, just changing a single URL, and then still have our system proxy, basically the way it used to work, back to these other companies' networks. And there was this gradual migration of all the integrations that were connected to them: let’s shift them over to directly integrate with our exchange.

And that was the joke of, hey, we’re charging too much, we’re charging too little. But we did spend a lot of time … Around there, it’s not as simple to build something from scratch because part of … Something that ended up being surprisingly, I would say interesting is before that we didn’t have to do any reporting. We would get reports from this other system. Now you have to collect … and we had all these tricks we would sample, hey, collecting data is expensive, just drop an event one of a hundred times and then assume and sort of approximate what you need to see.
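The sampling trick mentioned here can be illustrated with a short sketch. It assumes the reading that roughly one event in a hundred is recorded and totals are approximated by scaling the sample back up; the function names and the fixed random seed are hypothetical choices for the example.

```python
import random


def sample_events(events, rate: int, rng: random.Random) -> list:
    """Keep roughly one event in `rate`; the rest are never stored."""
    return [event for event in events if rng.randrange(rate) == 0]


def estimate_total(kept_count: int, rate: int) -> int:
    """Approximate the true event count by scaling the sample back up."""
    return kept_count * rate


rng = random.Random(42)              # seeded so the example is repeatable
events = range(100_000)              # stand-in for a stream of ad events
kept = sample_events(events, 100, rng)
estimate = estimate_total(len(kept), 100)
```

The estimate lands near the true count but not exactly on it, which is acceptable for dashboards and becomes a problem once exact figures (for example, billing) are needed, hence the later move to collecting every event.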

And at some point we’re like, we actually need to collect every event. And then it’s like, hey, now it’s the rise of Kafka, now it’s the rise of a proper data pipeline. MySQL no longer works; at that point we invested in Redshift, started really building up a proper data engineering function. It’s sort of interesting. When you’re small you generally have generalists who do a little bit of everything. And as you get larger you have specialists that understand the particular area and they have known expertise. And I think the challenge for businesses is, how could you, I guess, grow them in tandem? Because you don’t want to have specialized needs without specialists and you also don’t want to sort of have the inverse. You don’t want to necessarily hire specialists that can’t really support where you are. So I think it was this healthy balance of, hey, we need to build some data engineering skills, let’s build expertise and hire for it.

Tim Veil:

There’s so much to dive into there, because I think of the companies I’ve been a part of, and certainly at Cockroach this can be true too. When you start small, you’re right, everybody is a generalist, everybody kind of has a front seat at the show or a seat at the table, whatever you want to describe. And people get very involved. And then as businesses grow, you’re absolutely right, you start to need to have specialists and those people that were really good at everything and were part of everything need to pick a direction a little bit. Organizationally, I wonder if you found that to be somewhat difficult, where folks who had been very engaged in all parts of the company, the decisions, are now having to pull back and focus on one thing. I think that organizationally I’ve seen that being somewhat of a challenge where people are like, hey, I miss being part of everything. Now you want me to only do this? What happened? But I think it’s part of a natural evolution of companies is that you have to get specialized in order to keep growing.

Dan Goldin:

No, that’s definitely true. And we’ve had people leave for it, and I think it’s an honest conversation. And a lot of people say, “Hey, this is not the company I joined three years ago.” Some days I feel like that. I think to your point, it is natural. I do think a lot of people find interest in specialization. We have a great engineer who did a little bit of everything and he’ll readily admit that he doesn’t do front end and he’s become our expert working on this realtime bidding system and exchange and optimizing the low level Java.

It is sort of a good sign when you use the open source library, we use Netty and at some point you realize, hey, there’s this issue in Netty, I’m going to commit it upstream and fix a bug down in their system and that line. And it’s good. Some people like that, this idea of this depth and optimization. Other people do stay generalists and just like jumping around and fixing whatever comes and learning whatever comes.

There’s probably a lot of these parallels too. Open source probably encouraged this idea of specialization. Because if you have to write everything from scratch, think like 20, 30 years ago, you just weren’t able to create the amazing products we all create now. So it’s sort of the flip side. I don’t remember, someone told me or I read this, but it’s the idea of you compare it with medicine. You used to have a hundred years ago there was this thing called a doctor. And now you have cardiologists and psychiatrists and podiatrists. So maybe we’re seeing the same thing with software engineering. Obviously there’s this whole AI angle too. And what does that mean? I do think you want to give people opportunities to discover and see what they enjoy and then give them work that they both want to master but also that supports and grows the business.

Tim Veil:

One thing I also wanted to touch on, because again it’s very top of mind for us and I think it’s top of mind for a lot of the companies that we talk to. And interestingly enough, I got back yesterday from the Gartner Data & Analytics Summit down in Orlando. And so to some extent I think this was a thread throughout a lot of the conversations and that’s observability.

So you kind of said, hey, we were using this tool, they were providing us all the reports. I had all this great insight into what was happening. When I move away from that, I have to build my own. I’m using the term observability, you may have a different term, but I know building a product here at Cockroach, that’s been one of the really key and important things that we’ve done. It’s one thing to build a product, but if you don’t really know what’s happening inside the product, how it’s being consumed, are issues occurring, are they not occurring? All sorts of stuff. That observability becomes, I think, really critical in building a wonderful product. What are your thoughts on that? Has that been as important a piece of your puzzle as I think it has been for us?

Dan Goldin:

I’m actually very glad you brought it up because I feel like definitely so, and I’ll give some angles and maybe some thoughts on it. One thought is it’s rare, and maybe from what I’ve seen, it’s rare that a product manager would sort of prioritize observability. And I think we got lucky because we didn’t have product managers for quite a while. So a lot of the time it was engineers getting exposed to problems directly. The first time an engineer gets an issue they’re like, oh screw it, I’ll just deal with it. Second time they do it, whatever, I’ll deal with it. And then the third time, they’ll be like, let me just build a solution for it. Good.

There’s a lot to be said about just escalating things directly to engineering teams because they really are the most capable of solving it. And I think we’ve invested a lot in these self-serve tools just to help us address problems. Some examples of the ones we’ve done is we created … for every auction there’s a lot of steps. There’s what we receive, how do we enhance it? So an example of enhancement we do is when we receive a request, it typically has a cookie to identify the user. But that’s unique to our domain. Because cookies are tied to a domain, and we need to map it to the partners we work with. Because if you’re, let’s say a buyer like The Trade Desk, you don’t care if it’s TripleLift user 155, you want to know that it’s Trade Desk user A, B, C, D, E. So we have to take this request, do a very quick lookup, we use Aerospike, so a pretty fast database used in ad tech, augment the ad request. Then we attach also rules to it, which buyers are eligible? How long did they take to respond? What creative did they respond with? How much are they willing to pay? Who paid the most? Various data science predictions around performance. And then we determine the winner.
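The identifier mapping Dan describes can be pictured as a key-value lookup that runs before the request is fanned out. The sketch below uses a plain Python dictionary as a stand-in for the fast store (it does not use any Aerospike client API), and the IDs, partner name, and field names are hypothetical.

```python
from typing import Dict, Optional, Tuple

# In-memory stand-in for the low-latency key-value store.
# Key: (exchange's own user ID, partner name); value: that partner's ID for the same user.
ID_MAP: Dict[Tuple[str, str], str] = {
    ("tl-155", "partner_x"): "px-ABCDE",
}


def enrich_request(request: dict, partner: str,
                   id_map: Dict[Tuple[str, str], str] = ID_MAP) -> dict:
    """Return a copy of the ad request augmented with the partner's user ID.

    If the user has never been matched for this partner, the field is None
    and the partner has to bid without recognizing the user.
    """
    enriched = dict(request)
    partner_id: Optional[str] = id_map.get((request["user_id"], partner))
    enriched["partner_user_id"] = partner_id
    return enriched


request = {"user_id": "tl-155", "geo": "US", "browser": "Safari"}
for_partner_x = enrich_request(request, "partner_x")
for_partner_y = enrich_request(request, "partner_y")
```

Because the lookup sits on the critical path of every auction, the choice of a very fast store matters more here than in most applications.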

So we created a pretty clever idea, like a debug auction. If you knew the right tricks and you were on the VPN, the right network, you’d be able to add something to the URL and you’d get, maybe if you printed it out, it was probably a 20 page doc of all the details of what happened. But the way we did it was this work in progress, essentially; it was pretty hacky, and at some point people realized, hey, this is actually incredibly valuable. And someone at some point had this idea of, hey, I’m going to create an interface called Debuggable. And so you just add it right to a particular bit of code in our execution chain and lo and behold, suddenly we have pretty robust reporting on the auction, and that’s for a particular auction.
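One way to picture the Debuggable idea is an interface that any step in the auction chain can opt into, with the chain collecting a report only when debugging is requested. The name `Debuggable` comes from the conversation; everything else below (the `debug_info` method, the floor-price step, the `run_chain` helper) is a hypothetical sketch, written in Python rather than the Java the team uses.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class Debuggable(ABC):
    """Steps that implement this can explain what they did for one auction."""

    @abstractmethod
    def debug_info(self) -> Dict[str, object]:
        ...


class FloorPriceStep(Debuggable):
    """Hypothetical step: drop bids below a minimum price."""

    def __init__(self, floor_cpm: float) -> None:
        self.floor_cpm = floor_cpm
        self.rejected: List[str] = []

    def run(self, bids: Dict[str, float]) -> Dict[str, float]:
        self.rejected = [b for b, cpm in bids.items() if cpm < self.floor_cpm]
        return {b: cpm for b, cpm in bids.items() if cpm >= self.floor_cpm}

    def debug_info(self) -> Dict[str, object]:
        return {"step": "floor_price", "floor_cpm": self.floor_cpm,
                "rejected": list(self.rejected)}


class PickWinnerStep:
    """Deliberately not Debuggable, to show that steps opt in one at a time."""

    def run(self, bids: Dict[str, float]) -> Dict[str, float]:
        if not bids:
            return {}
        winner = max(bids, key=bids.get)
        return {winner: bids[winner]}


def run_chain(steps, bids: Dict[str, float],
              debug: bool = False) -> Tuple[Dict[str, float], List[dict]]:
    report: List[dict] = []
    for step in steps:
        bids = step.run(bids)
        # Only steps that opted in contribute to the debug report.
        if debug and isinstance(step, Debuggable):
            report.append(step.debug_info())
    return bids, report
```

With this shape, normal traffic pays almost nothing for the feature, while a flagged request gets a step-by-step account of the auction.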

The other big piece we’ve done, I actually think it goes … Everyone’s going to have their own reasons for success, and it could be right, could be wrong, but I think a big part of it was our investment in observability and especially co-locating commercial metrics with engineering and tech metrics. Because it’s relatively straightforward to take Datadog, New Relic, Honeycomb and they give you a ton of instrumentation on the code, albeit it’s pretty expensive. But it’s a lot of work to then go in and say, how do I also push in the commercial metrics we have?

So in our case it would be something like, how many dollars are we making per second by partner? For our publishers, how much are publishers making per second? For our ads themselves, what is their render rate? Like 98% render’s great. Is there a combination where it drops below 90%? In that case, throw an alert and we use PagerDuty for things like that. But a big part of it is really getting into a single system because then you can say, oh I can understand why. Or, spend dropped, oh because latency in our US west region increased this much, this much. One or the other. Especially when you are dealing with incidents and stressed out and everyone’s trying to find answers. Getting that information available in broad presence is incredibly valuable.
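As a rough illustration of co-locating commercial and technical metrics, the sketch below keeps revenue, render counts, and latency in the same record so that an alert can carry all of them at once. The 90% threshold comes from the conversation; the field names, partner and region labels, and the `find_alerts` helper are hypothetical, and no monitoring or paging vendor API is used.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    partner: str
    region: str
    ads_served: int        # technical: ads we attempted to render
    ads_rendered: int      # technical: ads that actually rendered
    revenue_usd: float     # commercial: money earned in this window
    p95_latency_ms: float  # technical: latency seen in this window


def render_rate(sample: Sample) -> float:
    if sample.ads_served == 0:
        return 1.0  # nothing served, so nothing failed to render
    return sample.ads_rendered / sample.ads_served


def find_alerts(samples: List[Sample],
                min_render_rate: float = 0.90) -> List[Tuple[str, str, float, float, float]]:
    """Return (partner, region, render rate, revenue, latency) for bad combinations."""
    return [
        (s.partner, s.region, round(render_rate(s), 3), s.revenue_usd, s.p95_latency_ms)
        for s in samples
        if render_rate(s) < min_render_rate
    ]


samples = [
    Sample("partner_x", "us-east", 1000, 980, 12.5, 80.0),
    Sample("partner_x", "us-west", 1000, 850, 7.1, 240.0),
]
alerts = find_alerts(samples)
```

Here only the `us-west` sample triggers, and the alert already shows the lower revenue next to the higher latency, which is the kind of side-by-side view Dan is describing.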

Tim Veil:

No, I think obviously you guys have such a focus on the customer and consumer experience, but even for us it’s the same kind of thing. This ability to, I think, combine the data of the product, the insights, how are we serving this request or various other kind of technical things that we’re monitoring with other more customer-facing, user-facing interactions and being able to pull all of this data together to have some insight into whether things are going well or not well, whether this is an area of the product we need to focus on or pull away from.

It’s so critical, but it’s so hard to do, I think, well. I think this is one of the things that people who have really cracked the SaaS algorithm have figured out how to marry all of this data about interactions to make their product better. And it’s difficult. It’s one we’re working on really, really hard. And then I think to your point, we were lucky in that very early on we made an effort to track and observe lots of things within the product stack that maybe others would not have had the forethought to do. And it’s helped a lot. Because if not, as a former engineer, I was so panicked that … and by the way, I consider myself-

Dan Goldin:

Was an engineer or is an engineer?

Tim Veil:

Yeah, I was going to say I better correct that because I actually still enjoy writing a lot of code. But I was always so panicked that something I did was going to cause issues. And you’re lucky if the issues that your code creates are obvious because it’s easy to fix. But a lot of times the issues that we find that really plague people are very, very difficult to identify. And again, having really well thought out observability helps give some confidence that if something does sneak in, you can get to it and resolve it.

Dan Goldin:

We benefit in some cases from … Every industry has its own challenges, but what’s, I would say, easier than most for us is at the end of the day you’re dealing with ads and each ad is, once again, worth fractions of a penny. So if you show an ad or don’t show an ad the wrong way, you just have to be right the majority of the time. And what that has afforded us is this ability to test in production. I know everyone has pretty strong views on testing and how to do it. We benefited a lot from basically diverting 1% of our production traffic to what we call a staging environment. Staging also means very different things to different people, but for us it’s connected to the production, I would say state, but it’s running a new version of the code.

And it’s useful because so many, probably very similar to you, it’s the problems, the issues happen on the edges. In our case it could be something like, oh, in this version of iOS on Safari, this video format on this publisher is not interacting nicely with their HTML. They upgraded, let’s say their jQuery library. It’s those sorts of examples and we have a long ways to go to make that better. I love this idea of this pure self-learning system that says, hey, this combination of things used to be perfect and all of a sudden I’m seeing weird anomalous behavior.

The problem with that is that there’s just so much noise and how do you differentiate the two? But that is an area we leaned in pretty heavily on relatively early. It’s like we benefit from being able to run in production. So instead of trying to come up with unit tests for every single combination of things, let’s just go in there, look at the high-level stats, collect all the data we can, make it easy for our team to dig into it and analyze it.
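One way to picture the 1% diversion described above is a deterministic split on a request identifier. The following is a minimal sketch in Python, not TripleLift’s implementation: the `route` function, the `STAGING_PERCENT` constant, and the idea of hashing a request ID are all assumptions made for illustration.

```python
import hashlib

STAGING_PERCENT = 1  # assumed: share of traffic sent to the staging build


def route(request_id: str, staging_percent: int = STAGING_PERCENT) -> str:
    """Return "staging" for a stable share of request IDs, else "production".

    Hashing the ID (hypothetical field) keeps the decision deterministic,
    so the same request ID always lands on the same side of the split.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "staging" if bucket < staging_percent else "production"
```

Because both sides see real traffic against the same production state, their high-level stats can be compared directly, which is the point made above about looking at aggregate numbers instead of enumerating every edge case in unit tests.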

Tim Veil:

At Cockroach, one of the things that we talk a lot about is resilience and obviously from a database perspective, creating a data architecture that can’t go down. And the reason we do that is we find that, at least for our customers, outages are a big deal. You don’t want your database and ultimately your applications to go down. It can be a serious thing. How do you all think about resilience of your architecture? What happens if and when things go down? What do you all do about that or how do you think about those problems? In some industries it’s not as important as others. I’m just curious how you think about those kinds of topics.

Dan Goldin:

So for us it’s for sure a journey. So when we’re out, you can think of the impact to us: what happens if we’re down? So as a company we don’t make money. So that’s one risk. Risk two is our customers don’t make money and advertisers can’t spend money. In some cases we do have obligations to customers, especially depending on the type of product they use. So once again, we never do as much as we can. And probably similar to you, you probably have quite a lot of outages that you look back on and you’ll be like, what were we thinking? It’s this constant sort of period of improving things.

The way we think about it is maybe … I’m probably going to maybe give you a high level view of four different systems. And I’m sure there’s more, but one way of thinking about it is we have the code that’s generally going to be living on publishers' pages and rendering the ads themselves. That’s JavaScript, it’s static. We use a CDN for things like that. We have this real time bidding system and that needs to stay up all the time because that’s listening to these requests, running these auctions. And we’ve spent a ton of time making that resilient. So I mentioned we’re running in four different regions, we have regional resiliency there. So if one region goes out, we’re able to shift traffic automatically to another region, it’ll fail over to another region. And the other, I don’t want to say it’s innovation because it seems maybe obvious, but we basically try to make this application as stateless as possible and minimize any writes it needs to do.

So the only output it produces is to Kafka. So we’re using Kafka, but it does not connect to MySQL, it does not connect to anything. Instead what we have is a separate application that sort of reads the state from all sorts of places. It reads it from MySQL, we use Mongo, we have a few other services that we connect to. So it pulls a state, creates these giant portable objects, uploads them to S3, and then it says, hey, these instances on the edges and we’re running, on any given day it fluctuates day over day or hour by hour depending on the volume, but about a thousand, between 1,000 and 2,000.

What they do is they get an update, hey, I have a new list of ads, or I have a new list of rules for my publishers. Fetch them from S3, load them internally and say suddenly you have a new state. So that’s kind of nice, because it means each application is stateless. If you didn’t have data updated five minutes ago, whatever, you use the most recent healthy state.

So this idea of not coupling your systems too much and understand … I guess investing in degraded performance. What does degraded performance look like, and how do you avoid compounding another failure? We have a data pipeline, and similar to that, if our data pipeline’s delayed, it’s okay so long as the data’s in Kafka and makes its way to S3. And at that point we have a lot of our batch jobs take over. And then the last piece is most of the ones from, I would say sort of managing the state, various UIs and APIs where customers internal and external could go and then modify these rules. And if that goes down, the rule, it’s not terrible because what that means is just this is what’s going to run on the most recent state. But new ads or new campaigns ends up sort of blocking future revenue, but it doesn’t necessarily break the existing operating performance. I don’t know. Does that help?
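The "fetch a snapshot, swap it in, otherwise keep the most recent healthy state" pattern described above can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the actual bidder code: the `SnapshotHolder` class, the JSON encoding, the `"ads"` key, and the injected `fetch` callable (standing in for a download from S3) are all hypothetical.

```python
import json
from typing import Callable, Optional


class SnapshotHolder:
    """Holds the last healthy in-memory state; a failed refresh never replaces it."""

    def __init__(self, fetch: Callable[[], bytes]):
        # Hypothetical hook: in practice this might download an object from S3.
        self._fetch = fetch
        self._state: Optional[dict] = None

    def refresh(self) -> bool:
        """Try to load a new snapshot. Return True only if the state was replaced."""
        try:
            candidate = json.loads(self._fetch())
            # Assumed sanity check: a usable snapshot must carry an "ads" key.
            if not isinstance(candidate, dict) or "ads" not in candidate:
                raise ValueError("snapshot is missing the 'ads' key")
        except Exception:
            return False  # keep serving from the previous healthy state
        self._state = candidate
        return True

    def current(self) -> Optional[dict]:
        """Return the most recent healthy state, or None before the first load."""
        return self._state
```

The design choice worth noting is that the serving path only ever reads `current()`. If the upstream store or the snapshot builder is down, the instance degrades to slightly stale data instead of failing requests.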

Tim Veil:

No, absolutely. And I’m curious about one thing because again, this is something we talk about a lot I think, or people ask us about is, and I’m just curious, do you test failure? In other words, do you, as part of a regular process, try to kill these parts of the application and see how they respond? Because I think one of the things we found is that people oftentimes will talk about, hey, I need to have this scenario or be resilient to this. But the real world testing of catastrophic failure doesn’t … It’s not easy to do by the way. It’s really actually quite hard to do. I’m just curious, and we’ve talked to some other folks who are starting to think pretty creatively about how to inject failure into their systems. Just curious, is this something you guys are routinely testing or is this … And maybe by extension, have you experienced really significant outages and what did that look like?

Dan Goldin:

I guess to answer your first question, not as much as we probably should. And I think part of, and I’ll give you an example of failure that we did have. So part of why is over the years we’ve run into so many issues that we’ve sort of resolved them that way. And I think we have a good mental model for the way the systems communicate and we’ve sort of decoupled them in a way that makes sense.

So in our case, in some cases, I know microservices used to be big, maybe not as big anymore, but that ends up being a much harder thing to reason about. You could actually think of our architecture as being a bit more of a hub and spoke maybe. So there’s this hub, this sort of real time bidding system and you could think of it as it’s a pretty beefy Java application. It’s not necessarily a monolith, it’s a giant … but it’s close. It just does a lot. And if any system on the outside fails, it’ll just keep serving ads, doing its thing and doesn’t really matter.

So in that case we have monitoring and alerting integrations and then you fix the system that feeds into it. But the system itself is sort of stable and we’ve invested a ton in making sure. That’s why that example of rolling out to staging, the 1% of traffic, we’ll also run it … When we deploy, we deploy to each region one at a time. Deploy to Asia first because that generally has lower traffic, look at the performance, compare it. We have various Grafana dashboards that look at stats and spend per second.
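The one-region-at-a-time deploy can be sketched as a loop that stops as soon as a region looks unhealthy. This is a minimal Python sketch, assuming hypothetical `deploy` and `is_healthy` callables; in practice the health check would be whatever comparison the team runs against its dashboards.

```python
from typing import Callable, Sequence


def staged_rollout(
    regions: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Deploy region by region and stop at the first region that looks unhealthy.

    Returns the regions that were deployed and passed the health check.
    """
    done: list[str] = []
    for region in regions:
        deploy(region)
        if not is_healthy(region):
            break  # remaining regions stay on the old version
        done.append(region)
    return done
```

Putting the lowest-traffic region first in `regions` limits how much traffic a bad build can touch before the check fails.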

So once again focusing on a lot of these commercial metrics and if things look good then we’ll roll it out to Europe and US West Coast and US East. So that’s been our model. I think part of it’s because our industry is, I would say a bit more tolerant of failure. If we were doing FinTech or MedTech, we probably wouldn’t necessarily have a similar approach. We’d probably be much more invested in real failure. But I think for us it’s generally … and we’ve been okay. The most recent major outage we’ve had, and this is embarrassing, but like I mentioned, we have-

Tim Veil:

It’s okay. Everybody has them, Dan, it’s all right.

Dan Goldin:

Yeah, yeah. So we use data science models to bring more value to customers. So an example could be we try to predict what would the performance be of a particular ad and then we try to give our customers opportunities that we think their campaigns will do well in. And lo and behold, our data science team uses, sort of like we use Databricks and Spark to do some training and Python, I forgot … We use ONNX. So ONNX, I don’t know if you’re familiar with it.

Tim Veil:

I’ve heard of it.

Dan Goldin:

It’s an interface where a data scientist could run code in Python and train a model in Python. It serializes the results of that model, which could be a random forest or a regression, into this ONNX format. And then our exchange, which is written in Java and sort of uses JNI to evaluate C code, sort of loads that in and you get that fast evaluation. Because you don’t want to be evaluating in Python, especially for this realtime bidding system at massive scale, but it scales for you to do training in Python. Because that’s a one-time operation.

So in that case, lo and behold, the Python library got updated automatically because, hey, that’s what good open source is. You released a new version and it caused some break in this Python C evaluation and that took down a good chunk of us and what we’re doing. Because we just simply didn’t pin the version. And it’s one of those things that we didn’t … and the fix for it was obvious. You and I could brainstorm half a dozen different solutions. What is it? Oh, well obviously the real thing is you should pin every version. What else? Oh well you should not launch this model. Try to read it first before deploying it. Basically simulate it in the production environment before you actually push it to every instance.

It’s like, hey, don’t let the model crash. In that case, wrap it in an exception or something. The problem is this Java to C interface. Those are the things that cause problems. Because C, that’s not the most memory-safe language. And when you’re dealing with one language to another language to another language, that tends to be difficult for us.
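The "try to read it first before deploying it" idea can be sketched as a slot that only swaps in a candidate model after it loads and scores a probe input. This is a Python sketch under stated assumptions, not the exchange’s Java code: `ModelSlot`, the `load` callable, and the probe input are hypothetical, and a model here is just any callable returning a float.

```python
from typing import Callable, Optional, Sequence

# Assumed shape of a model for this sketch: features in, one float score out.
Model = Callable[[Sequence[float]], float]


class ModelSlot:
    """Keeps the current model and only replaces it with a candidate that works."""

    def __init__(self, default_score: float = 0.0):
        self._model: Optional[Model] = None
        self._default = default_score

    def promote(self, load: Callable[[], Model], probe: Sequence[float]) -> bool:
        """Load a candidate and score one probe input; swap it in only on success."""
        try:
            candidate = load()
            result = candidate(probe)
            # Reject non-float results and NaN (NaN is the only value != itself).
            if not isinstance(result, float) or result != result:
                raise ValueError("candidate returned an unusable score")
        except Exception:
            return False  # the previous model, if any, stays in place
        self._model = candidate
        return True

    def score(self, features: Sequence[float]) -> float:
        """Score with the current model, falling back to the default on any error."""
        if self._model is None:
            return self._default
        try:
            return self._model(features)
        except Exception:
            return self._default
```

As noted above, an exception handler only helps when the failure surfaces as an exception. A crash inside native code can take the whole process down, so a check like `promote` would more safely run in a separate process or on a single canary instance before the model is pushed everywhere.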

Tim Veil:

So just out of curiosity, for you personally, I know you’ve mentioned JavaScript a few times, you just mentioned Java. Are there particular languages that you kind of have gravitated toward? Are those kind of the principal languages that you all use? I’m a former, or I keep saying former, I’m not a former, I might go write code today. What am I talking about? Future engineer.

Dan Goldin:

Future. Future and current.

Tim Veil:

Definitely Java. I definitely used to do tons and tons of JavaScript. So I’m just kind of curious if those are technologies you all still strongly believe in or have you moved on to all these other fancy things that people like to talk about now?

Dan Goldin:

So we generally let teams have some freedom in what they choose. And I think it tends to be around their particular, I would say, industry standards. So the data team that does data engineering, they use Spark. So they’re a pretty big user of Python and Scala. Because that’s generally what it seems data science is using: Python. Our full stack teams will use TypeScript and JavaScript for a lot of what they’re doing. And this realtime bidding team is pretty deep into Java. And I know Java has a bad rap, but it’s also because-

Tim Veil:

Not with me.

Dan Goldin:

… 20 years ago it was pretty verbose. These days you get quite a lot of performance. It’s great performance. It’s not as verbose as people think. It’s becoming more and more functional. Some of our code in Java right now looks very much like modern clean Scala code. We’re using a ton of, I’m going to butcher it a bit, but we’re doing a fair amount of reactive programming, leaning into Spring WebFlux, digging into … just making it easier for us to really write code very much like in JavaScript. You have a series of callbacks, but it looks a lot more elegant and clean, and it’s supported in Java. So we’re-

Tim Veil:

So do you guys use Spring?

Dan Goldin:

Well, parts of Spring, but generally trying to do more of that. But yeah, we’re big fans. We used, I mentioned Netty under the hood. And I don’t know if you’re familiar with it, but it’s great technology.

Tim Veil:

It’s been around for a while too, hasn’t it?

Dan Goldin:

Yeah, it’s been around for a while.

Tim Veil:

Pretty stable. What JavaScript libraries are … I think you mentioned early jQuery, which brought back lots and lots of good memories. I’m trying to think of what some of the other ones were way back in the day that we used to … There was Dilmo or I think-

Dan Goldin:

Oh yeah, yeah.

Tim Veil:

Something like that. Yeah, and there was some other one. What was it that … It’ll come to me. But JavaScript’s fun. TypeScript I haven’t quite gotten to yet, but I know that’s kind of where a lot of things are headed. It just makes it, I think, a little bit easier.

Dan Goldin:

I have so many thoughts on front end. I don’t know, front end, once again, my background is more on the backend side of things, but front end is so complicated. It’s sort of sad. Because everyone’s like, oh, front end, that’s the easy thing. I actually think it’s incredibly hard.

Tim Veil:

It terrifies some people.

Dan Goldin:

Yeah, it’s tough. One is everyone’s going to have an opinion. If you have something hidden away in the bowels of Java code, no one’s going to know. But if something looks off on the page or you want to reuse a component here and there, what will it actually look like?

A lot of what we’re doing is moving a lot of our applications on the front end from very old Angular to modern Next. And that requires knowing which functionality is even used. Oh, we never really instrumented that much of the front end. We’re like, how do we have multiple teams? Oh, we’d like to have a consistent experience across multiple products. How do we organize our teams to do that?

Hey, what if one team needs to upgrade to a different version? How will that work in this single page app with a slightly different team that’s not ready to upgrade? How do their components work? I think for us it is somewhat of a new muscle because we’ve done, I would say JavaScript, but that has to be much more of this run in any browser, across any environment. Not this world of, hey, let’s create amazing customer-facing apps that have pretty complex workflows.

Tim Veil:

No, I totally agree with your sentiment by the way. I think people think front end is just writing HTML code or something in CSS, but some of the most talented engineers I have ever known, terrified, absolutely terrified of doing front end work. Because it is, it’s incredibly complex, and I almost think to a fault at times. Some of the tooling and technologies that have become popular I think in recent years in the front end I know are enormously powerful, but almost like we make things a little too hard to read and reason about, which isn’t necessarily a good thing.

Dan Goldin:

But that’s engineering. It comes in cycles. Even a lot right now. There’s a lot of this work around combining the front end, the back end, trying to create a single framework that … Next does a little bit of both. It lets you also create the back end. But yeah, I miss, I guess, the jQuery days, being able to make a code change and refresh your page and lo and behold things work. Now it’s like Webpack and build and …

Tim Veil:

Yeah, it’s crazy stuff. It’s crazy stuff. And the other thing too is, you mentioned it, I found myself in the work that I do now almost going back to more simple front ends and moving a lot more toward just monolithic applications. There’s a lot of value in just being able to build something quickly and not have to introduce all the complexity of all this other stuff that sometimes happens with microservices.

Dan Goldin:

There’s a theory that microservices are more organizational, that it’s solving organizational problems rather than technology problems. And I think that’s with most things, obviously there’s shades of gray, but I think there’s some truth to it. Our approach has been approaching some of these services. I’m saying microservice because it means different things, but just being reasonable about services and understanding boundaries. Part of the problem you inevitably run into is people disagree on what the boundaries are and that causes a fair amount of problems. But you could still write your code in a way that makes it easy to decouple it if needed. So that’s been our approach, think more about the interfaces, try to make the method calls. Try to be stateless, try to adopt a lot of these better practices. So if we need to split it at some point we’ll be able to, rather than sort of prematurely optimizing.

Tim Veil:

This religious zeal around, it must be this pattern or this framework. It’s like, ah, come on.

Dan Goldin:

I think when you get bitten by too many things you become more of a, I guess, pragmatist. You realize.

Tim Veil:

I agree. Well listen, Dan, we’re kind of running up on the hour and I hate to take more time than I deserve. Maybe just as we kind of close out, talk to us a little bit about … Because there’s so many other topics I wanted to get to by the way, but maybe we’ll have you on as a guest for a second time.

Dan Goldin:

I’d be glad to, this is fun. It’s, I think, better in conversation just going back and forth because I inspire a thought, you inspire a thought.

Tim Veil:

Yeah, you do. I have mental notes of like, oh man, I wish we had time to cover this. This is really interesting. But I do want to leave time because I’ve ended a couple of the podcasts like this. For us, it’s the beginning of our fiscal year, although maybe I can’t say that as much anymore because time is progressing rather quickly.

Dan Goldin:

You’re going to have to come up with a new question soon.

Tim Veil:

Yeah, I may have to come up with a new question, but nonetheless it’s spring, the flowers are blooming, it’s a time of optimism and looking forward to the future. What are some things maybe that, whether at TripleLift or even personally, some things you’re kind of looking forward to for the coming year?

Dan Goldin:

Yeah, I think given everything that’s going on, it’s easy to be a bit bearish. But I also think there’s always opportunity in constraints. For us, we have a couple of different things. One is that, I’m not sure if finance will want me saying it or not, but it’s generally for a lot of companies, it’s this idea of could we do more with less? How could we be more cost efficient? And I don’t know, I like this idea of cost optimization. I generally enjoy it. Are there ways for us to optimize this system? Do we even use this system? Hey, oh we used to not sample here. Hey, let’s start sampling and see what the impact is.

So I do think using this opportunity as a way to get better at the operational excellence because when the market recovers, which it inevitably will, we’ll be in a much better position to execute. So I’m generally excited about just bringing more rigor to our cause. We get to work in pretty cool technologies. We’ve rolled out Spot instances more and more. We’re on AWS, so being able to re-architect our systems to support Spot, that’s exciting.

The other big one obviously is more on the product side. We acquired a company last year called 1plusX and sort of their pitch, I’m going to butcher it somewhat, but the area they play in is coming up with privacy-safe ways of creating audiences. You saw, I don’t know, you read these few articles, you must be interested in cars, that sort of idea. And we acquired them in order to build this great product together and we’re sort of reaching this point of closer integration. Coming up with this joint product, launching it. So it’s exciting to see how that goes. There’s always excitement on our new product lines and connected television. I don’t know, the way I think about it is yeah, the overall market may contract, but it’s sort of our chance to get more market share. So it’s maybe a bigger slice of the pie, and the pie will continue growing and we’ll expand there.

For me personally, I don’t know, I wish I was younger and had more energy, like fewer kids, but just everything that’s happened. Like AI, I guess I’m a techno-optimist and I used to … I’m sure a lot of people will have vendors reaching out. It was like, “Hey, we use AI/ML.” And four years ago I’d be like, this is some garbage. AI/ML is a buzzword. There’s obviously something real here. I’ve used it, I like Copilot, I’ve been playing around with it. So I was like, what does it actually mean? And I don’t know, some days. And the word awesome, you have this meaning of both terrifying and amazing. So I have both of those. I fluctuate between those from one minute to another minute.

Tim Veil:

Oh, I tell you. This summit this week, I did the loops around the exhibit hall. I don’t think there was a single booth, maybe with the exception of ours, that didn’t reference ML/AI in some way, shape, or form. It seems to just be taking over the entire vernacular of what we’re doing. It’s interesting, exciting. I still don’t know what to do with it all, but it’s certainly out there.

Dan Goldin:

I’m an optimist. I think we’ll figure it out and I do think it’ll hopefully help all of us do our jobs better.

Tim Veil:

And so last question. I always like to poke at people, especially people who have books in their background as I do. Is there something back there that you love? It’s kind of your go-to book? Or is there something on there that you haven’t read yet that you’re getting to? Or is this just decoration? Any answer is good.

Dan Goldin:

I’m generally not a decorative person, but there’s also some of the books are my wife’s, but I don’t know. So the book right above my head I like, it’s Edward Tufte. I don’t know if you’re familiar with him, but he’s a big infographics guy. And I don’t know, for someone that does backend, I have a very good appreciation for good design, information density, and how you could showcase information in a good way. So that’s something I enjoy leafing through. It has good production value.

A book I’m currently reading, it’s not on the shelf because I’ve shifted to Kindle, which I sometimes regret. I wish I could have physical books, but I wish I didn’t have to move them or store them. But I like the feel of a book, but it’s Amp It Up. Because some people … it’s by the CEO of Snowflake, Frank Slootman. And I don’t know, I read business books. For me, they’re like watching an action movie. I sort of know what it’s about. I don’t have to be engaged. It’s light reading. So I like it, but I don’t know, it’s forced me to think a little bit differently.

He’s very much this straight shooter. Even by the name you could get what he’s getting at. But instead of thinking about next week, think about today. Just very much get rid of a lot of the nonsense and the optics of what we’re doing and instead focus on great work. Align incentives, make sure your sales team and your support team and your engineering team are all working towards the same goals, that sort of thing.

Tim Veil:

Yeah, I agree with you. I love reading the business books too. I’ve got all sorts of different … I try to read a couple different books at the same time. I’ll have one that’s just fun and entertaining and then one that’s like self-improvement. I hate to use the term self-help, but just something to help make me better.

Dan Goldin:

I go through sci-fi books too, periods where I’ll binge on them and read a series. Because I know as a kid, I think it’s somewhat sad. Because as a kid you’d be like, oh, the science of this unknown. And these days it feels like maybe there’s a little bit less ambition. Maybe not. But 50 years ago, or I guess a hundred years ago, when they had all these sci-fi books, time travel and this and that.

Tim Veil:

I know. It’s fascinating stuff. A lot of good books out there, by the way. It’s good, it’s fun to think about the what ifs, what the future holds. Well, listen, Dan, this was a wonderful conversation. Again, thank you so much for taking time out of your busy schedule to join us. I really enjoyed our conversation. And again, hope maybe in the future we’ll get you on for a second episode to finish some of the things we started.

Dan Goldin:

Well, yeah, thank you again. I really enjoyed the conversation with you.

Tim Veil:

Thanks for listening to this week’s episode. If you’re a fan of the show, be sure to subscribe to the podcast to get every new episode in your feed as they’re available. Also, rate us five stars on your favorite podcast platform. If you like what you’ve heard, you can also watch Big Ideas in App Architecture on our YouTube page linked in the description. Thanks again.

Big Ideas in App Architecture

A podcast for architects and engineers who are building modern, data-intensive applications and systems. In each weekly episode, an innovator joins host Tim Veil to share useful insights from their experiences building reliable, scalable, maintainable systems.

Tim Veil

Host, Big Ideas in App Architecture

Cockroach Labs
