Observability in the Cloud & Dataflow Modifications with Yolanda Davis from Cloudera

Yolanda Davis

Principal Software Engineer, Data Flow Operations

In today’s episode, we’re joined by Yolanda Davis, Principal Software Engineer at Cloudera, to talk about Apache NiFi and its role in streamlining data transfers.

Yolanda explains the architecture of Cloudera’s Apache NiFi and the UI developers use to design and manage dataflows. She also shares how Kubernetes has changed the game in terms of efficiency and scalability (while acknowledging the occasionally immense challenge of growing K8s).

We delve into the challenges of observability and monitoring in the cloud, and how Cloudera is using the insights gathered from monitoring. Yolanda also shares some unique observations about the way she sees the public cloud being leveraged to scale efficiently for performance and cost savings.

Join as we discuss:

  • Cloudera’s Apache NiFi and how it streamlines data transfers, operates within the cloud, and deploys dataflow 
  • Exploring the combination of observability and AI
  • Scaling efficiently for performance and cost efficiency 

Tim Veil:

Welcome again to another edition of Big Ideas in App Architecture. I am thrilled today to be joined by my good friend, Yolanda Davis, who is principal software engineer at Cloudera, focusing on the Cloudera Data Flow Operations Team. She’s the team lead for the Data Flow Operations Team. I had to look at my notes to get this correct, because I didn’t want to mess up the long title, but Yolanda, welcome to the show. Very, very glad to have you on.

Yolanda Davis:

Thank you so much, Tim.

Tim Veil:

Oh, you’re quite welcome. So, in previous episodes, the way we usually like to get these started is just to learn a little bit about you, about your background, about how you got into this crazy business of leading teams, building applications. You and I have known each other for a while, so I have a little bit of history, but I would love for you to share with the audience a little bit of the Yolanda Davis story because it is so interesting.

Yolanda Davis:

Yeah. Well, I was one of those kids in the ’80s that really tinkered with computers. Actually, my mom, she noticed, okay, I love Barbie, but also I was dissecting Barbie. I was taking apart Barbie’s cars. So, my mother being a social worker, she’s very aware and observes a lot of things with kids and she said, “Okay, my daughter’s a little different. Let me see what I can get her into.” She didn’t know at the time that the computer itself was going out. It was being discontinued, but it was at an affordable price. Have you ever heard of the Adam computer? It was sold at an equivalent of a Best Buy. I think it was called Best back then. She brought that computer home and I was just all in.

I didn’t know there was such a thing as being a computer programmer, but I knew I wanted to figure out how to make this thing work. I was about 9 or 10 years old at the time. Fast forward to when I was going into college, I actually went as a mechanical engineer, spent a good two and a half years hating it. That’s no shade to anyone who is a mechanical engineer, but I always thought that engineers, that eventually they would write programs. So, that’s why I got into it. A good friend of mine at least in college, we were in the same scholarship program. He looked at me and was like, “You hate engineering and you talk about computers all the time. Why don’t you become a computer science major?” I was like, “Well, wait a minute. That’s a thing?”

Because this is 1996. I didn’t know that you could actually go to school for that. If I didn’t say it already, this is at the University of Maryland, College Park. That changed my life. I talked to him that afternoon. I made a call to the dean of that school. I got transferred the next day and the rest is history. So, I have been an active developer since 1998. I think looking back on it, I’ve done a lot of different things. Tim, even before you and I met, gosh, it’s been almost 14 years, Tim, since we worked together. But before then, I had spent some time working for the state at the University of Maryland while I was getting my first master’s degree. During the dot-com era, I was working for one of those types of companies, doing a bit of application engineering support type deal.

I spent some time as a subcontractor with the government. It’s been a while, but several different agencies there. I was a consultant for many years. So, within that six-year timeframe, right before you and I met, I did everything from working with small companies, startups, you name it. But the single thread that has been consistent for me in my career up until the time you and I met is I loved programming, but I really loved working with data. So, a large part, looking back on things, was that I was even writing reporting tools or systems that would integrate data. Way back when, I would work with MQSeries in order to do basic enterprise messaging and that kind of thing, ordering systems. When you interviewed me, I think the first project you had me do was visualizing data.

So, I always had that knack and I tie it very closely to one of my favorite classes in undergrad, which was database. May he rest in peace, Dr. Jack Minker, he was renowned in that area. That was the first class when it just clicked for me. So, I just took that everywhere I went, but I didn’t want to be a DBA. So, I found positions and opportunities where data was always central to that. So, once you and I started working together, I always talked to you about this new area, big data at that time, this new area, data science. I wanted to get all in because I always saw it as the next phase. We have always had this history of how to best manage data, how to make it perform well in terms of when you need to search and query things.

Of course, there’s always been technologies like OLAP, I don’t know if anybody really uses that as much, but at least in the traditional sense, where we wanted to create these warehouses in order to do some analytics on it, understand current state and maybe trends. But now with that introduction of big data, it got into what I was always interested in. It’s like how can we forecast or figure out and predict what’s going to happen? Also, you and I, we watched the technology change and shift. So, once you and I parted ways, but for a moment, I spent some time, but for a moment.

Tim Veil:

Just a moment.

Yolanda Davis:

Yeah, I went to Concur that was later bought by SAP because I wanted to chase that data journey for myself. I really wanted to get into data science. I was still taking a class here or there. Then finally, here you come again, grab an opportunity at Hortonworks where it was like, if we last this long, that’ll be good enough. I really weighed that opportunity, because I was at a place where I saw a lot of potential in how Concur from the travel perspective and that reporting perspective. But then there was a difference between that and working with the people that are creating the software and the technology and getting into that space. So, my thinking then was, “I’m going to take a huge risk,” because I didn’t know any of the stack at that time.

I knew how to write applications, but not when it comes to distributed platforms and everything that supported it. It was all net new with a very trying project. Well, we figured it out. So, my first phase at Hortonworks was, as you’re aware on that professional side, I spent nine months working on that PS side. But then the balance there was, “Okay, I’m understanding the stack a bit more, but do I really want to get on the road?” But then while at Hortonworks, there was a little company that everybody was talking about internally called Onyara who was led by Joseph Witt. Next thing you know, Hortonworks buys Onyara, which they have a lot of specialists who were part of the committers of the Apache NiFi Project.

While I was still in professional services, I thought to myself, “Are there opportunities for me to land in engineering where I could contribute?” Then now that this new NiFi team was formed, they were looking for people to work on frontend or backend work. I’m like, “Okay, I’ve done both. I love data.” Apache NiFi helps to solve the problems of how people will obtain data from various sources and route them with some level of reliability with the interactive command and control, which is key there. I said, “Oh, this is a good opportunity.” So I’ve been hanging out with Joe Witt and the rest of his team, which is much larger than when we started, ever since. Hortonworks has since merged into what we now know as Cloudera, and it’ll be eight years in June.

Tim Veil:

It’s an amazing journey. I know for a fact that the Cloudera team and certainly Joe Witt’s team is incredibly fortunate to have you. I mean I’ve said this to you before many times, you’re one of the most gifted all around engineers I think I’ve ever met. I don’t think there’s ever a problem we couldn’t put in front of you. It didn’t really matter what the space was, the technology was. If Yolanda is on this problem, it’s going to get solved. So, I know they’re very lucky to have you.

Yolanda Davis:

I appreciate that, but also, the strength of this team has been outstanding. I started off going from PS into more of a senior role and working my way up, but the strength that I think I was able to exercise is the ability to collaborate well. So, there’s a huge difference when you have such a strong team and solid leadership. That makes all the difference in the world, I think.

Tim Veil:

So I know you and I both know a lot about Apache NiFi and certainly Cloudera as a Hortonworks employee myself for a long time. But maybe just for the audience who may not be as well-educated and versed on these technologies, I mean maybe give us just a brief overview of what… I mean, I know it came from Onyara, but what is the problem space? What is Apache NiFi out there doing? What is it trying to solve? Then maybe talk a little bit about how it fits maybe into the overall larger Cloudera product offering, because I know that’s evolved quite a bit over the years as well.

Yolanda Davis:

Well, I’ll start really with how NiFi was formed. So, the project itself came out of the NSA, which, as you can imagine, is in the getting-data business for various reasons. But those who started that project, they were trying to solve this problem: they were writing code all the time in order to create jobs or whatever streams of data, to get data from, whether it’s a database or some other external resource, and land it someplace else with some level of reliability. Some people who were involved, they would always have to go to a particular person or go to a particular developer. So, Joe and folks basically said, “Well, what happens if we were to make it a lot easier for people to design how they want to transfer data from one place to the next no matter what the source or the sink will be?”

To make it very simple, have this interactive command and control, how it’s called, or just a user interface to do that work. It’s a little bit different from if you’ve ever worked with some older school reporting systems or ETL jobs. I know we’ve worked with some in the past where you have something where you design the job, the workflow first, and then you submit that job and then it executes. Well, NiFi is different. It’s interactive command and control. So, you can design where you want data coming from, how you want it to land, the level of reliability or retries all within that platform or within that UI. Then you can say, “I want to run it immediately without submitting some job.” That’s what distinguishes it from traditional ETL.

Tim Veil:

Yeah, there’s very much a real-time aspect to it. So, I can get data flowing through the system and I can drag these widgets around, at least this is my recollection from a couple years back, drag widgets around. I can add forks and joins to this stream of data and I can do that all in real time as opposed to, in effect, defining some job, compiling it, pushing it off to a server. I’ve got this interface where I can watch this data move back and forth and I have complete control in real time on where that all lands.

Yolanda Davis:

That is what is a huge distinguisher between, I find, a lot of other similar products where you have to have a programmer, a developer. So, what this does is it takes it out of that maybe traditional data engineering team. Now, for example, we haven’t gotten to it yet when it comes to the data science stuff I do on the side. I can put it in the hands of data scientists where they just know simply how they want to set up their data, where they want their data to land.

So, it basically broadens your user base and simplifies these things that you find yourself creating over and over again. I think that was another thing too. I’m doing these same activities over and over again, the same sorts of things. NiFi brings all of that together along with some of the guarantees and reliability that we would expect, ensuring that when you send a message, it will be received, and how you can set up those guarantees for delivery. So, that’s it at a very, very high level.

Tim Veil:

Yeah, it’s really cool tech. I mean I was using it years ago and I was really blown away by it. I know it has to have evolved and grown even since then. I would imagine too and you would know better than I at this point, I mean, I’m sure there’s lots of other people claiming to do the same thing. But I think the interesting thing about NiFi, at least from my perspective, it was the first. I mean it was the first in the space that was really, really targeting this, “Hey, I’m going to make this super easy to organize the flow of data in my system.”

Yolanda Davis:

Yeah, it was definitely the first. Early on, it might have been 50 different, what we call, processors or different sources. I think now you have hundreds that the community has contributed to help go from any types of sources and sinks that you can possibly imagine. You can also expand to say it’s not just about sources and data at rest, so to speak. It’s also if you wanted to kick off another job or integrate with other downstream, more streaming applications, machine learning, APIs. There’s a whole broad range of possibilities with NiFi. To answer the other side of the question, how does it fit in the Cloudera ecosystem? Really, you see it almost in the beginning of the story.

When it comes to the hybrid cloud or the hybrid platform that we talk about, whether or not you’re on a cloud or you’re in private space, how can you connect the dots? We are in the middle of that. So, it’s not just a story of land everything all in Cloudera. We know that’s not the whole truth. People will have different aspects or use different providers for different things. So, NiFi helps to serve as that bridge to send data to all different types of places, especially within the cloud. So, I think that’s why it’s a huge part of our broader platform story.

Tim Veil:

I think what we’re finding, certainly, even though I left Hortonworks, I’m still very much in the data game. I mean, Cockroach being an operational distributed database. I mean even it needs solutions like NiFi because data is a little bit everywhere now as you hinted at. I mean, there is data in relational or operational databases, there’s data in analytics, there’s data in streaming technologies. It’s just out at the edge and centralized places in the cloud and moving data between those places, keeping a handle on it, understanding where it’s going, rerouting it if necessary to other places as new technologies emerge.

These aren’t simple problems to solve. I think especially when you get to data at the scale that Cloudera’s operating, Hortonworks was operating over massive amounts of data, you need to have this really efficient command and control. You’ve mentioned resilience and reliability a lot, and it can’t go down. You can’t lose data. Data can’t end up on the floor. This can be really, really important stuff.

Yolanda Davis:

Also, to add to the story, we’re talking about NiFi and its interactive command and control, but that’s not the only part of the NiFi story. So, there is MiNiFi, which is also part of the project when we talked about a smaller lightweight form factor of how do we capture data that is out there on various devices. Even now, we have the support of NiFi as a function, I think we call it internally, but basically how we can run NiFi within the Lambda space. So, the serverless space.

Tim Veil:

Oh, really?

Yolanda Davis:

Yeah, really. So, there’s a lot of different form factors that NiFi comes in. Some of those things are even more actualized within our CDF product, which we actually offer. Of course, NiFi itself is available as open source. However, within Cloudera, our cloud offering is Cloudera Data Flow, which is powered by Apache NiFi. So, we support almost like a platform as a service aspect of delivering NiFi and it allows people to deploy their flows. Actually, we have something now, a more recent release has what’s called a flow designer. So, you can design and deploy your flow within the cloud-based environment.

Tim Veil:

Without having to operate your own infrastructure.

Yolanda Davis:

Yeah. I mean that’s a huge thing. I mean, as part of my operations lead role, my team is charged with how NiFi operates in the cloud and some of the scaling and observation or monitoring that we do.

Tim Veil:

I was going to ask you to maybe go into a little bit of what your current role is, because as you alluded to at the beginning, you’ve been at Cloudera for eight years. You started in professional services with me and have gone on to much bigger and better things. Tell me a little bit about the current role, because I know you’re doing some more operations stuff, maybe less engineering than you had originally been, but I don’t want to describe it for you. I want you to tell me exactly what the role looks like today.

Yolanda Davis:

Oh, sure. So, I would say my role looks like it’s more of setting a vision for and supporting and mentoring my team in terms of how well we can operate NiFi in the cloud. What I mean by that is it is not just ensuring that it performs well for flows, but also, what does that look like? How well does it scale in order to meet whatever amount of load? What does the scale look like, whether it’s horizontal or vertical based scaling? I know everybody talks about observability. I just want to get visibility in place. So, we’re always reviewing what metrics we expose, and one of the people on my team is exposing more metrics such that we can not only take advantage and see what’s going on from an is-something-going-wrong perspective, but also, I have a huge interest in being able to forecast the need. So, there are certain hooks there that already exist.

So, for example, for anyone who works with Kubernetes, there’s something called the Horizontal Pod Autoscaler, for example, that will allow you to by default configure scaling based off of CPU. But we have a lot richer metrics from the application itself. So, we’re setting things up so that we can scale on things such as if the internal flow gets backed up for any particular reason, being able to scale such that we can handle the load that it needs to handle. Then once we get to that point, then it’s more about, “Okay, how can we start recommending the footprint based on the load?” So it’s been a journey, at least in my view, of ensuring we have the right infrastructure. So, I know you love to ask this question, Tim, what are we using behind the scenes?

So we use and rely heavily on tools like Prometheus in order to collect our metrics. It’s not there for a long-lived history of metrics. It’s really more of like a short-term collector for long-term goals. So, we use it in order to collect the metrics and then alert on issues, but also we react based on those alerts. So, it’s not just showing to the user. We might put things in a particular state based on an alert. As I mentioned, we are going to be scaling based on other things that metrics find. So, a lot of the work that my team and I do is part of that story of how well NiFi behaves from a scaling perspective and being able to project that.
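To make the scaling idea concrete, here is a minimal sketch of the replica calculation that the Kubernetes Horizontal Pod Autoscaler documentation describes, applied to an application metric rather than CPU. The metric (queued items per pod), the target value, and the replica bounds are hypothetical placeholders, not Cloudera's actual configuration.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count from the documented HPA formula:
    ceil(current_replicas * current_value / target_value).

    current_value is the observed per-pod average of the metric
    (here, a hypothetical count of queued items per pod).
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    ratio = current_value / target_value
    # Close enough to the target: keep the current size to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # Three pods, each averaging 8000 queued items against a target of 5000.
    print(desired_replicas(3, 8000, 5000))
```

The same function scales in when the backlog drains, which is where the cost side of the scaling story comes from.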

Then the other side too, and this is the area where I say I am not the expert, but I did a great job, as did my managers, in interviewing folks whose background was in creating applications that basically support NiFi within the cloud, which is based in Golang. I have some Golang, but I made sure some folks on my team were the experts. They’ve been doing this for several years, creating operators that know how to extend the Kubernetes API such that we can declare we want a NiFi running. The journey for me is just understanding programming as it relates to Kubernetes. I’ll say in that regard, I have learned just as much from my team and the experts on my team who support us in that.

Tim Veil:

It’s tough, isn’t it? Don’t you think?

Yolanda Davis:

No, I mean I’m not a young person anymore.

Tim Veil:

I mean, I just did another podcast and we were closing up on this idea of Kubernetes. It’s come up in various things. I mean, I struggle with it. I struggle with the terminology, the API, because at Cockroach, we’ve dabbled in building various operators and want to extend it and do all this other stuff. So, my team and I have had, at various points, to dig deep into Kubernetes. I fancy myself a reasonably intelligent person, and I’ll sometimes read about this stuff. I’m like, “What the heck are they talking about here?”

Yolanda Davis:

It’s going into a declarative mindset. I think that’s the very lowest level. When it started resonating with me was when I understood, “Oh, databases are declarative too.” So SQL as a language is a declarative language. You’re not writing what you see when you run that explain plan and see everything it’s doing behind the scenes. Then also, understanding that internally it’s like a state machine. It’s always trying to maintain state. So, the code is structured that way. I think for you and me, that’s like a game changer. It’s a mind shift.
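The declarative, state-machine idea can be sketched in a few lines. This is only an illustration of the general reconcile pattern, not the Kubernetes API or Cloudera's operator; the function names and the dictionary-based state are hypothetical.

```python
def reconcile(desired, observed):
    """Compare desired and observed state; return the changes still needed."""
    actions = []
    for key, want in desired.items():
        have = observed.get(key)
        if have != want:
            actions.append((key, have, want))
    return actions


def apply_actions(observed, actions):
    """Pretend to act on the system by updating the observed state in place."""
    for key, _have, want in actions:
        observed[key] = want


def control_loop(desired, observed, max_passes=5):
    """Reconcile repeatedly until nothing is left to change.

    Returns the number of passes that had to make changes.
    """
    passes = 0
    while passes < max_passes:
        actions = reconcile(desired, observed)
        if not actions:
            break
        apply_actions(observed, actions)
        passes += 1
    return passes


if __name__ == "__main__":
    desired = {"replicas": 3, "mutual_tls": True}
    observed = {"replicas": 1, "mutual_tls": True}
    print(reconcile(desired, observed))
    control_loop(desired, observed)
    print(observed == desired)
```

The caller only declares the end state; the loop works out the steps, which is the same division of labor as a SQL query and its plan.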

Tim Veil:

Yeah, it has been tough for me and to the point where now I’m just like, “Okay, I’m going to let somebody else really get into these details.”

Yolanda Davis:

I’m probably now only in recent time, and if my team watches this, they would laugh because they know, have I really been in that reconciliation world in the operator, but at the same time, I look at the overall vision. Because there’s certain fundamental things that I know it should do. I need to ensure that persistence is there. I need to ensure that the networking considerations are covered. So, when they would raise anything up to me, you still go to the foundational things that make sense. Because once you figure out how to spin the thing up with your operator, there’s still certain things that it just should do. So, a lot of times when they would talk to me, it’s about those things. I’m like, “Okay, are we dealing with mutual TLS when we talk between one component and the other?” It’s those things.

Tim Veil:

I just always walk away feeling like, “Ah, it doesn’t have to be this complex. It could’ve just been named a little differently. It would’ve made a lot more sense.”

Yolanda Davis:

At this stage of the game, I’m okay with not knowing everything, but what I do need to know is at the end of the day, does it help to solve the problems that we want to solve? When it comes to the world of Kubernetes in terms of the level of reliability that the platform offers or the framework offers and it comes to some of the self-healing traits that we know we want to take advantage of, especially when it comes to how well we want it to perform, we just have to roll with the punches. Honestly, I’m not mad at it internally operating as a state machine. I mean, at the end of the day, its job is to always ensure that it’s at a particular state. How it does it gets interesting. Then Golang, I got a certification in it. I did, but it’s a whole other ballgame.

Tim Veil:

So CockroachDB is written in Go.

Yolanda Davis:

Yeah, yeah, I remember.

Tim Veil:

So I was like, “Look, I’m going to order the book. I’m going to read the book.” Amazon shipped me two for some reason. So, I have two copies of the Golang book. I haven’t gotten through either one of them. Not that I would read them both, because it’s the same book, that would be silly. But they sit in different corners of the office and I think, “Well, if I’m over here, maybe one day I’ll be inspired to progress.” I don’t know, as a longtime Java person.

Yolanda Davis:

Yeah, I know.

Tim Veil:

Maybe I’m just getting too old for it. I don’t know. I want to learn it. I think I should learn it. I think I can learn it, but I haven’t done it yet. Kubernetes is the one thing that’s pushed me in that direction given some of the operator work because all the operators are written in Go.

Yolanda Davis:

Yeah, the classes, mind you, what I took was more on Coursera. I took that certification, but the style which I use for those types of things is different from the style and the frameworks. There’s certain things where I would look at Go and I’m like, “I’m probably not going to use this or need this.” Even in the code base that we have, I’m like, “We never used that stuff.” But I’m sure there are certain applications of it where it absolutely makes sense, especially some of the event things that you can do with Golang. But sometimes I do put it on a level with Scala where when I did Scala, I’ve done it a couple different stints and I’m like, “What?”

Tim Veil:

I agree with you.

Yolanda Davis:

Scala is super powerful. I have a better appreciation for it now when I’ve seen it used in the context of data science for some of the libraries, but it’s so powerful, you can really get yourself messed up. That’s where I was like, “I don’t like getting in trouble with languages. I can see myself getting in trouble with that one pretty easily.”

Tim Veil:

So I wanted to go back to how you’re working really closely with observability. Observability is such a hot topic right now, and I mean I know it is for us. Just curious about your overall thoughts on observability and its importance to products and development or maybe Cloudera’s overall philosophy. I just know at Cockroach, I don’t think there is a customer we’ve had or a prospect we’ve talked to that isn’t curious about our observability story. I think it’s like 5, 10 years ago, we weren’t asking these questions. I don’t remember us ever worrying too terribly much about observability. Maybe it’s longer ago than that, but now it seems like it’s one of the hottest areas. I mean, what are your thoughts on it?

Yolanda Davis:

Well, personal hot take, I think observability is huge because we’re more in the cloud and it’s just harder to see things in the cloud. Just to at least level set on how the observability I work with is different from data observability. So, I’m focused more on watching our components, what we use operationally, which is different from watching data and then the history or the provenance associated with data. But I personally think it matters because it’s so hard to see things in the cloud. When we talk about observability, I hone in more on the ability to monitor first, which I think is a subset of that.

The things that I know I’ve read, which I personally agree with, is there is a subset of things that you know you want to monitor, which is different from having access to things or making your applications such that they can be observed, the metrics that you might not know about and the things that you might need to discover that you’re not aware of. That’s an interesting balance in the cloud, because eventually, that data has to go somewhere and it costs money. So, for us, when I think of the CDF story from an observability standpoint, how we can ensure that our product is working well, it’s ensuring that, “Okay, how well is it running? Well, first off, is it working?” I know that seems like well, isn’t that just-

Tim Veil:

Is this thing on?

Yolanda Davis:

You’d be surprised because there’s a difference. Parts of the application might think, “Oh, yeah, that’s up, that’s running.”

Tim Veil:

That’s a huge issue. That’s a huge issue.

Yolanda Davis:

There’s still exceptions, and you don’t know it. So, how do you monitor that? How do you even know? So just finding new ways or additional metrics exposed just to get that information. So, that is at least the story as I’m working with my team to improve how well we monitor things and then not only seeing what’s happening right now, but then the analysis. So, when I want to get these metrics, even if things are fine, how can I collect that? Where is it going to go long term? Thankfully, we take our own medicine, so we do use Cloudera Data Warehouse internally, which is our data warehouse product.

I know that I can send some of that data internally such that I can start analyzing for some customers. That’s the growth plan, but overall, at least I want us to get to a place where we can anticipate what is going to happen and not only anticipate, but apply immediate remediation. That, in my opinion, is the real story, because you can watch it all you want. What are you going to do about it and how are you going to eliminate a human from doing the action? That is really my vision for how we operate: for us not to have to call support or for a customer not to have to.

We should get to a place where that is rare and either empower the customer to take an action or better than that, automate that action, whether it’s through a scaling, whether it’s through an autotune. We’ve had some recommendation based on how you’re running, we’re going to change the footprint. The pie in the sky is how can we adjust or tune your flow or make recommendations. That’s really where the superpowers come in.

Tim Veil:

I think I was sharing with you when we met earlier, I mean meeting with Andy Pavlo, who’s a professor of databaseology at Carnegie Mellon. He’s over the last couple years started this company called OtterTune, which is looking at using AI to monitor databases to determine which knobs to tune to gain efficiency or performance. I think this idea of this combination of observability plus AI to not just alert you that something’s wrong, but to take decisive action on your behalf to either make an improvement or eliminate failure or anything. I think this is to me where I think the future’s headed.

Yolanda Davis:

Yeah, and not even failure, cost mitigation. So, you are not just scaling in order to handle performance, but you’re also scaling in order to fit within a reasonable cost point. So, I think it has tons of potential.

Tim Veil:

Well, and I think cost, we talk about this a lot and I’m sure you all are talking about this a lot at Cloudera. I mean, we’re talking about data. Data is the lifeblood of most organizations today. If data becomes unavailable, if it’s corrupt, even if it’s not down, but the response time for the tools that rely on this data is so slow that it might as well be unavailable. I mean, these things have real cost to businesses. If I’ve built an infrastructure on Cockroach and Cockroach can’t respond, that’s a huge problem. If I’ve built an infrastructure that relies on NiFi to move data and NiFi goes down, this is a huge problem. It has real cost associated with it. I don’t know if this is true, but just reflecting back on my own career, I don’t think we thought about it that much 20 years ago. Maybe I didn’t. It was just like, “Ah, these things are going to work and I’m not worried about it,” but boy, a lot more is on the line today, I think.

Yolanda Davis:

Well, I think too, just looking back, at least to when you and I were working in the educational space, with the type of software we were building, the risk wasn’t as high. A lot of times, we would work for a place and they were the consumers of that software, wherever we worked. But then you get into a space where you’re creating software that others may use for things which could be mission critical. Put it this way, the federal government is also a customer of ours. So, there might be things that we don’t know about where reliability could be a huge risk, and it’s not just a monetary loss. So, yeah, I think it’s not that those concerns didn’t exist. I think now, the scale of those concerns is a lot broader.

Tim Veil:

I think so. I think it’s very true, the scale and just how important data is to more and more organizations. It really is the lifeblood. I think, again, failing to keep it consistent, keep it alive, keep it available has a huge impact. So, I wanted to switch gears on you just a little bit. Well, not too terribly much, but you recently went back to school.

Yolanda Davis:

Yes. At a certain age, I laugh now. Thankful, it’s done for now, for now.

Tim Veil:

I mean you and I have talked about it a bit here and there, but just for the listeners, I mean, what were you going back to do and what was that experience like? Because I know there are a lot of people who get to our age and are like, “Man, I wish I could…” I don’t want to say hit the reset button, but you’re always learning, always should be learning. So, maybe just walk us through what that was like, where you went, what you did, what you learned, and what ultimately you want to do with it.

Yolanda Davis:

Yeah. So, I mean to pick up where I left off, at least on my data interest story. So, while at now Cloudera, even though I’m within the operations team, I thought as I work with my team and as I help develop these products or this product, I thought I’d have an opportunity to really understand a bit more about data science and machine learning. I just didn’t have a structured way to do that. I’m the type of learner where either I’m going to have a class, which as you know, I was always taking some course. I always took some course with someone somewhere. I’ve done that for as long as I can remember at least, especially since college.

So, I got to a point where, at least in my career within Hortonworks/Cloudera, I was a manager at the time, before I switched to be more of an IC/principal, where I was working with NiFi but in a different vein. I was like, “I’m not getting any closer to my goals that I came here with, which was to learn more about data science.” So I decided to go back to school. I evaluated several programs and I specifically landed at the Harvard Extension School, which is the remote program for Harvard under the Harvard University system. I chose Harvard because they allow you to try it before you buy it. So, basically, their model was you could take… I think it’s still these two classes. If you do well, that’s part of your acceptance.

So, I’d already gotten to a known school with a great reputation, but I wanted to be able to try the program. What was great about it was that it was extremely rigorous. I was questioning my decisions, but I made it and I did well in it and I wanted more. It was one of the best decisions I feel like I made. It took me three years to graduate. All of the degrees that the extension school issues are ALM. It’s a master of liberal arts in extension studies in whatever subject. My subject is data science. It allowed me to take classes, some of which were jointly offered under Harvard University, and then some were under this extension program with people in industry, because that to me is the most valuable.

So, I took everything from early data science, just putting a process to the work. It is completely different from software development as a process. I think of it more as being like a scientist in the lab. You’re going to have a hypothesis. Sometimes the data that you have is going to allow you to achieve that hypothesis. Sometimes it’s going to disprove it. So, it’s more like that mode. Then I took a lot of different things under data science, which I know is a hodgepodge of different areas. So, I’ve taken an AI course, I’ve taken deep learning courses, I’ve taken deep learning for NLP. I found that where I lean the most, or the things that excite me the most, is less about some of the things that you see now with deepfakes.

So, less about computer vision or generative based models, but I do have an interest in predictive analytics based things. I’m very interested in taking regular structured data and just getting insights there, but I’m also interested in natural language. I had an awesome professor, Chris Tanner. He was a lecturer at Harvard. I think he does stuff at either Stanford or MIT. I’m sorry, Chris, if you’re watching this. He’s moved to a couple of different universities in the Cambridge area and I learned a ton from his class in particular. But to answer your question, where I see myself going is I’m signing myself up for it again. I’m going to try to apply to either a PhD-based program or doctorate program.

Tim Veil:

Really?

Yolanda Davis:

Yeah. I am learning that there is still a huge space for businesses to really understand how to apply this research and not jump on the next hot thing. There are a lot of considerations around ethics in AI and the bias that can be introduced that I think are very important and almost paramount for businesses to understand and incorporate into their process. I think there’s opportunity to grow within that space and contribute in that space. Another thing actually, you don’t know about this, one of the last things that I did at Harvard was working with a small nonprofit called WildTrack. They actually do work with detecting endangered species through tracking their footprints using machine learning and computer vision.

So, I’m continuing my work with them on the side as if I don’t have enough things going on right now. So, another part of my interest is helping nonprofits get on board with this technology, because they’re doing really cool things in that space when it comes to eco-diversity and just applying machine learning and artificial intelligence. Yeah. Then in my day-to-day, like I was telling you, us forecasting scale, that’s always been the road for the last two years. Now, a lot of the pieces are in play to leverage. How can we use time series based data in order to inform how we scale that thing? So yeah, it’s a lot of different applications, but I’m crazy. My hope is next year sometime, I’ll be shaking my head in front of some professor at the ripe age of 48.

Tim Veil:

It’s amazing to me. I think it’s so impressive that you were able to do that and wanted to do that. It’s so important to keep learning and there are so many fascinating things out there in the world.

Yolanda Davis:

It’s amazing. I mean, the amount that this field has changed even within the last five years, Tim. So, to be prepared, my thing is to not as much be up on the technology, but to have a strong foundation for it. But that’s no different from when you and I started, right? We got our degrees.

Tim Veil:

No, I agree. You mentioned this ethics and some of this AI and biases. I mean, I think there’s a real potential growing problem there.

Yolanda Davis:

Yeah, it’s not just growing. It is here. Yeah.

Tim Veil:

It’s pretty fascinating stuff. I think these are disconnected things, but certainly things like ChatGPT, which you hear about more and more, are, I think, but one extension of this growing dependence on artificial intelligence and machine learning. Those are all driven by data sets that may or may not carry with them…

Yolanda Davis:

Yeah, embedded biases.

Tim Veil:

Embedded biases. There’s not a lot of governance there.

Yolanda Davis:

Definitely not like a broad governance. There are definitely people who are on the ground fighting the good fight. Timnit Gebru, she has an organization that’s focused on this work and studies there. She’s actually an ex-Googler who created her own research foundation to do research within this area. Joy Buolamwini, she’s also somebody who put this at the forefront in terms of facial recognition. So, those are only two people just off the top who are not only vocal, but are challenging from a policy standpoint, putting checks in place. But for now, the best thing for businesses is to be forward thinking. If they are creating models, do they have model cards, which basically describe: here’s what this model does, here’s the risk of using this model, here are the data sets that were applied? The same goes for those who contribute data sets.

I think they’re called data sheets or data cards, which basically describe the risk of using this data. On top of that too, there are certain techniques around de-biasing that could be employed. That was one of the things that I learned in the Harvard program. I didn’t know going in that that was such a thing. I always thought, “Well, it’s a factor of the data that you use,” but there are also techniques that people are using to actually de-bias models. I think the moral of the story is that what will be important moving forward is the grassroots effort: businesses willing to follow these particular standards so they won’t get burned, quite frankly. So, it won’t come back to them like it has for several of even the bigger players out there.

Tim Veil:

I think it’s important when we’re going through these very, I think, transformational periods, certainly related to technology, that you do take a step back and say, “Yes, it’s important to move really, really fast.” I think a lot of people are doing that, but with all of these innovations, some good can come, but some harm can come from it as well. I think it’s important to be very mindful as companies are adopting these new technologies that they’re not recreating some of the challenges of the past.

Yolanda Davis:

I think it’s being mindful, but also making it part of your process. You can think, “Oh, yeah, I should do that,” but part of your process should be, if I’m creating a model, understanding its usage and understanding the risk. Build that in as part of whatever auditing or development process or model development process you have. That way, just like your software lifecycle or your development process, it becomes a built-in thing that is checked for, as opposed to something that somebody might add and somebody might not.
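
To make "a built-in thing that is checked for" concrete, here is a minimal sketch of a model card treated as a required artifact in a release check. The ModelCard class, its fields, and missing_sections are all hypothetical names chosen for illustration. They loosely mirror the items mentioned in the conversation (what the model does, the risk of using it, the data sets applied) and are not taken from any particular model card standard or library.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class ModelCard:
    """Hypothetical, pared-down model card."""
    name: str
    intended_use: str = ""                                   # what the model does
    known_risks: list = field(default_factory=list)          # risk of using it
    training_datasets: list = field(default_factory=list)    # data sets applied

    def missing_sections(self):
        """Return the names of sections that are still empty."""
        return [key for key, value in asdict(self).items() if not value]


def ready_for_release(card):
    """Process gate: a model ships only when every section is filled in."""
    return not card.missing_sections()


if __name__ == "__main__":
    card = ModelCard(name="flow-scale-forecaster",
                     intended_use="Forecast dataflow load for scaling")
    print(card.missing_sections())
    print(ready_for_release(card))
```

Running a check like this in the same pipeline that builds the model makes the card part of the process, so it is not left to whoever happens to remember it.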

Tim Veil:

Yeah, no, I think it’s really important. Well, I know we’re running up on the top of the hour and I don’t want to take too much of your valuable time. So, maybe as we bring it in for a close here, maybe it’s two questions and maybe it’s one answer, maybe it’s two different answers. But just curious, as you look forward to this year, obviously, you’ve accomplished a lot with certainly the degree from Harvard and all the things you all are doing at Cloudera, but what are some things that you’re looking forward to in the next 12 to 18 months, whether it’s you personally, whether it’s at Cloudera? I mean, what are some of those exciting things on the horizon for you?

Yolanda Davis:

Well, I would say for me, I’ll speak to, at least within Cloudera, within CDF, some of the things that I’m excited about. My team, we’re now getting into a space where we can do predictive analytics within the ecosystem. So, even though it’s not quite what someone sees when they interact with it, it’s more about having greater reliability and being able to project scale. That to me in the next 12 to 18 months is paramount. It will hit on the nose what, at least for me in my role, I’ve envisioned since the end of 2020 in terms of possibilities of what we could do. Then outside of that, for me personally, I am super intrigued, even though there are a lot of checks and balances, when it comes to ChatGPT. There are a lot of things that people are doing where I’m like, “Man, if this was around, I would’ve done…” I’ll give you a perfect example. Gosh, Victor Dibia, he is a researcher at Microsoft and he’s actually ex-Cloudera.

He recently came out with something that allows you to automatically do exploratory data analysis. Anybody who does data science knows one of our early steps is you want to analyze the data set that you have and generate your visualizations and give some summaries on things. Well, I think he created something built on top of a large language model to generate that. You tell it, “Hey, I want you to create this EDA.” You give it a data set and it spits it out. He just published a paper on that and I was like, “I want to try these things.” Yes. So, there are all these different applications and things that I would love to get into. So, my hope is that in 12 to 18 months, I could figure out how to either tap into that technology or even use it to empower the day-to-day work. I do have some internal thinking for NiFi on how that could apply, but I’m going to keep that one to myself. But yeah, we’ve seen some interesting things that large language models can do in that space. So, that might be a thing as well. You never know.

Tim Veil:

No, I agree. I mean, I was just at the Gartner conference in Orlando a couple weeks ago. The guy in the booth behind us, he was using a ChatGPT style interface to just query databases. It was just like, SQL’s a very natural way of communicating with data or databases, but this was just spelling out in English, “What do I want from the data?” Then it was returning the answer in narrative form, as well as a table of the data, as well as the SQL query used to execute it. I thought it was incredibly fascinating.

Yolanda Davis:

Yeah, it’s very cool and at least the study that I did empowers me with, “Okay, now I know how that works.” So the growth path for me is, okay, how do you take research like what Victor and others have done and apply it to various things that maybe are net new, or how can we build upon it? So yeah, it’s very cool stuff to see in action.

Tim Veil:

Well, Yolanda, as always, I really enjoyed our chat today. Thank you so much for joining the podcast. Hopefully, we have you on again sometime in the near future.

Yolanda Davis:

It would be my pleasure.

Tim Veil:

This was a blast, talking to you, as it always is.

Yolanda Davis:

Likewise too. It’s so good to see you.

Tim Veil:

Thanks as always for listening to the Big Ideas in App Architecture Podcast. If you like what you heard and want more, tune in to our webpage linked in the description below. Give us a rating of five stars on your favorite podcast platform. We’ll talk to you next time. Thanks. Bye.

Big Ideas in App Architecture

A podcast for architects and engineers who are building modern, data-intensive applications and systems. In each weekly episode, an innovator joins host Tim Veil to share useful insights from their experiences building reliable, scalable, maintainable systems.


Tim Veil

Host, Big Ideas in App Architecture

Cockroach Labs
