The VideoVerse

TVV EP 22 - Playback experience in YouTube by Chas Mastin

December 30, 2023 Visionular Season 1 Episode 22

In this episode, Chas Mastin shared his insights about how YouTube ensures premium playback experience for its users. As one of the largest streaming platforms globally, YouTube faces the challenge of delivering supreme experiences to millions of concurrent users and playback sessions from around the world.  Chas discusses the technical challenges in CDN peering, obtaining QoE metrics from the players, and how to manage a diverse range of different end devices and video players.


Welcome to "The VideoVerse."

Zoe: Hi, everyone, and welcome to "The VideoVerse" podcast. So this is Zoe again, and today I have my colleague from Visionular, Bo, to join me as a cohost. Hi, Bo.

Bo: Hi. Hello, everyone.

Zoe: Okay, great. So for this episode, we actually feel honored to invite Chas Mastin from YouTube TV to join us. And hi, Chas. Good to have you down here.

Chas: Hi, Zoe. Thanks for having me.

Zoe: Yeah, as with the other guests we've had in previous episodes, I always like to let you introduce yourself to our audience. So, okay, Chas.

[00:00:53 Background of Chas]

Chas: Yeah, here I go. So I'm Chas Mastin, and I am a senior engineering manager at YouTube. And the way it works at YouTube is that YouTube also encompasses YouTube TV and YouTube Kids and YouTube Music. It's all part of the same stack in terms of the actual delivery of the experience, I should say. Obviously different apps and different teams are involved, but I'm part of the video infrastructure group.

Specifically, I am the engineering manager over what is called the Playback Experience client teams. So Playback Experience is where the video is actually sort of rendered onto the surfaces of the devices. It's where it's decoded. So we work within a world we call the media library, which, you know, also encompasses dealing with QoE, quality of experience, from our users, getting that feedback and building systems to more reliably deliver great video and audio experiences.

So the way I got here is probably sort of an interesting story, or maybe not, I don't know. I don't know how common it is, but I didn't start coming out of college as thinking I was gonna be a technologist. I graduated from GW in '93 with a philosophy degree, which isn't really the best way to get into technology because I didn't really want to. I wanted to be a comedian, and that's what I went and did for five years. And while I was out on the road performing in the mid '90s, the internet was happening, right? And the internet in those days was not something, the World Wide Web was not something that could deliver video.

But, you know, I had this sort of, what you would call like a, I mean, a religious experience, a vision in the mid '90s that it would, right? And that the basis of our future communications was going to be on the internet, as it is today, both in a peer-to-peer sense, but also in a video and experiential delivery sense. So I said to myself, "Well, how am I gonna do this? Like, how am I gonna be part of this industry?" And at that moment, I decided that that was what I was gonna devote myself to in my life. And I had to learn how to program. I hadn't really been a programmer; I had always noodled around on computers, but I had to learn systems design. I had to learn how to build websites. I had to learn whole new technology stacks. And then flash forward 10 years later, in the aughts, I was the CTO of a company that was delivering video. And we did it with Flash. If you remember Flash.

And we could talk about Flash a lot here. Flash did not survive the transition into mobile technologies for a variety of reasons, which we can talk about. And so unfortunately, even though we pivoted to sort of augmented reality, that company didn't survive the global financial crisis, so I was sort of back out in the world saying, "Okay, what am I gonna do?" And at that point, I got some great opportunities. I was CTO of a couple other startups, and then I became VP of engineering for an R&D company that did a lot of early machine learning work and worked with robotics.

Zoe: When was that? You said early machine learning. What were those years?

[00:04:16 Early machine learning]

Chas: So this was around 2011, 2012. I started working with a company called Control Group, which is now a company called Intersection. And I was running the R&D group there. And so we were using computer vision systems to do all kinds of image detection. So that was, I think of it as, the early years. Of course, there was already 40 years of development on these neural networks before then, but-

Zoe: But still actually, yeah, the deep learning-

Chas: Yeah, go ahead, Zoe.

Zoe: Actually, the recent wave of deep learning started, I guess, around 2012, so you were really early, actually.

Chas: We were very early. I remember, you know, downloading and trying out convolutional neural networks and just being shocked. I was shocked at what they could do, because we were training things very traditionally at that point. And then suddenly, you know, CNNs and GANs and all these things came along, and it was magic. And so that was great to be part of. While I was there, though, I got this opportunity to go over and be a senior engineering leader at a company called FuboTV. And at the time, you know, OTT was really on its ascendancy, and people were like, "Hey, we can use the internet to actually replace cable." And I had a lot to learn really fast, and I'd love to talk about my experience at Fubo and all the things I learned.

Zoe: That's great.

Chas: But mostly what I learned, more importantly than anything, is the sort of basic principles of what you would call SRE, site reliability engineering, right? And these are how to set SLOs properly, you know, how to actually communicate with data in a lot of ways. And really the most important thing I learned at Fubo is the importance of QoE data, having it as your North Star when you're building more robust delivery systems. And I think there's a lot of stories we could talk about there.

Then after COVID hit, I was working from home and a new video service came along called Peacock, and they contacted me and said, "Would you like to come over and be the head of our video quality and CDN management?" So that was the role I had at Peacock for about two and a half years. I got them from literally the first days up to, you know, over Super Bowl scale, you know, all of the great things that NBC brings because they also, I learned a lot about, you know, just film delivery, sports delivery, you know, their commitment to excellence as a company.

And then just over a year ago, I got the opportunity to go over to YouTube and work within the Playback Experience group. And it has been an amazing, amazing ride for me. I feel like a year ago I was starting from scratch, because the way that YouTube and Google do things is really, really different. And it's really great to experience, you know, not just the commitment to excellence, but also the processes they have in place to ensure that a very, very critical system for our world's information exchange isn't damaged because somebody pushes, you know, a bad release. The number one danger, of course, within any video system is trying to improve it, because as you try to improve these things, they're very fragile, and, you know, you have to have robust players and robust delivery systems to make sure that they don't go down.

Zoe: Wow, okay. That's your experience. I knew that you had abundant experience, but not like this. I didn't know that you didn't even start from tech in college. At every stage, I can feel that you pressed a new start button and learned new things, while also loving your previous experiences. So we like to hear a lot of stories down here, because here we talk about video, talk about media. So we'd like to discuss, for example, some of the experiences at Peacock you mentioned, right? Before you joined YouTube. And prior to that you were at FuboTV. And then we'd like to learn a little bit: how do you feel? Now you are at YouTube TV, so how do you feel the difference compared to the work you had been doing at Peacock and Fubo, for example?

Chas: Yeah, so I think that the best way to think about it is, at both Fubo and Peacock, we were starting from the beginning, right? So we had to establish, like, what metrics do we care about? And I remember distinctly coming into Peacock on day one, and day one we had an outage, right? We had an outage on a particular CDN.

And we had to look and say, "Well, what do we consider an outage?" Right? Was it that rebuffer went up, or did it take longer to join a stream? And you know, to me, it provided this great opportunity to talk about what we were gonna establish as a baseline for performance. And to me it was playback failure, right? So ranking your metrics is the number one thing to do. Like, everybody loves metrics, and everybody loves to talk about data, but what you actually need to do is prioritize it. So the very first thing that my team and I did at Peacock was say, "There is a specific ranking of the failure modes we care about. We care about crashes of the application and playback failures number one, because that's a huge disruption to users. Number two, we would put rebuffer, right? Number three, we would put bit rate and quality. And number four, we would put join latency." And, you know, I'm not giving the exact rankings, but I'm giving you an example of how we stacked everything.
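The ranking Chas describes can be sketched in a few lines of code. This is purely an illustrative sketch; the metric names and the ordering follow his example, but none of it represents Peacock's or YouTube's actual tooling:

```python
# Hypothetical sketch of an explicit QoE failure-mode ranking, so that when
# several metrics degrade at once, everyone agrees which one drives response.
QOE_PRIORITY = [
    "playback_failure",  # 1: crashes / failed starts - worst disruption
    "rebuffer",          # 2: stalls during playback
    "bitrate",           # 3: degraded visual quality
    "join_latency",      # 4: slower stream start
]

def most_severe(anomalies):
    """Return the highest-priority anomalous metric, or None if all healthy."""
    for metric in QOE_PRIORITY:
        if metric in anomalies:
            return metric
    return None

# rebuffer outranks join_latency in this ordering
print(most_severe({"join_latency", "rebuffer"}))
```

The value of making the ranking explicit and shared is that incident triage stops being a debate: the highest-ranked degraded metric is the incident.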

And by doing that, we gave ourselves a framework for how to respond to problems and how to prioritize those problems. Because the one thing that I will call out for NBC and Peacock, they have this incredible thing that I've never seen in any other company: problem management. They have a team of people that, when there's a problem, sort of just makes you fix the problem over time, right? Sometimes it takes months and months to find the root cause and solve an individual problem that might exhibit as just a small spike in rebuffer. So having a team like that was really important for us to get to the level of performance and scale that is needed. And I actually have a really important point that I think a lot of people miss. Video is not only about quality, and it's not only about scale. It's actually also about scale at quality, and those are three completely separate things, and if you want- Go ahead.

Zoe: So you mean three things, right? Scale, quality, scale and quality. That's three things.

[00:11:23 Three things matter: Scale, Quality, Scale and Quality]

Chas: That's right. So scale, like, anybody can just buy as many CDN terabits per second as you need, right? And you can just throw money at a problem. Quality, though, is not that kind of problem. Quality is actually part of your delivery infrastructure: how much, you know, data you can move in and out, how much compute you can commit to transcoding, encoding, packaging, those kinds of things. And scale at quality, if you want to solve that problem, what you have to do is take very, very, very seriously the quality of experience of your users at times that maybe you don't care about it, right?

So like a Tuesday afternoon, suddenly there's a spike in rebuffering. And a lot of people would say, "Well, that's transient. It's gone, it doesn't matter." But if you want to scale at quality, those things really, really matter because they often expose the basic fundamental flaws that you might have in your delivery system. It could be at origin, it could be in the CDN, it could be in the packager. And if you want over time to ensure that during the largest events, the Super Bowls, that you're performing well, you have to take those moments seriously. And that's the thing I learned at Fubo. And if I could tell you a quick story of how I learned it.

Zoe: Of course.

Chas: So, when I first came in at Fubo, part of our stack, I don't know what the stack looks like now, but at the time the stack was really fronted by, obviously, a CDN, and in front of the CDN were Redis clusters that basically replicated our manifests to then be delivered to the CDN. And our setup of Redis at the time was okay, but at certain times, right, it would malfunction and it would have problems. So, you know, I think a lot of companies might have said, "Well, you know, it sort of works. It works fine. Don't worry about it." But, like, something felt really wrong to me. And so we spent a lot of time really focused on trying to improve those Redis clusters and making them performant. And we used load tests, and we used, you know, optimization techniques on the server and on local machines. All that being said, it wasn't good enough. We had a major outage two months into my role there during a World Cup... Sorry, World Cup, what do they call those?

Bo: Soccer.

Chas: Soccer, but not the tryouts. It's the World Cup qualifiers, right?

Bo: A qualifier, okay.

Chas: Yeah, thank you. Thank you, Bo. And oh my goodness gracious. I mean, it was a massive problem. In fact, it led to the firing of the CTO, and ultimately it was incredibly stressful for people. And what it showed me is that I should have taken those early inclinations that something was wrong even more seriously. Like, I could feel that there was a problem there, and it was a problem we did solve afterwards. But if you want to prevent those major outages that are gonna cost you, potentially, millions of dollars or a lot of your customer base, you have to take those things seriously. And that gets to, I think, the core of what great video services require, which is people who are committed, at the very base, to having an empathetic response to your users, right?

You have to care about your users. It has to really bother you when somebody is having a bad experience, because otherwise you're not gonna put in the time and the effort, literally your time on the planet. You're not gonna devote it to helping those people have a good experience. And, you know, sports video is obviously a part of this. But also, you know, when people buy movies, you know, VOD. VOD is a lot easier to deal with than live. But ultimately, you know, if somebody is getting a bad experience on VOD, as a leader in this industry, it is your job and your task to represent the users. And to that end, by the way, all of our technical leaders spend time on Reddit, right? Like, we read those threads. Like, we feel the pain of people that are saying, you know, "Hey, this is not what I paid for." Like, we feel it, and we do our best every day to solve and resolve those problems.

Zoe: Yeah, I think you mentioned a word, empathetic. I actually had to learn how to pronounce this word, really, and I do feel like it touched my heart when you mentioned it. I still remember you once mentioned three Ps. I can't help mentioning that, because you said people first, product second, and process third. I think this is along the same theme as what you just said, right? If, as a team creating a product, you feel empathetic and put yourself in the users' shoes, there are a lot of things you may care about. And because you care, those issues can actually be resolved before they become a disaster. That's really what I read from the story you just mentioned.

Chas: Yeah, thank you for bringing that up, 'cause I wanted to talk about that. And I actually wanted to talk about it in relation to Agile software development. You know, I was already building software when the Agile movement started. Agile came out of a conference in 2001, where a bunch of these theorists on building software got together and said, "Look, here's some principles we can agree on." And somehow all of that revolutionary Agile Manifesto turned into Scrum, and the great irony is that Scrum already existed in 1995, before all of this. It just wasn't really popular as a process.

And so now we have a generation of people that when you say Agile, they mean, "Oh, Scrum? Yeah, I've done that," right? But like, that really misses the point, right? Like, the whole point of Scrum was to have a feedback mechanism with the user, right? In this case, the user is the person who's paying to have the software built, is the way to think about it. And so, you know, ultimately the most important aspect of Agile to me was always about delivering software and doing it fast and also that sense of empathy with the person who's paying. And in the case of a product like video online, like, we're all paying, right? Whether we're paying for free with our time or whether we're actually paying money, you know, we have to cater to that user.

[00:18:13 People are more important than product and process]

So I've sort of boiled the Agile philosophy down into this very simple-to-remember thing, which is, you know, people are more important than product, which is more important than process. And the people are not only your end users, but also your software engineers. Because the reason we want to move faster and do better is not just because we want to serve the users. Engineer frustration is the ultimate thing that gets them disillusioned with this industry, that makes them leave companies. Like, engineers get into this so they can build great things and move fast. And if you put too many things in front of that and you make them unhappy, then they leave. They find a better job. So people are first, because software gets built by people. Even with AI, it will be built with people in the loop.

The second is product. You have to care about your product. You have to be passionate about it. Like, product is not something that just lives within the domain of people who have product owners or product managers within their title. Like, every engineer, you know, I always endeavor to tell them, you know, "This is the most important thing you're gonna be engaged with, is caring about this product and making it great for our users." So I think that when you do that, you build real leaders within the industry.

And then process is really important. But process only serves the product and the people, right? Like, process on its own can be a huge waste of time, and we have to be very careful. And that's exactly what happened with Scrum in our industry, that, you know, instead of becoming this lightweight process that could be adjusted by teams, it became an actual like, heavy process with, you know, reporting and estimation and all of that, that frankly, a lot of it's not even in the "Scrum Guide." Like, you know, if you look in the "Scrum Guide," it doesn't talk about planning poker or putting point values and figuring out velocity and all that. That's like cargo culted stuff that came on top of it. So, you know, unfortunately I've sort of abandoned Scrum because I think that it leaves such a bad taste in people's mouth when it's improperly instituted that it's more important just to be simple and say, "Look. We're great people working together. We're gonna care about our user. We're gonna care about our product, and we're gonna improve that process. We are all gonna take a role in making the process as lightweight and as efficient as possible."

Zoe: Right. I think, Bo, you also once experienced similar-

Bo: Yeah. Right, yeah. So I'm thinking that YouTube TV is different from your previous employers like FuboTV or Peacock or many other streaming services. So I'm curious about how you know what features the users need and how you collect those feature requests from customers. Basically, I want to know the product feature decision process at YouTube TV. Like, how do you decide what features are needed?

Zoe: Yeah, I had a similar question, indeed. I just wanted to make a comment here, because for Peacock and FuboTV, to me, I think the target was sort of clear. But talking about YouTube, the user base is huge and the variety is also huge. Different segments, different scenarios, different regions may have all kinds of different requests, right? So I have a similar question. There have got to be a lot of feature requests that stem from the market. There could be so many. So how are you going to make the decision?

Chas: Yeah. It's a fascinating question. And by the way, I want to state upfront that I am not speaking on behalf of Google or YouTube TV in these situations, but I can share some basic philosophical principles that I've observed, which I think are unique to YouTube. YouTube thinks about the globe, right? They think about the billions of people on the planet, right? We have billions of unique users a day. And so, generally speaking, when a feature is presented at YouTube, or a feature is proposed, whether it comes from a suggestion from users or whether it's just sort of known that this is an area or a strategy we want to move into, it needs to work for everyone, or at least as large a proportion of the population as we can make it work for.

A great example of this is, you know, we brought Multiview to Sunday Ticket and to YouTube TV in general this year. And I think it's an amazing counterpoint to the way Multiview was introduced at FuboTV where at FuboTV, we knew we could do it on Apple TV so we immediately started there. But then we started realizing, "Well, there's gonna be real limitations on smart TVs and in other places because they don't have the processing power to decode multiple streams, right?" You know, the Apple TV is a high-end computer that you're paying a lot of money for, and it works great as a streaming device. Your laptop can decode multiple streams at once. DRM streams, obviously, is what I'm talking about. But a regular smart TV can't. It just doesn't have the abilities and the capabilities to do it. And I know that's frustrating for people who spend a lot of money on these TVs, but also realize that these TVs, a lot of them are going for $100, right? And where they're skimping is on the processing power, right?

So, you know, ultimately our approach to bringing anything to market, from what I've seen, is to always consider the totality of the user base first. And, you know, it's not that we don't have future plans; we've already announced future plans for, you know, some level of customization, to whatever level is gonna be practical. But I do think that does change the approach a little bit.

And I actually wanted to make one other point. One of the frustrations that I've seen from engineering managers and engineers is a lot of times they feel at places like Fubo or Peacock that their ideas don't get addressed, right? That they don't get answered because they have really good ideas. They spend their days in and out of the code. They understand what the capabilities are of the devices. And that's not true at Google and YouTube. You know, the best ideas bubble up from the bottom or they come from tech leads or managers and, you know, there's consensus-building that happens around them. And the one thing also, I will say, working with such amazing folks at Google and YouTube, any idea you're gonna come up with, I guarantee you somebody's had it before, and so you can find an ally really fast, right? You can be like, "Oh." Someone will say, "You should go talk to so-and-so. They're actually already working on that." So, you know, YouTube is a highly collaborative environment, and I've certainly felt more listened to there than at any other place I've worked.

Zoe: I see.

Bo: Yeah, good to know that.

Zoe: So, back to, because we're here actually talking about the YouTube scale, right? And just now we were talking about people, especially engineers, belonging to this category. So considering the big scale of YouTube, you just talked about how scale plus quality is actually a very different category as opposed to scale or quality individually. So talking about YouTube, with such a big scale, how can you guarantee, or at least keep improving, the quality of delivery?

Chas: Yeah, it's a huge topic. And by the way, you know, I'm in no way an expert in this. I really am not, but I will say-

Zoe: Yeah, I think you once mentioned that there are so many different quality metrics, right? But you also mentioned that we don't just look at the metrics. There's a priority down there, and there are distributions down there. And also, I remember one region was mentioned, as far as I remember, like Pakistan: they actually blocked something, and that one thing actually had influence in other regions, where some issues stemmed out. So because of the scale, one part has some issues, and it seems like, okay, they stop, they block, that's it. But actually it has all these different connections with other things because of the scale, and then-

Chas: Yeah, yeah, let me just jump in. I've got so many thoughts on this, I could spend the next hour talking about it, but I'm gonna try to coalesce everything a little bit. You know, the disadvantage of YouTube's scale is the cost of the scale. And it means that the only way you can manage delivery across the entire planet is to own the entirety of the stack. You have to own everything, all the way down to the literal processors in the computers, through the data center, through the networking, you know, through the entirety of the infrastructure and the way it's distributed, all the way to the players, right? We own this, and it allows us to control the quality and the cost of it. There's just no other way to do it.

And one of the advantages of having such scale is that you don't chase unnecessary ghosts, right? So one of the problems as we were scaling up at Peacock, for the first year or so: if you don't have enough users on your service, every single little networking problem is something that you're gonna have to diagnose and figure out. And especially if you have a CTO that is wagging his finger at you saying, "Hey, go figure out what this was." And a lot of times, like, it is literally last mile.

And I could tell you stories about, you know, I've seen metrics move at Peacock because somebody apparently was like, sitting on their remote controller and like, repeatedly making requests, like tens of thousands of requests to, you know, to a particular playback. You know, right, so this is a problem of scale. Like, when your scale is too small, you're actually gonna be chasing these phantoms because go back to what I was saying before, you have to chase them if you really want to know how your system works.

But as you reach a scale of YouTube and you also reach a maturity of your metrics and understanding of them, then you don't have to spend your time on those things because the metrics themselves are so normalized just based on the number of users that you have. And so that saves you a lot of time. And also you can create what I would call synthetic metrics, which would be aggregates of your various other metrics, right? Like, so you could say, you know, "Hey, I'm going to combine crashes and startup time and rebuffering into this other synthetic metric that isn't gonna move up and down all the time."
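A synthetic metric of the kind Chas mentions might be sketched like this. The weights, the normalization bounds, and the function names here are all invented for illustration; they are not any service's actual formula:

```python
# Illustrative sketch of a "synthetic" QoE metric: a weighted blend of
# normalized component metrics, so noise in any single metric doesn't
# move the aggregate. All weights and "worst" bounds are made up.

def normalize(value, worst):
    """Map a raw metric onto [0, 1], where 1.0 is the worst tolerated level."""
    return min(value / worst, 1.0)

def synthetic_qoe(crash_rate, startup_s, rebuffer_ratio):
    """Blend crashes, startup time, and rebuffering into one score."""
    return (
        0.5 * normalize(crash_rate, worst=0.01)       # crashes dominate
        + 0.3 * normalize(rebuffer_ratio, worst=0.02)
        + 0.2 * normalize(startup_s, worst=5.0)
    )

# Alert only when the blended score crosses a threshold, not on every wiggle.
print(round(synthetic_qoe(0.001, 1.2, 0.004), 3))
```

Blending normalized components dampens the variance of any one input, which is exactly why such an aggregate "isn't gonna move up and down all the time."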

Because what you want to do is task your engineers with solving the real problems about, well, what are the optimal retries on your player based on your stack and the type of server-side ad insertion and your delivery method and all of that? Like, how do I scale, you know, and how do I prove that I can scale past a certain point? Like, all of these are like, how do I encode things with the optimal bit rates for certain types of content? Like, these are the hard questions, and you can't be chasing phantoms while you're trying to answer these hard questions.

So I do think, one other thing I want to point out about YouTube: I've had people ask me, you know, "Can you explain to me how YouTube does it?" Like, how YouTube simply delivers, right? And to sum it up in one short paragraph, YouTube does this so well because they've had 18 years of doing it, right? The actual system is not designed, it's evolved. Yes, each individual part of it was designed by super intelligent engineers, but this is nothing that anybody could spin up, you know, in a year or even 10 years.

And the kinds of changes and innovations that YouTube has brought, first off, users don't see them, right? They don't understand them, and they're all behind the scenes. And the only way you get there is by creating what we call OKRs, right? Which are basically, you know, key results that we're trying to deliver, very specific performance improvements. If you don't have that, and if you're not willing to challenge people, "Can you move this startup time, you know, 4% over the next year?" it just won't get better, and you'll have an okay service, right? But if you're gonna get into streaming, whether you work at a big company or a small one, being okay just is not gonna cut it.

Zoe: You actually mentioned OKRs. I remember my old times down there, trying to make the OKRs. And you basically mentioned that this is 18 years of accumulation that finally built to where YouTube is today. As you just mentioned, we still want to go back to this quality and scale thing that we have talked about, right? I think you have really focused on reliability: how we're going to keep the reliability and adapt to different cases. There is always something unexpected happening, and then how are we going to design, and really gauge and monitor at the same time, with these monitoring stats as feedback that tells us what we could do better?

[00:33:02 Core reliability metrics are based on the user experience]

Chas: Yeah, exactly right. Like, so our core reliability metrics have to be based on the user experience, right? Like, the usage of RAM or CPU isn't meaningful outside of how human beings are consuming video, in my estimation. Like, you could sit there and say, "Hey, we've made something a little bit more efficient," but if you've reduced the quality of the experience, then it's something you probably shouldn't have done. So that constant tension and balance is part of, I think, our day-to-day life at YouTube, and it's something I'm proud to say that I represent the user, the voice of the user within this organization, not just me, but my entire organization does. And we do our best to figure out, you know, how to reliably deliver at scale in conditions that are not optimal.

Because a lot of our focus is actually on, and most of the people in the world, if you were to say, "What do most of the people in the world have?" They have an Android device on a really poor network, right? That's most of the people in the world. And if you're thinking about delivering to the globe, that's not true in America, right? You could say, "Oh, well the people who buy YouTube TV are different, they watch..." And they are, so you have to think about, you know, your segmentation, but also like overall, you can't just deliver something for people who have a really high-end streaming device and not be thinking about, you know, the low-end consumer. And the same logic has to be able to cater to both of these situations.

So, you know, I can't really get into the details of how we do it at YouTube, but I can tell you that it's not by magic numbers. I would like to speak for a second on magic numbers if that's okay. You know, magic numbers, since the beginning of my experience in programming back in the Flash days, they're the easy things. They're attractive, right? How many times should I retry if I get a 404? Well, people will say, "Well, a 404 means there's not a file there." But does it? Because it turns out that a 404, depending on the origin you have, might be returned because you didn't actually authenticate properly, right? And maybe it takes a little bit of time to authenticate. So there are cases where 404s... Or maybe a 404 is delivered because the media isn't saved yet on the origin, right?

So HTTP codes are not, like... you know, I think naively a lot of people think, "If I could just figure out the magic number, then everything's gonna work great." But eventually, after 20-plus years of seeing magic numbers fail, you start saying, "Well, I gotta get more sophisticated about this. I gotta think about these as a distribution, something that can change over time in different conditions." And, you know, the answer once you move past magic numbers can't be a manual person in there, a human being with a dial, right?

So these are things that we automate. We figure out ways to build resilient systems that are not based upon hard-coded values. And, you know, that's where we get into the super-sophisticated world of machine learning and all of the incredible things that can be done in this world. And I don't think there's a right or a wrong way to do those things. I think the right and the wrong is deciding that magic numbers are not gonna be part of your stack and then over time evolving away from them.
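Chas's point about treating retries as a distribution rather than a magic number can be sketched as capped exponential backoff with full jitter. This is a generic pattern, not YouTube's actual logic; the `fetch` callable, the set of retryable statuses, and the constants here are all illustrative assumptions.

```python
import random
import time

# Statuses treated as possibly transient. As noted above, even a 404 can be
# transient (an auth race, or media not yet written to the origin), so this
# set is a policy choice, not a universal truth.
RETRYABLE = {404, 429, 500, 502, 503, 504}

def fetch_with_backoff(fetch, max_attempts=5, base=0.5, cap=30.0):
    """Retry `fetch` with capped exponential backoff and full jitter,
    instead of a hard-coded 'retry N times every T seconds'."""
    for attempt in range(max_attempts):
        status, body = fetch()
        if status == 200:
            return body
        if status not in RETRYABLE:
            raise RuntimeError(f"permanent failure: HTTP {status}")
        # Full jitter spreads retries out so thousands of clients don't
        # hammer the origin in lockstep after a shared failure.
        time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    raise RuntimeError("retries exhausted")
```

The jitter is the part that matters at scale: without it, every client that saw the same failure retries at the same instants, which is exactly the synchronized load an origin recovering from an outage cannot absorb.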

Zoe: Yeah. I was thinking, by what you just described, 'cause with this reliability in mind, will that be actually hard if you introduce something new? For example, you want to actually get some upgrade or introduce some new feature or you want to actually provide a new service to this huge user base. And I would just wonder, would that be actually making things hard because, oh, I want to keep this system to be really reliable, and I want to monitor all these things to make sure that they really just, at least because if you don't do, usually we actually, inside our team, if you do not want to make any mistake, then do nothing. If you do anything, no matter what you do, there's always a probability you'll make mistakes. So, yeah.

Chas: It's a high probability, right? So this gives me the opportunity to speak to, you know... first off, I want to make two points on this. One, I would say that 90% of all outages that I've seen are because of the introduction of some feature, right?

It's not just because there's some flaw. It's because, oh wait, we wanted to have DVRs that were longer or we wanted to do something different, and then that creates a cascading problem, right? That can create outages.

So knowing that, how do you safely develop, right? And there actually is a very specific answer to this, and the answer is via feature flag-based development. That is, you think about this as experiments, right? That instead of saying, you know, "Hey, I've got a new thing I built. Let's release it and see what happens." You instead-

Zoe: That's dangerous.

Chas: You don't do that, right? You instead say, "Hey, I'm gonna release to 0.01% of my customer base. I'm going to examine that traffic as a separate slice, and I'm going to make sure that my metrics don't move in that slice, and then I'm gonna increase it to five or 10 or..." Like, there are different arguments to be made for how you do rollouts, but ultimately, every single thing within Google and YouTube that I've seen is an experiment, right? And that's the right way to approach this. It's not just specific to YouTube and Google. Even a small company like Fubo had a feature flag-based system that allowed us to quickly change values without deployments, because especially in the mobile space, deployments can take weeks, right? And rollbacks can take days. Like, you cannot just sort of YOLO improvements within an application and then release it. You need to gate things behind feature flags. And those gates themselves should not just be a Boolean. You should be able to direct and say, "Hey, I want this device to get 20% of this traffic." And then you can examine that and be rational about any of your changes.
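The "release to 0.01%, then widen" rollout Chas describes is commonly implemented with deterministic hash bucketing, so a user lands in the same slice on every session without a stored assignment. A minimal sketch; the function name and hashing scheme are illustrative, not YouTube's flag system.

```python
import hashlib

def in_rollout(user_id: str, flag: str, percent: float) -> bool:
    """Deterministically bucket a user into a percentage rollout.

    Hashing (flag, user_id) gives each flag its own stable, independent
    slice of users, with no server round-trip per decision."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return bucket < percent / 100.0
```

Widening the rollout from 0.01 to 5 to 10 percent is then just a served config change, and because the bucketing is stable, the experiment slice stays comparable across sessions and app restarts.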

And part of that is understanding that in a complex, dynamic system, any small change could have impacts outside of your system that you're changing. And especially for things where we're very sensitive to, you know, how many requests per second are coming from a client. So you could be, you know, theoretically you could create a very robust player that is retrying all the time that actually DDoS's your origin, right? So you want to be able to, once again, it's not about finding the magic values, necessarily, but it is about safely finding the things that work to give you the opportunity and the runway to make more sophisticated improvements over time.
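One common guard against the "robust player that DDoS's your origin" failure mode is a client-side retry budget, where retries are earned as a fraction of recent successful requests. This sketches the token-ratio idea used by systems like Envoy's retry budgets and gRPC's retry throttling; the class, names, and constants are illustrative, not YouTube's implementation.

```python
class RetryBudget:
    """Allow retries only in proportion to recent successes, so a
    degraded origin isn't hammered by its own 'resilient' clients."""

    def __init__(self, ratio=0.1, retry_cost=1.0, max_tokens=10.0):
        self.ratio = ratio            # tokens earned per successful request
        self.retry_cost = retry_cost  # tokens spent per retry
        self.max_tokens = max_tokens
        self.tokens = max_tokens      # start with a full budget

    def record_success(self):
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def can_retry(self):
        if self.tokens >= self.retry_cost:
            self.tokens -= self.retry_cost
            return True
        return False  # shed the retry; fail fast instead of piling on
```

When the origin is healthy, successes keep the budget topped up and retries flow; when it degrades, the budget drains and the client backs off automatically, with no magic "max N retries" constant doing the real work.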

Bo: Yeah, and I think, so you work on the player team. There are a variety of different players and devices on the market, like the web player, Apple devices, Android devices, Roku players. So how do you handle that heterogeneity of devices? Certain features may work on one device but not on another. How do you handle those things?

Chas: Well, I'm gonna speak less about what we do at YouTube and more just sort of my philosophical position on it. I think the right approach is to use players that are open source, like Shaka and ExoPlayer, and there's a variety of other ones. Or there are great players out on the market, you know, Bitmovin and JW Player. I'm not gonna criticize them, necessarily; they solve these problems very well. But I would say that the best way to solve this problem in a very general sense is to find players that can work across a wide variety of platforms. Because what that does is it keeps your player teams small and cross-functional, right?

'Cause what you don't want to have to do is spend all this money to hire all these people that are experts in ExoPlayer and then realize that ExoPlayer is this small slice of your market, right? You know, I've seen companies be able to choose one single platform that covers like 50% of all of their traffic, right? That's the kind of sweet spot you want to shoot for, and beyond. Unfortunately, within the Apple world, most people are gonna be stuck with AVFoundation. Not unfortunately because it's a bad player; it's just that you're gonna have to have some experts within the iOS and Apple world, you know, on that stack.

Bo: I see, I see. Hmm. Okay, interesting. So you want to use open-source players that are widely supported across different devices and platforms, so as to minimize the development and maintenance effort.

Chas: Yeah, and also there's a lot of people that have built ways to integrate with things like Shaka already, right? There's a lot of folks that have contributed to that. So, you know, I've mentioned Shaka a few times, but honestly, if I were gonna start a new video streaming service tomorrow, I would probably choose Shaka as a player, and ExoPlayer, or Media3, sorry, whatever they're calling it now, and AVFoundation. You know, with those three, you're gonna get a large portion of the market. Roku obviously has their own players, and I would say you're probably best just starting with the Roku player. Roku is very, very reliable. They know what they're doing. You know, it's probably best just to play their game instead of trying to build something custom.

But yeah, unfortunately there's no silver bullet here. Like, we have to support multiple players even within my larger team, you know, we have multiple client teams, and I think you're not gonna be able to get around that. And I frankly think this is probably one of the last things you'll be able to automate, because these players have really complicated interactions based on the type of media, the type of scrubbing you have, the type of buffer fill you're using. Like, the various different failure states that can happen on networks and delivery need to be accounted for and fixed over time. And then, to Zoe's point, people want new features all the time. They're like, "Yeah, but I want it to do this, and let's do ads and let's do that," right? So you're gonna need some expertise in the players, and I would say it's one of the best investments you're gonna make if you're a streaming company.

Zoe: Right, yeah. So basically, I feel like no matter how much we care about reliability, that should not prevent us from adopting new features, right? Because the world is always advancing, and we always want to improve our offering. So talking about this, I don't want to actually predict, like, really what's going to happen next, 'cause it's always hard to predict. Even within our team we said, "Oh, we should have long-term and short-term goals." Because we're doing a startup, we can only see maybe... of course, ultimately we want to grow to whatever size, but talking about the next six to 12 months is already something pretty long-term.

So at this stage, Chas, I just want to get your view. What do you think the next step is, in terms of adopting features while at the same time maintaining the reliability and quality of the system? And I want to make one comment, actually. I was really, again, impressed by your point about people first. Because with every single step, whatever new feature we build, we want the end user to feel happy. But you also brought up that the people include not only the end users, but also, for example, the team members, right? The engineers. That recalls to me, Zoom's founder once mentioned that they really care about two kinds of happiness: team members' happiness and customers' happiness. That's along the same lines as what we're talking about here. So with that in mind, how do we see the next step? 'Cause I want to be considerate of the time, so let's make this the last topic.


[00:46:43 Next step]

Chas: That's great. All right, so, I just wanna get this clarified. Would you like me to speak towards what I think the future of video and audio is?

Zoe: Yeah, whatever you think the next step is. The spirit is just people first, and then the reliable business we really want to make sure of. So that will be the next step, yeah.

Chas: Yeah, so people are always gonna be what everything is about, right? It has to be, right? And that all of our sort of machinations with technology need to always be in service of not just human happiness, but rather human fulfillment, human meaning, right?

And, you know, to that end, I still am like, you know, for a part of my career, I was into augmented reality, right? And mixed reality, and that's what I was really interested in. And I think with the Apple Vision Pro coming out, we're going to have this sort of hero device that, you know, ultimately is gonna spawn on all kinds of imitators and eventually will go into our glasses, right?

And this is where video and audio are gonna go, right? Because it's, you know, spatial video and spatial audio. And, you know, how many years out are we, right? Like, I'm not even gonna tell you five years. I'm not gonna tell you 10.

What I'm gonna say is that, ultimately, the biggest criticism I could make of any aspect of video is that it sits us on our couch. It makes us stare at our phones. We're locked into place, but we are not beings that are meant to be in place. We're physical beings, and transforming our world via video, via turning everything into, you know, a different kind of experience, or just allowing us to see another layer of reality, right? Like, that in itself, I think, is transformative. And that's where I want video to be. I want video to be an aspect of our everyday life.

Like, I would love for someday for us to have this conversation as we're walking through the woods together. But I'm walking through the woods and you're walking through the woods, but we're both experiencing our presence in the same space, right? Like, that to me is the kind of experience that I want to have, and quality is gonna, it's only gonna work if our Wi-Fi systems work, if our wireless works, if we have the right compression algorithms and the right codecs and that we can have a, you know, surround sound that people actually want to use. Like, I think that ultimately it's all about bringing us closer together and together us exploring the universe.

Zoe: This is really... I mean, I won't say it's a completely new view, but really an alternative view. Because a lot of times we talk about, "Oh, you can watch videos," meaning that you can enjoy something virtually, but sitting right here. But you mentioned that as we develop more video technologies, it doesn't mean we have to be confined, because we are physical human beings, right? That should only encourage us to explore more of the world, provide something to actually bring people closer together, rather than a virtual world in parallel with our physical world.

Chas: Yeah, and by the way, I want to call out generative AI as part of this. And Gemini, the announcements recently are amazing. In particular, you know, the ability to create music on the fly via AI and have it as a shared experience, right? Or even something as simple, even without AI, as within the music app, where now you get lyrics that show you, you know, what the lyrics are as the song's going on so you can sing along, right? Like, those are the features that, you know, connect people with the media and make them experience it, not just in a passive way. So I'm a proponent of all things that sort of activate us as humans, whether it be educational videos that we're interacting with, or whether it just be using video for this kind of experience and this kind of podcast. Like, you know, this is just the beginning of a very, very rich future, which will have some rebuffering in it, but hopefully not a lot of playback failure.

Zoe: Well, thank you, Chas. And aggregating all the topics that we have touched on, I think there's something essential there: you talk a lot about people. We are part of the people, and all these new things we endeavor and work hard on actually make our life more beautiful, right? Make the user feel more fulfilled and happy. So we really feel grateful to you for bringing us this view of what we're doing with video. Thank you for your time. Really appreciate that.

Chas: Thanks, Bo. Thanks, Zoe.

Bo: Thank you.

Zoe: And thanks to everyone for paying attention to this episode. We'll see you next time.

Background of Chas
Early machine learning
Three things matter: Scale, Quality, Scale and Quality
People are more important than product and process
Core reliability metrics are based on the user experience
Next step