The VideoVerse

TVV EP 16 - Optimizing User-Generated Content for the world

March 26, 2023 Visionular Season 1 Episode 16

In this fascinating episode, we are joined by Balu Adsumilli, Head of Media Algorithms at YouTube. In a wide-ranging discussion, he reveals how YouTube balances optimizing video quality with preserving the creative intent of their millions of contributors, and how the media quality task is affected by the changing nature of UGC video with ever-more sophisticated contributors and the trend towards Shorts. Central to the task are metrics for assessing quality, and Balu offers his insights into the state of the art as deployed at YouTube and the limitations of traditional quality metrics.

Watch the full video version

Learn more about Visionular and get more information on AV1

[Announcer] Welcome to The Video Verse.

Zoe: Right, thanks everyone for joining this episode of The VideoVerse podcast. Today we are very happy to meet one of my old ex-colleagues from Google, Balu, who is the head of Media Algorithms at YouTube. And I don't have to say more, because everybody should know YouTube as the largest user generated content (UGC) hosting platform. Today we are going to talk about something definitely related to YouTube and UGC video technologies, and media technologies at large. At the very beginning, I would like Balu to introduce himself. Hi Balu. It's great to have you down here.

Balu: Hi Zoe, thanks so much for inviting me. It's really a pleasure to be here. Like you said, I lead the Media Algorithms team at YouTube. What we are in charge of is all of the transcoding for all the videos that get uploaded onto YouTube. We really unpackage them, process them and repackage them into formats that are played back on different devices like TVs and laptops and mobile phones and whatnot, right? In addition to that, we are also in charge of improving our pipelines and keeping them up with the latest technologies. So, we do quite a bit of research behind the scenes on all the advanced technologies and formats, both on the audio side as well as on the video side. Right? And like you said before, we worked very well together in the past and it's really a pleasure to be here.

Zoe: Well, great to have you down here. Let me also introduce, 'cause I'm the host, this is Zoe, and I also have my co-host Thomas to join with me. And Thomas, you want to have a brief intro about yourself and then you can go ahead and talk anything, yeah, with Balu on your side.

Thomas: Hi, I'm Thomas Davies. I work on video codecs for Visionular, and I've been working on video standards like AV1 and HEVC in the past. One thing that really interests me about what you do is just the enormous scale. Can you give some idea of just how much video you are handling?

[00:02:42 How much video is being handled]

Balu: Thanks Thomas. Yeah, it's, we see at YouTube quite a bit of video and that's an understatement. What we generally have is about 550 hours of video uploaded every second, sorry, every minute. And we process all of that in the background, right? We have about two billion plus daily users and like the playtime for video, is about like four billion hours per day, or something like that, right? For video.

Zoe: Exactly. I just know of billions.

Balu: Yeah, it's one of those things where, if you consider it against the population of the world, which is eight billion or so, it is a pretty decent proportion. So, we do see a lot of video, we do see a lot of channels, and our processing systems are massive to keep up with that scale, right? Both on the software side, as well as on the hardware side. So...

Zoe: Right. I think this is one thing, because you also mentioned previously that YouTube is right now hosting all kinds of different videos, and we believe there's a large amount of user generated content from individual contributors. As you also mentioned, every single minute there are more than 500 hours of video being uploaded to this platform. So let's get back to some technologies, because user generated content has a big variety in terms of its quality, right?

And then I think you were mentioning that you are leading the team doing all this back-end transcoding and other processing, before the videos are widely shared with the community. So, we would like to learn what's so special about user generated content, UGC, in terms of how you process it. Of course there is the scale part, but let's focus more on the UGC uniqueness.

[00:05:02 UGC uniqueness]

Balu: Yeah, absolutely. So, UGC user generated content, it's like substantially different than curated content, which is professional video, most of it broadcast, most of TVs and Hollywood, post-production houses end up doing a lot of, editing and processing on videos to curate it, to like the right look and feel and the narrative, right? To match the narration there. With UGC, all of that is less of an issue. I think most of the content creators are like you and I, who are like not working in production houses, right? We want to record some videos of what is more emotionally attached to us, right? So we want to take, videos of our kids, of our pets of like, things that happen when you go out and have a nice day outside, right? So, there is no like, hey, here's a story that I want users to be engaged in, that kind of a thing. But it's mostly about like sharing, giving everybody a voice and sharing that voice with whoever you want to share. And if it just happens to be the rest of the world, that is the platform, it ends up YouTube becomes that platform for you, right?

And so, a lot of times we get videos that are not as high in quality, not as well curated. A lot of saturation issues occur. A lot of compression issues occur, because folks don't know how to compress before uploading. A lot of pre-compression, a lot of editing issues also occur, because they want to ensure that there is some look and feel to it. They try, and in the process of trying, end up creating artifacts that are unintentionally incorporated into the user generated content. So, when the video gets uploaded and when we see the video, it's already artifacted. And so, when we look at the video, you are looking at something that is problematic, something that is not the best quality that you want, and so when you want to transcode it, it becomes so much harder, because your original is already artifacted, right? And it becomes harder in two different ways. Go ahead, Thomas.

Thomas: Yes, I was, no, so I was really gonna say what does quality even mean in this kind of situation, where you're trying to guess what the intent of the uploader is in some sense, if you are trying to improve the video? You could in a production environment, when you're compressing, you're trying to minimize the damage. But in this case, you're trying to enhance the original as well.

Balu: Not entirely. Although there's a good part of it that I think you have right. Quality and its understanding become extremely tricky, right? Most of the quality metrics that exist, well, to give you just a very high level summary, video quality in general has been a field for a long time, right? There have been people working on video quality and measuring it right from the 50s and 60s. But there are two different fields of video quality. One is the subjective metrics and the other one is the objective metrics. Subjective metrics are essentially asking a bunch of people and collecting scores on how good the video quality is. And almost always this is against a reference, right?

So, you have an original, you process that original, you come up with an outcome of it and you say, "Okay, let's compare these two, and how good do you think the quality is, if the original is five, let's say one to five?" That's a typical Likert scale. And most people give a rating, and the intent is that people are generally very accurate in telling what good quality is. But with UGC that falls apart, right? You ask 10 different people about the quality of an artifacted video without showing the original, and they give you 10 different ratings. So, it becomes extremely tricky to understand what the quality is.

The other component of this, the objective metrics, have also been developed from the perspective of a reference. Like, how far away from my reference am I? PSNR, MSE, SSIM, VMAF, all of these are reference based metrics, which means that you actively look at how different you are from what you think is an ideal image, or an ideal video, right? Even then, in this sort of UGC framework that breaks down, because your original itself is pretty bad. And so when you're trying to compress that bad quality original and you're trying to make sure that your PSNR is high, remember that these are pixel based metrics, right? So, if you look at MSE, going pixel to pixel and trying to get back to that faulty original and spending more bits to do that, you end up compressing it not so efficiently, because you're giving it more bits to reproduce the artifacts that were there in the original.
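To make the point about pixel based metrics concrete, here is a minimal sketch of MSE and PSNR in plain Python. The tiny "frames" are made-up lists of 8-bit pixel values, not real video, and the helper names are only illustrative. It shows how an encode that faithfully reproduces an artifact in the upload scores higher than one that smooths the artifact away.

```python
import math

def mse(reference, distorted):
    """Mean squared error between two equally sized lists of pixel values."""
    if len(reference) != len(distorted):
        raise ValueError("frames must have the same number of pixels")
    return sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)

def psnr(reference, distorted, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the frames are identical."""
    error = mse(reference, distorted)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / error)

# Hypothetical upload that already contains a blocky artifact (the 200s).
upload = [10, 12, 200, 200, 11, 13, 200, 200]
# An encode that spends bits reproducing the artifact almost exactly.
faithful = [10, 12, 199, 201, 11, 13, 200, 200]
# An encode that smooths the artifact away, drifting far from the upload.
smoothed = [10, 12, 100, 100, 11, 13, 100, 100]

print("faithful PSNR:", round(psnr(upload, faithful), 2))
print("smoothed PSNR:", round(psnr(upload, smoothed), 2))
```

Because the upload is the reference, the metric rewards the faithful copy of the artifact, which is the inefficiency described above.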

And so, having a good quality metric here is not only like a nice to have, but essential, right? So, to understand what a good compression level is, how much can we compress before we call it done, right? It's good enough quality. And so how do we define that? How do we understand that? Sorry Zoe.

Zoe: You mentioned that before the video is fed into the transcoder, it is better to have some idea about the quality of the input video, right? You also explained that a lot of video metrics are computed against a reference, whereas UGC videos are uploaded to your back end and are just there by themselves. So I think this falls into so-called no-reference quality metrics, where you want some idea of the quality before trying to transcode in the best way.

Balu: Correct. Yes. And no-reference metrics are really essential here, because they work well without knowing what a good original is. But internally they also have some correlation with the subjective metrics. The theory behind it goes back to the 1980s, or early 90s, where subjective opinion scores that are collected across a large swath of people, either professionals or novices, are a good understanding of the quality of the video, right? And so, you have an understanding from late 80s and early 90s technology that says subjective experiments, when done in a particular way, through an ITU standard, give you accurate enough information, right?

With X number of subjects always looking at this video and always following a particular scale, in particular viewing conditions and all of that. Then the scores collected through that, with the average of those scores, what is called the mean opinion score, form your benchmark; that's your golden reference, so to speak. And so, even with no-reference metrics, the intent is to have a very good correlation to the subjective scores, so that it tells you how good your no-reference metric is. And so, we have teams like LIVE at UT Austin building a few different no-reference metrics that are always rooted in these subjective experiments and MOS correlations, right? And so, there's a bunch of.
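The benchmarking idea described here can be sketched in a few lines of Python: average the raters' scores per clip to get a mean opinion score (MOS), then check how well a candidate metric tracks those MOS values. The clip names, ratings and metric outputs below are invented for illustration, and Pearson correlation is used as one common choice; the transcript does not say which correlation measure is used in practice.

```python
import math
from statistics import mean

def mean_opinion_score(ratings):
    """Average of individual ratings (e.g. on a 1 to 5 scale) for one clip."""
    return mean(ratings)

def pearson(xs, ys):
    """Pearson linear correlation between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical ratings from four viewers per clip.
ratings = {
    "clip_a": [5, 4, 4, 5],
    "clip_b": [3, 3, 2, 4],
    "clip_c": [1, 2, 1, 2],
}
# Hypothetical outputs of a no-reference metric for the same clips.
metric_scores = {"clip_a": 0.9, "clip_b": 0.7, "clip_c": 0.2}

clips = sorted(ratings)
mos = [mean_opinion_score(ratings[c]) for c in clips]
predicted = [metric_scores[c] for c in clips]
print("MOS per clip:", dict(zip(clips, mos)))
print("correlation with MOS:", round(pearson(mos, predicted), 3))
```

A correlation close to 1 would suggest the metric orders and spaces the clips much like the human raters did.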

Zoe: Yeah, Professor Alan Bovik. Yeah.

Balu: Yeah, Al Bovik's team. Yes, Professor Bovik has been in this field for a very long time, right? He practically wrote the handbook on video quality, right? So, his teams are doing some amazing work on how we can build no-reference metrics, and how we can build reference based, more perceptually aligned metrics, outside of the pixel based metrics, right?

And so that now suddenly becomes handy, and the usage of it is twofold, right? One, like you said, for every video that gets uploaded, we can run a no-reference metric and figure out where in this quality spectrum the video falls. But then there is also the secondary use of it where you probably don't need that at the entry point, but after you transcode, you could just run a no-reference metric and say, "Is this good enough? Is this where we wanna be?" Right?

Now, both Professor Bovik and us and a bunch of people including like Facebook and Twitter and teams in Bristol and teams in , who have been working on subjective quality and objective correlations, have faced this problem with UGC: the correlations that you get, with this understanding that a large number of people will give you closely aligned ratings on a video, that also breaks down.

So, with UGC what happens is, we found that people are, since narrative is not the primary intent here, and since if the video is not curated, people rate the video based on, how much they like the content of it, and they also rate it differently if they're emotionally attached to that content. Like, if I saw a dog video, I'm likely to rate it more because of a bias that I like dogs, or something like that, right? Versus a cat video for example, or vice versa, right? There's also the liking-ness of the content, right? Like, so if I'm a big gamer, I might rate a gaming video higher than a non-gaming video, for example, right?

So, subjective MOS scores are also a little bit tricky to get right. And so, there is this new concept of, can we open that up? Can we not fully subscribe to the ITU model, but can we go full crowdsourcing on quality scores? Enable viewers to look at the video the way they consume it, on mobile phones, in different environments; you don't need a fixed distance between the video that you're watching and your eyes, right? So, that distance breaks down when you're holding a mobile phone, for example. It really does not conform to the ITU standards that were defined for subjective quality before, and it collects a lot more data because it's crowdsourced, right? And so there is this new wave of crowdsource based subjective testing to get these mean opinion scores, which is a little bit noisier, but is found to be a better representation of what UGC users might feel, right? So.

Thomas: So, in this analysis phase when video is uploaded, you can run these metrics and you get some score out, but is there more to it than just good versus bad? Or, can you get qualitative information about what is wrong with this video and how you might fix it, or how you might not make it worse in some particular aspect?

[00:17:29 Qualitative information about problems with videos and how to fix them]

Balu: That's a great question, Thomas. I think more and more that sort of philosophy is being looked at, with machine learning models and with deep nets coming into play. This has become more of a thing where, a lot of content attributes are driven into the network layers. VMAF is a good example. UVQ, which our team developed at YouTube is another good example, where it's not just about the score anymore, it's also about the attributes, what the video went through, right?

So, the UGC quality metrics, for example, UVQ goes into three different nets. One is the content net, which understands the content philosophy a little bit better. Is it easy to compress? Is it difficult to compress? Does it have a lot of moving things associated with it? A gaming video for example has the player movements, but also score and map and other critical information in the corners with text floating around. So, all of this content is being absorbed; there is a net to train on just that content itself. Similarly with animation, similarly with sports, similarly with other film content and so on.

There is a secondary net called the distortion net, which really focuses on the distortions that happen after the video is recorded, right? So, these are user introduced distortions, through processing, through pre-processing or through effects creation pipelines where they want cinematic blur for example, or they want to introduce film grain for example, right? So, those sorts of things will introduce a certain amount of artifacts during processing. And then the third net is the compression net, which is really not just about a bunch of artifacts that are block based, but also the correlation and interdependency between these artifacts, between blocking and ringing, noise, whatnot, right?

So, these three different nets are indicative of how much of what has happened to the video, and give us some amount of data provenance there, as to what the video went through before it gets uploaded to us, right? And so, the part where, like I said, you mostly got it right, is this aspect of it. Going back a couple of questions, the part that I think we don't generally do is heavily pre-process that video before we transcode it. And the reason for that is creator intent. So, we want to keep the intent of the creator as much as we can.

Now, even with YouTube, a lot of like early YouTube videos suggested that like, people are educating themselves on how to do YouTube videos better, right? They had a decade, or so where people were really investing in high-end cameras, high-end lighting and microphones, and to create better quality content, than just recording it with your cell phone for example. And so, when they invest in this, when they understand this, when they educate themselves of how to make a good video, they become much better at it and they give us videos that are much better in quality.

So, the upload content becomes so much nicer to look at. And so, in those cases we generally rely on creator intent being the primary consideration. Even if you get a bad quality video, we quote unquote assume that it is intentionally driven by the creator to be that way. When somebody spent money buying apps and buying software to create a particular look that they want, and we go in and remove that, considering that as an artifact, that's not the best thing for the creator there, right? So, we generally try to keep the creator intent.

Now, in the last couple of years or so, what has happened with TikTok taking center stage and YouTube Shorts becoming more and more prevalent, with Facebook Reels and Instagram Reels and all of that, is that this short form content has become the way of modern consumption of video. And that breaks this down, right? So, the creator intent is very minimal there; most of the artifacts are unintentional. Like, they are probably running behind their dog, which goes from a really dark inside to bright sunlight on the outside instantly, and then your cameras are saturated. Your 3A algorithms are not working as well, right? So, suddenly you have these artifacts coming in that maybe the creator did not intend, and we should be aware that this is not intentionally created, right? So, Shorts is a good example where some of this creator intent stuff breaks down, but in general, we try and keep a lot of creator intent in there.

Thomas: So, do you do the same level of processing on every video, or is there some ranking based on popularity or number of downloads? Do you maybe revisit some videos later on, if they turn out to be really popular?

Balu: Yeah, that's a good point. I think we have some indicators of predicted popularity through all our friends in different parts of the YouTube world, right? With Discovery, with the creator channel folks and even partner managers, who tell us that this is something that the creators care about. This is something where we expect a lot of popularity, because this particular channel from the creator has been having a hundred thousand plus subscribers visiting every day. Where and when the creators are expected to perform well, the predicted popularity of the videos in those channels is high, right?

And we do process the videos based on predicted popularity, like most UGC and even VOD content providers like Netflix already do, right? So, we process based on predicted popularity, with what we call the dinosaur model, or the cockatoo model, where if you look at the plot, the head is where most of the processing happens. We generally reprocess those videos and ensure that we are delivering the best quality for the customer that we can, within the available restrictions around how much bandwidth they have to see this video, what resolution and quality they're preferring to watch and all of that, right? Are they watching it on a mobile device and whatnot? So, with those restrictions in mind, we want to create the best quality and so we end up processing those a little bit more. And by processing more, I don't mean to say that we work the video more. It's just that we understand the video better, so that when we transcode it, the parameters that we use to transcode are different from the way we transcode other videos. It essentially still goes through the same number of steps.

So, we don't necessarily work on it more, except that we provide more CPU to gain more quality on some of these high end videos, right? Then there is the middle trunk of it, which is essentially the body of the dinosaur, or cockatoo, where we generally try to keep the best quality in mind given the restrictions around how many people are watching, what the expected watch time is and all of that stuff. And then there is the long tail, which we generally process with the least amount of CPU. We end up doing quite a few different things there in terms of what is more beneficial for storing them and providing them to folks that are watching intermittently, and the tail is really long, really long, right?
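As a rough illustration of the head, body and tail split described above, the following Python sketch maps a predicted view count to a processing tier. The thresholds, the tier labels for CPU budget, and the function and class names are all hypothetical; the transcript only says that the head gets more CPU and the long tail gets the least.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EncodePlan:
    tier: str        # "head", "body" or "tail" of the popularity curve
    cpu_budget: str  # illustrative label, not a real encoder setting

def plan_for(predicted_views, head_threshold=1_000_000, tail_threshold=100):
    """Pick a processing tier from predicted popularity (thresholds are made up)."""
    if predicted_views >= head_threshold:
        # Head: most popular videos, given more CPU to gain more quality.
        return EncodePlan(tier="head", cpu_budget="high")
    if predicted_views >= tail_threshold:
        # Body: balance quality against expected audience and watch time.
        return EncodePlan(tier="body", cpu_budget="medium")
    # Long tail: rarely watched videos, processed with the least CPU.
    return EncodePlan(tier="tail", cpu_budget="low")

for views in (5_000_000, 20_000, 0):
    print(views, plan_for(views))
```

Every tier goes through the same steps in this sketch; only the budget label changes, which mirrors the point that popular videos are not "worked more", just given different parameters.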

So, a lot of videos have zero views on YouTube, and we have pretty much every video that's been uploaded since the beginning of YouTube, right? Sort of like an archive of the world's video, if you may. So, every video since the inception of YouTube is available on YouTube to play back, but you can go through them and some of the less popular ones have zero views. And if you think about it, zero views means the person who uploaded the video also did not watch it, right? So, there's a bunch of videos that are hardly ever watched and we have a different set of parameters to process those videos.

Zoe: Right, because you mentioned that for more viewed videos you allocate more CPU resources to process them. And then for the long tail, as you mentioned, we would expect that there should be some computation resources to save. So, you could be using some hardware based encoding, so that you have higher throughput, lower power consumption, and higher parallelism at the same time, in that case.

Balu: Yeah, so we try not to compromise on quality. So, compression parameters still focus on quality at that point, whoever it is serving, but we tend to focus on providing it lesser CPU, so it probably takes a little bit longer to process, right? We sometimes what we do is, we probably produce the resolution ladder that is most watched, rather than like the full resolution ladder. We tend to do internal optimizations on storage for example, which would essentially ensure that the videos are constantly available, but the less popular ones are available, not at the closest point to the user, but in the server somewhere, that could be then accessed pretty quickly.

So, there's a few different optimizations that go on, on how we treat the large long tail. But what's more interesting and what's become more challenging is the more, the head part of it, the more popular ones and ensuring that the quality that we get on that head, is maintained high.

And going back to the earlier discussions of UGC, when the upload quality is not so high and we make sure to keep the creator intent, there is only so much we can do about increasing the quality there. So, the upper bound of what we generate on quality is the uploaded video, right? The uploaded quality, for example, right? And so you have this trade off of, how do you make sure that the best quality is provided when the upload itself is low in quality? Especially on the head, right? What kind of resources can you attribute? Is it a problem that could be solved with resources? Is it a problem that could be addressed through a perceptual metric, reprocessing in a way that this metric is optimized, rather than PSNR being optimized, right? And so those kinds of trade offs happen, and that has essentially become a lot more interesting now, with Shorts becoming more and more common and YouTube TV becoming more and more common, which are like the two ends of the quality spectrum, right? YouTube TV gets content which is really pristine, really high quality, and Shorts is the other way, which is the lower part of the quality of uploads, basically, so.

Thomas: With this, do you think maybe the information you're gathering through metrics could be something that could even be fed back to creators, so that they upload something and it says, "Well, thanks for that, that's great, but we think there are these problems with your video. Perhaps you'd like to shoot it again, perhaps you'd like to process it again?" Is that something that you would consider that you could do?

[00:30:51 Intent to have creators chime in on their videos to have better quality version]

Balu: Yeah, that's an excellent question, Thomas. I think the intent is that. The intent is to have creators chime in on their videos, and figure out if there is a better quality version that they can upload, right? We generally skew away from asking them to re-shoot the videos, because that could be a very timely thing that happened. That could be something like on the go captures that they did, that they want to share. Sharing some intuitive understanding of quality, which hadn't existed till we introduced UVQ, was the intent, but we didn't want to share typical PSNR numbers, or SSIM numbers, with users, because that's not completely relevant for them.

And now that we have UVQ, now that we have a better understanding of both the content as well as the distortions that could happen during processing for them, we would be able to share some of that back to the creator, at least for the large creators in this head of the dinosaur, which we tend to work with more frequently, and enable some amount of this feedback based interactivity between the creator and us. And that's more and more happening with creators who are directly in line with us, the top end creators like MrBeast and others.

That is less so, with the still pretty successful, but like not as well known, especially like the most shorts creators these days have become more, quite a bit more successful with their like view counts and subscriptions, but they're not as well known to the community, as a typical YouTuber, right? So, that sort of a thing. And they don't feed in completely into the creator economy as such, right? So, there is I think there are, steps that we take to help them get to, how do you improve your quality, like how do you make it better? Typically we end up relying on understanding of their quality, right?

So, most of these discussions need to be in reference to something, right? So, the idea being if I was a creator and somebody told me, hey your quality is not good, my first question would be, as opposed to what? Right? Or, another question would be like, what do you mean quality, right? So, is it the signal quality? How I'm looking at it and how good it looks? Is it the narrative? Is it the story that I'm putting out? The quality of that video is not good? Is it not as revenue like ad making? Is that what you mean by quality? Right? So, these are different, like their understanding of quality, needs to be on the same page as us. So, we could provide something like, hey, as opposed to like your quality, signal quality, as opposed to some of the more successful creators here, or like some of your own successful videos having this.

Either that or, you know, this other notion of relying on this quality-as-resolution kind of philosophy, which most video players rely on these days, right? Which is not completely true, but then for novices that don't understand video quality well enough, you could go back and say, "Hey, I only see 480p, or 360p. Do you have a 720p, or a 1080p?" And so that essentially tells them, oh, they need higher quality, without actually telling them that we need higher quality, right? So, there are different ways of doing this. And so, that's what we tend to rely on.

Zoe: Yeah, I think it's great to hear that you are really focused, I mean, YouTube is really focused on the creator intent. And so, I have a question, because YouTube is handling such a great scale of video, and you do mention again and again that you consider the user experience, consider the creator's intentions, and then try to get a quality measure that correlates with different use cases and different scenarios. Now, with such a large scale, how are these correlations being identified and then modeled, to finally become, like you mentioned, a system that handles different likes and processes this set of content in an automatic way?

Balu: Yeah, so awesome question. I think that's very, very tricky to explain easily without a whiteboard. The very high level of it is that in general, we have the subjective scoring methodologies, right? So, when we want to come up with something that is largely scalable, you need an objective metric. And the objective metric is only so useful if it doesn't correlate with your perceptual subjective scores, right? And to create the subjective scores, what has traditionally worked has not worked for UGC.

So, we have new ways of creating subjective scores. We have done those, we have done golden eye methods, we have done methods where we take understanding from both professional video viewers, or professional video creators for example, as well as novices who don't understand quality but also don't really care about quality that much, right? So, we've done those analyses, and then we've done in-lab testing, we've done creator focused testing, we've done surveys and we've done crowdsourced testing, right? And we've correlated them with not just us, right? We've worked with Professor Bovik. We've worked with Professor Akali, Professor Ortega, like a bunch of people in academia as well, not just the industry, to ensure that some of these understandings are rooted correctly in that subjective sense.

And then once we have that, once we have the ground truth data, once we have the dataset that goes with it. We ourselves launched a UGC dataset in 2019 that's been pretty widely used. And we added ground truth data to that the following year. And in more recent years we've also added labels for any metrics that are using deep nets, which Thomas' question alluded to earlier.

So, it now has features, sorry, labels for these features. It now has ground truth data from subjective experiments and these cross correlations and it also has the dataset itself, which is like unprocessed raw YouTube videos, right? What we got uploaded, not what we processed, right? And so that dataset has been pretty useful for a lot of folks, and like others are taking that as an example and doing more and more UGC data sets, as much as they can, at both smaller scales and also larger scales right now.

With that dataset creation and with that ground truth data, we got like a good enough understanding and insights into how viewers like to enjoy UGC content and how different that is from professional content. Now having that understanding, like you said, having creators' understanding as well, correlating that and then finding something that is metric driven, right?

Because this can't scale with the scale of our operations. Having that be metric driven, we tried and tried and tried some more and, multiple iterations later, came up with the UVQ models, which essentially started showing us some really good results. And with some amount of proper training of those models and proper training of the three independent nets that are in there, we got some really good results.

And at that time when we actually did the stress tests on some of these, for every category of video content that we tested, and for the large variety of, like, compression parameters that we tested inside each category of content like film, or animation, or gaming, or cartoons, or whatnot. Within each category, UVQ was doing better than any other metric in that category, right? Like so at least as good as the best metric in that category and in a lot of cases better than that metric, right? So, and each category had a different best metric surprisingly, although there was some commonality in that we found that VMAF was like a pretty good, top level metric in a bunch of these. It was at least doing as good as VMAF at the time that we tested and now we are seeing that, that could be further improved on and we are seeing much more correlation to subjective scores.

So, to give you a rough idea, before UVQ, most of the existing metrics were at about 0.7 to 0.8 correlations and with UGC, they fell down to like 0.55s and 0.6s, right? Even PSNR is in the 0.49 category, right? SSIM and VMAF also didn't work so well, but like VMAF was one of the better ones. NIQE was doing well, TLVQM was doing well. RAPIQUE, which we worked on with Professor Bovik's team, RAPIQUE was starting to perform well, right?

But getting good correlations out of it, is extremely difficult. And so with UGC, if you consider the average across all categories, I mean independent of how some of the categories are a lot more used than other categories, right? Like, so gaming, we see a lot more uploads than other categories, for example. So, independent of K-means distribution, if you take uniform distribution and take the average across these, we found that like the correlations with UVQ, have been in the higher 0.7's and lower 0.8 range, which has been the best that we could and some of the improvements that we are planning to roll out further, are going to improve that further higher, right? As we do these.
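
The uniform averaging across categories described here can be sketched in a few lines. This is only an illustration, not YouTube's evaluation code: the function names, the category names, and the toy numbers are all made up for the example, and it uses a plain Pearson correlation where a real study would typically also report rank correlations.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson linear correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def uniform_category_average(scores_by_category):
    """Correlate metric scores with subjective scores per category,
    then average with equal weight per category (ignoring how many
    uploads each category actually gets)."""
    per_category = {
        name: pearson(metric_scores, subjective_scores)
        for name, (metric_scores, subjective_scores) in scores_by_category.items()
    }
    macro_average = sum(per_category.values()) / len(per_category)
    return per_category, macro_average

# Hypothetical toy data: (objective metric scores, subjective scores).
toy = {
    "gaming": ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),
    "animation": ([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]),
}
per_category, macro_average = uniform_category_average(toy)
```

Weighting every category equally keeps a heavily uploaded category such as gaming from dominating the headline number.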

So, and again some of these numbers are like already pre-pandemic at this point, but there's a lot more going on. Folks have now taken UGC data sets, primarily trained models and now coming up with some good metrics and thankfully so, we need those. We all need those metrics including, ourselves, Facebook, Twitter, all the people who play in the UGC world, but also all the people who play in the non UGC world, because I was talking to folks at Netflix and they're saying they're seeing a lot more, free willy cameras user, mobile phone captured content, in their documentaries increase a lot more these days, right?

So, pretty much I think that the whole industry will benefit from some of these metrics, and once we get those correlations, we actively try and apply the objective metrics at scale. I do have to clarify though, I think Thomas brought this up earlier. There is the algorithmic development and treatment of UGC and the algorithmic improvements to video and audio. There is the scale part of it, which is a sibling team that does that, which is a video processing platforms team, which is responsible for ensuring that these transcoder operations are performed well in relation to memory and CPU and which part of it goes to hardware encodes, which part of it goes to software and all of that.

So, that's really like handled by the scale of it, the massive processing of it is handled by sister teams, still within the same video infrastructure, but we are in charge of the development of algorithms and ensuring that we get the more optimal performance for UGC.

Zoe: Yeah. Well thanks for sharing this. I have to mention one thing that the YouTube UGC database, the dataset is really beneficial to the community and actually because it has been like Rev is open source, and then they already classified and it has labels in terms of metrics and then I believe not only us, a lot of people really benefit from that, both in industry and academia. So, that's really greatly appreciated.

Thomas: I wanted to ask you a question. Sorry, carry on.

Balu: Just a closing thought on that. UVQ now is open source Zoe, so everybody can try it out. We have like all the models that we have trained and all of that presented as well. So, you have the UVQ to try out completely. Like anybody can try it out right now and use it specifically on user generated content. And we want to do the right thing for UGC, even in the compression field. So, we have proposals with AOM and others to incorporate some amount of UGC content, as they're building newer and better compression algorithms in the next stages of codec life cycles, that they incorporate some amount of UGC in at least their test models, so that they don't completely focus on one aspect of the video, because really in the last decade or so, I think we've seen like user generated content grow, right? So it's moving.

Zoe: I may have one before, like time goes fast, but then I'll let Thomas before, but wait, I have one question because you do mention that YouTube short is quite being handled a little differently, at least the creator intent is handled differently. So, I just wonder for YouTube short, 'cause that's a new genre I believe, this is the most, at least for the recent years, you did mention not only the metrics, but also you have three nets for the conventional YouTube video. So, you mentioned there's a content net and a distortion net, followed by the compression net. I just wonder, what specialty you adopted while handling the YouTube short? And I know time goes fast, but this is very new things, not relatively new, but YouTube actually is getting, I think you have the data getting popular.

[00:46:51 Specialty adopted to handle YouTube short]

Balu: Yes. Yeah, quite a bit actually. And you can see like the numbers are there on YouTube pages, for everybody to access. I think the more interesting point here is that Shorts is not just something that is different in terms of how we process, but it's also different in terms of how the creators and viewers look at it, right? Creators, the way they engage with it and viewers, the way they view it. It's one of those like completely mobile first platforms, right? Which is essentially, vertical video is the way to go, from both creation and consumption, which is slightly different than traditional horizontal video. You know, your 16 by nine videos, your three by four videos. They've had a particular set of notions like, you have the rule of thirds, when you're generating the video, the traditional video. People have certain norms on creating the best positions for, if two people are talking and you're recording that, there is a way forward how to best approach the most viewability of that, right?

And, those norms break down in vertical video, especially when you're converting a horizontal video to a vertical video, that gets even trickier. But, if you were to approach two people talking and having a conversation, the idea is that like you can't have both of them in the same shot. So, you have one on top and one on bottom, right? So, that is becoming a little bit more common with the vertical video sort of a thing. And also, when it comes to subjective understanding of the video, the positions and the places that a person's eye looks at are different in your horizontal TV style viewing, versus your vertical mobile phone style viewing, right?

So, where people focus, where they concentrate, those are the areas that need to be the right quality in terms of how you measure it and how you understand that quality, right? And so, as we are looking at these videos, that are purely vertical generated and vertical consumed, the way we process them becomes slightly different. Your traditional optimizations within the parameters for the content are different, because they're processed according to where you have these, what are they called? The matrices, right? The quantization matrices and triangles and whatnot, right? So, those things are slightly different.

The other way this is different is for creators now, it's not just the creator intent changes slightly, it's not just about like the signal quality, but it's more feature quality, right? So, a lot of times, two things are happening very heavily. One is text and emojis and overlays have become so much more than traditional video. My reactions could be an exploding fireball, or something like that. My reactions could be a smiley face, or a crab, right? My reaction to a particular shot, like the shot itself when you look at a video, has a lot of overlays on top of it, including text, right? And so, now it's becoming more of a problem of how do you, do you have the ability, to treat individual layers differently? And if not, if all of those mixed together when the users uploaded, when the creators uploaded, then do we have a way to better compress those?

Because now you suddenly have your screenshot style videos, as if you had a laptop and you did screen scraping a lot, your desktop where there is a combination of pictures as well as text quite a bit. And so your high frequency data increases substantially with these overlays. And then the second problem here is that a lot of it has also become reactionary, right? So, there is usually, if you consider a one-minute Short for example, there's usually a 30-second YouTube clip that is being played in the back, and a person sitting in front reacts to that video, exclaims and pauses, or like talks over it or says, "Yeah" and then there's other things happening there, right?

So, that reaction becomes more and more relevant now. And so, where we have these combinations, then how do you mix them well? How do you ensure that like the right transitions are happening correctly? All of this becomes relevant. And so, compressing those, understanding the quality of those and compressing those, has become slightly different from even regular UGC. And so, creator intent is still, the high quality creator intent is still kept, where you do want to create good text in those Shorts. You do want to create nice looking overlays and effects for these kinds of Shorts. But at the same time you need to keep the quality of the traditional content really high too, and so that's where these differences come in. Predominantly processing vertical videos, predominantly processing, like how do you enable quality understandings with all of these additional components that are not purely video signal, right? So.

Zoe: Yeah, a great uniquely stand there, with this vertical agencies and mainly consumed by mobile phones. I think Thomas, you have something? 'Cause I know times went by, but actually I noticed that it's really, we got a lot of content. I still actually have a couple of questions I want to address if your time allows?

Balu: Yeah, I can go for another 15, 20 minutes, yeah.

Zoe: Oh, okay, great. 'Cause I was thinking that, because usually our episode is long last summer, between 30 and 40, we actually cut the video into two. We pick out-

Balu: Oh no please, edit it well so. Okay. I'm perfectly fine if you cut this down to 10 minutes. That's fine.

Zoe: No worries. Go ahead Thomas.

Thomas: Yeah, so I was thinking of, well I wanted to ask you, you've got all these sophisticated metrics that correlate increasingly well, with subjective quality as judged, by your expert panels, by your crowdsourcing and whatever. How do you close the loop? Do you see that when you apply these algorithms, that they actually drive greater engagement with the video and viewer numbers go up, or pleasure goes up in those videos? How do you measure the whole package together?

Balu: Yeah, that's wonderful. And that's stepping into a territory that I can't unfortunately talk much about, which is internal metrics on how we measure certain things. User engagement as pretty much everybody can know, or already do know because of, I think it's been public for a while, has had different ways of doing this, right? Including using our own platform to serve surveys and stuff. So, when you watch YouTube once in a while instead of an ad, you get a survey on how good the quality of the videos has been, right? So, we have been doing some amount of that and we see those numbers going up, right? With the quality, the satisfaction levels go up.

We've seen in general engaged users and the way, we look at like how much people are viewing contents that are higher quality, or improved quality from our situations go up. We've also actively talked to the creators who are engaging with these and the creator happiness are, dissatisfaction has gone down, right? So, the creator happiness has gone up and the number of problems, slash complaints with, "Hey could you do this slight change, "or could you improve the one minute here in between, "that doesn't look so good?" Those sort of things have gone away, right? And so, it's not just on the higher end of things being better and increasing, it's also on the lower end, where we're fixing a lot of issues that might happen, that could or usually ended up happening before, some of these correlations and some of these good quality metrics were put in place.

Now, the tricky part is, if you want this really done correctly, you want that metric to be like pervasive basically, across the board. You want it in our pipelines, which I think it's all in there. You want it in the compression engine as well. You want the RD optimizations to be using perceptual oriented metrics, rather than pixel oriented metrics. You want some of the feed, within the user's view, like recommendations, all of that being correctly identified with this metric, right? So, that sort of a thing has to happen completely. And we are trying to get there, I think with the combination of like us, Facebook and a few other companies in that AOM consortium, we're trying to incorporate some amount of perception-based metrics there.

We are trying to make sure that the compression engine not just looks at pixel to pixel changes, although rightfully so, it has done that correctly for so long, right? Because you have a good video to start with, you want to get to that good quality as much as you can. But we are trying to incorporate this idea into the codec world, that your reference is not always good quality. And so, how do you do a good compression? How do you save your bits when you're trying to recreate artifacts? That is something that is very tricky to do. And so far the question has been, if we propose that we do this, do we have a good metric that we can stand by? And until UVQ came along, that's been more of the concern. And now that we have UVQ, I think this is something that we could get more into, which is to drive some of the compression RD optimizations using the perception metrics correct.
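
To make the RD optimization point concrete: an encoder typically picks, among candidate encodes, the one minimizing a cost J = D + lambda * R, where D is distortion and R is the bit cost. The sketch below is a toy illustration only. The candidate numbers, the `perceptual_loss` field, and the lambda values are hypothetical; it does not call UVQ or any real encoder, and lambda has to be rescaled whenever the distortion measure changes.

```python
def pick_encode(candidates, lam, distortion_key):
    """Return the candidate minimizing J = D + lam * R.

    candidates: list of dicts with a 'bits' field (R) and one or more
    distortion fields; distortion_key selects which one plays the role of D.
    """
    return min(candidates, key=lambda c: c[distortion_key] + lam * c["bits"])

# Hypothetical candidates for one block of video. 'mse' stands in for a
# pixel-oriented distortion, 'perceptual_loss' for something like
# (1 - normalized perceptual score). All numbers are invented.
candidates = [
    {"name": "A", "bits": 1000, "mse": 10.0, "perceptual_loss": 0.30},
    {"name": "B", "bits": 600, "mse": 14.0, "perceptual_loss": 0.31},
]

# Pixel-oriented decision: the extra bits of A are "worth it" for lower MSE.
pixel_choice = pick_encode(candidates, lam=0.005, distortion_key="mse")

# Perception-oriented decision: viewers would barely notice the difference,
# so the cheaper encode B wins.
perceptual_choice = pick_encode(candidates, lam=0.0001, distortion_key="perceptual_loss")
```

The two decisions diverge exactly where the pixel difference is large but the perceived difference is small, which is the case where bits can be saved.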

Zoe: Yeah, so we really want to respect your time, but I just because we have been really talk about videos and then there's also audio part and so just wondering, with such a large scale of YouTube and then with a different regular YouTube videos, plus YouTube short and then what kind of, I would say, overall challenges then, or main tasks you have accomplished on audio part?

[00:58:14 Challenges on audio part]

Balu: Yeah, audio is also like a pretty big challenge. I think the challenges with audio are slightly different than the challenges with video. One of the things that is different is like the quality is more closely tied to bit rate in audio. So, most people think of like, 256 kbps audio as a particular quality, as opposed to 128 kbps. Now, with UGC we have done enough tests with VBR, the variable bitrate versions, to find out that, audio-wise we can create the same perceptual quality with 220 kbps for example, as a 256 kbps one. But when you provide 220 kbps to somebody, they think internally it's lesser quality than 256 kbps, right? So, it is very difficult to break that assumption. And it's so much more closely tied, and most of their quality numbers are dictated by bit rates in audio. And so, the challenges are not to generate those audios. The challenges have been more to do with, how do you deal with formats? How do you deal with format conversions?

And so spatial audio is getting bigger and bigger, right? Unfortunately, some of the open source initiatives that have been successful on the video front have not been as successful on the audio front. And we do rely on some of the proprietary methods there and audio is a lot more proprietary, especially if you think of like, Atmos solutions from Dolby, or something like that. Or even Fraunhofer's great suite of solutions. Most of these are proprietary and most of these are intended for a particular set of use cases.

And so, getting UGC in there and ensuring that it works well with UGC is a little bit trickier in audio. The format conversions also, I mean we do a pretty good job up until 5.1, right? To stereo and mono conversions, the spatial audio. We do a decent job with third order ambisonics, with the open source side of things, but it is fortunately, or unfortunately, one of those things where we end up relying pretty heavily on close form solutions. Like if we really want to go the Atmos route, and how does it fit well with that? There are components within the YouTube space where we cater to Atmos, especially the high end creators, most movie studios for example, they do use what we call premium partner content, or high value content, HVC.

Most of these, a lot of these premium partners that we work with, especially Hollywood studios, we end up working with like all formats. We don't have any sort of restriction towards open versus non-open. But having said that, audio has been tricky in terms of quality evaluations. Like we don't have like a really good perceptual quality metric for UGC. There are like PEAQ and PVQ and ViSQOL, which do a pretty good job with perceptual audio in general, but that's for well curated videos. And when it comes to UGC, again, similar problems happen with the audio as well: clipping, saturation, when there is a loud noise as the camera is recording, when somebody hits the camera, the noise is pretty jarring, so it's very difficult to get the right quality on that.

We do have a good set of algorithms that we try and include in there. Especially like things like dynamic range compression, things like ensuring that audio loudness, is equalized across the videos, or the playlists that you have in your videos. But until we have somewhat similar to UVQ on the audio side, so until we have a better quality understanding there, I think it's difficult to transition, a really good experience to something that is more metric driven. Because perceived experience is substantially, different in the audio space.
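
As a rough illustration of equalizing loudness across a playlist, the sketch below scales each track to a common level. It is a simplification under stated assumptions: it uses plain RMS level in dBFS rather than a perceptual loudness measure (production systems generally use something like the ITU-R BS.1770 loudness model), the -20 dBFS target is arbitrary, tracks are assumed to be non-empty lists of float samples in [-1, 1], and none of this is YouTube's actual pipeline.

```python
from math import log10, sqrt

def rms_dbfs(samples):
    """RMS level of float samples in dB relative to full scale."""
    rms = sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * log10(rms) if rms > 0 else float("-inf")

def normalize_to(samples, target_dbfs):
    """Apply one constant gain so the track's RMS level hits the target."""
    level = rms_dbfs(samples)
    if level == float("-inf"):
        return list(samples)  # pure silence: nothing to scale
    gain = 10 ** ((target_dbfs - level) / 20)
    return [s * gain for s in samples]

def equalize_playlist(tracks, target_dbfs=-20.0):
    """Bring every track in a playlist to the same RMS level."""
    return [normalize_to(track, target_dbfs) for track in tracks]

# Hypothetical playlist: one quiet track, one loud track.
quiet = [0.05, -0.05] * 4
loud = [0.5, -0.5] * 4
equalized = equalize_playlist([quiet, loud])
```

A single constant gain per track leaves the dynamics inside the track untouched; dynamic range compression, also mentioned above, is a separate step that this sketch does not attempt.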

And, I think there's been studies like early 70s or mid 70s, in Hollywood, which showed that audio is at least 50% of the experience. A good example there was, if I had good quality audio and a really bad quality blurry video in the background, I'd probably still see and understand what's going on. If I had a very good quality video and audio is choppy or not there, that's a much substantially lower experience for me, while watching that video. And so, that is critical right? And that plays a pretty big role, if not the same, that the Hollywood folks found in the 70s, if it's not the same percentage, it's a little bit higher with UGC I would say. And audio becomes pretty critical. And there is a short side of audio, which is essentially, you do pretty heavy mixing as well.

One of the biggest challenges there is, on the vertical video and short form content, you have audio and music and others coming in from different labels and artists, that you need to mix and match and figure out like what the right content is for the narrative that you want, right? For creators, and so providing that as a mechanism for like folks on the Shorts creation side becomes really critical. There are still problems I think, not that there are not problems, and substantially some of these problems could be addressed if you have better understanding of audio quality. They could be addressed if you have better understanding of remixes done well, and more to come there basically.

Zoe: Okay. So, more to come down there and then we can put a closure for today's interview. We really appreciate all the time, all the details you shared, because behind the large video UGC sharing, plus the vertical YouTube short, there's a lot of technologies driving the user experiences to happen to the best. And then I also, I think we also learned about the concept, the strong concept of the creator intent. So, we really thank you for your time Balu. And then we also thank everyone for paying attention to this episode.

Balu: Thank you so much Zoe.

Thomas: Thank you so much.

Balu: Thanks for the opportunity, yeah, bye.
