Welcome to The Video Verse podcast, the show where we discuss everything related to video. On this episode we'll be talking about AI and machine learning and their effects on video compression with our special guest Ramzi Khsib, a Principal Software Engineer at AWS Elemental.
Ramzi: Hello, I am Ramzi Khsib. I'm a Principal Software Engineer with AWS Elemental. I've been with AWS Elemental for over six years, and I've been working on video compression for the last 18 years. I mainly focus on video quality, any video compression algorithm out there. And my job is to make pictures look good. I spend my day looking at video in a silent form, always on mute, and I know every person in CrowdRun, probably by nickname.
Mark: That's great, I love it. Ramzi, welcome to Inside The Video Verse. You know, only a true codec engineer can say that I watch video eight hours a day with no audio. People say, what are you talking about? Must be such a boring job. Oh, wow.
Ramzi: The interesting part is you catch yourself at night after work watching video without audio, and that is, that is wonderful. Like, you try to read lips, I guess. It's our trade. Anyone in video compression will have this.
Mark: That's right, yeah, that's right. Well, well that's great. Well, really, really awesome to have you on the show. Thank you for coming on. And, you know, when we were discussing what to talk about, I think immediately you said, let's talk about AI machine learning. And it's, of course, a super hot topic in any industry now, but especially in video compression. So, you know, let's start just with a general, you know, maybe a general, almost like an explanation. One of the things I do find is that, you know, we just had NAB, for example, and you walk down the aisles on the show floor, and there isn't a booth, regardless of what they sell or what they do, that doesn't have AI or ML, or you know, something related to that as a feature of their product. So I think a good place to start is, you know, what is the difference when you hear artificial intelligence, machine learning, deep learning, you know, define it for us in the context of video compression.
(02:57 AI is like the umbrella term)
Ramzi: Sounds good, I think I will start with AI, because that's the broader term. And for AI in general, the definition might vary. It is a term used to classify machines that try to mimic human behavior, mimic meaning learning or simulating any human task. So AI is like the umbrella term for anything where a machine will take over and do a human task. Under AI, there is machine learning, and machine learning is a subset of AI. And what machine learning does is try to learn from past data. And it uses massive data. It uses training, it uses learning, and the objective is to predict the future or to make decisions.
A smaller subset inside machine learning is deep learning. And machine learning has been around for decades. It's not something new. What is new is deep learning. And deep learning is tightly coupled with the idea of neural networks. So with the advent of neural networks, and AlexNet in 2012, which showed a great achievement by neural networks, deep neural networks are now kind of the hype of the moment. So people use the words deep learning, machine learning, AI, interchangeably. I think we are in the machine learning phase. I think AI is probably a couple of decades out. But I think we are in the era of machine learning.
Mark: Yeah. You know, there's a lot of really great scientific and academic work being done, especially in the area of video compression, and with machine learning, deep learning, you know, AI as you've outlined, there is no silver bullet. You know, I think you actually said, "There is no free lunch." So I would love, you know, for us to spend some time talking about how, let's start here. How does one separate, I'll call it the hype from the reality, but what I really mean by hype is like the academic research, you know, which is a little bit hyped, because often it really can't be deployed commercially for any number of reasons, you know, it can just be, it requires way too much compute than is feasible. Maybe the bang for the buck is actually not there, so for what it costs, you know, to actually execute the algorithms, there's other ways, other more efficient ways to get there, even though it may be less sexy. But let's start there. You know, what do you think about when you look at this, and separating kind of that hype from the reality of what can be used in commercial use?
Ramzi: I think you hit quite an essential point in video compression, and machine learning, and the hype around machine learning. I'll focus my answer on video compression, because audio is different. NLP, natural language processing, is different. And typically video is kind of the trailing industry. It usually starts with image, and then you see the repercussion, or the fruits of that research in video.
In video we typically use models, like we use thresholds, we use heuristics, we use these models, and these models have been around for quite some time. The problem with these models is they're not adaptable, and video covers a wide spectrum of use cases. And with the COVID, the spectrum got even more exponential, and you have all formats, all resolutions, and even vertical video. Who would believe that vertical video is a thing? Apparently it is today.
So machine learning came in for the adaptability: making these models we use in video compression, these thresholds, these heuristics, adapt to content, adapt to scenarios, adapt to use cases. The problem with machine learning, and you stated that problem, is compute. Deep learning coupled with video compression, especially if you wanna do real time processing with a deep neural network, is expensive, both on memory and on CPU, or companion GPU, or any hardware. So typically right now it's hard to see these deep neural networks applied to video compression. So that's the hype at the moment. It might change in a few years. It might change with new hardware, with quantization of the neural network.
But what's happening today, what we have today, the reality today is we're using machine learning, we are using classical, I would say, machine learning, and things like VMAF. VMAF is built upon machine learning. A lot of non-reference metrics are built upon machine learning and those are running real time. So the difference between the hype and the reality is whether it is in production or still in research and academia. And I think that's the fine line between what we think is usable, as of today, because I'm pretty sure things will evolve in the direction of wider use.
Mark: Yeah, those are great points. You know, one of the questions that I have, and I'm sure our audience does as well, is it really just an issue of compute needs to catch up to a point that these algorithms can be run fast enough or efficiently enough? In other words, are we just waiting for faster processors, you know, more cores? Or is there something else that needs to happen? Can you explain that a little bit more about what we're really waiting for to be able to cross over from, like we said, hype or academia to actual real world use, viable real world use?
Ramzi: I can think of two examples. I will take the example of deep learning-based video compression. So this system is end to end, so you use a deep neural network from end to end. The idea is quite simple: we're gonna replace the entire video compression pipeline with these compute-intensive deep neural networks. And the question is, you have two dimensions, or actually you have three dimensions. You have the compression efficiency, and you have the typical rate distortion, and you have the compute.
So for deep learning-based video compression, you have to beat, in terms of compression efficiency, what we have as classic non-machine learning, or partial machine learning methods. So you expect a machine learning-based codec to beat VVC, AV1. That's the first target. And you have to beat those codecs by a reasonable margin to justify the compute. If you are at par, or below, or probably a couple of percent better, I don't think it's viable. That's one aspect.
So you have, in terms of bitrate, we love to call bitrate. I think, we always have this discussion with my peers: for a deep learning-based video codec, how much bitrate saving do you expect it to deliver to sell it? We typically see from generation to generation of codec, like we love to say, oh, 50%, or 33%, and there is debate around that. So I think a machine learning-based codec has to deliver the 33 to 50% bitrate improvement.
Mark: Yeah, yeah. It's super interesting, you know? And as you're working with your colleagues, especially those who are really spending time in the scientific community and academic, do you think or are you finding that they really factor in at all how deployable what they're building is? Or do they just sort of, are they, I don't wanna say stuck in, because I'm not saying in a derogatory way, but you know, let's face it, when you're doing research, you're of a very different mindset. You work for AWS Elemental, you know, you operate, or at least the platform, operates video encoding for some of the largest services in the world, right? Obviously there's things you would love to do as a video coding practitioner that you can't, because they're not feasible for the reasons you just stated. Do you think the scientific community, the academic community gets that? Or are they coming around to that? Or are they still sort of stuck in theory?
(14:48 Training cost)
Ramzi: I think there are two paths of research. So unfortunately the academic research is completely decoupled from the reality of what is deployable in production. And one of the things I think we always tend to forget is we always factor in the compute cost, the hardware cost, but we kinda forget the training cost. It does cost money. And I always go to the CVPR challenges, like the CLIC challenge, the NTIRE challenge. And nothing in the training is factored in. So you have this beautiful model. If it took two years to train, I think I personally, and my peers, wouldn't wait two years. I'm kinda exaggerating, just to make a point.
Mark: But your point stands though, yeah, exactly.
Ramzi: And I think that research is completely decoupled. The other research area, where I'm seeing more reasonable effort, is the work done in AV2 and JVET and NBC. And they try to incorporate machine learning techniques in the standardization process. As of today, I think there are a couple of trials in JVET. JVET has more mileage. AV2 is catching up.
They formed a focus group, number four, and the discussion always turns to a couple of questions. And I think these questions are essential: what to standardize, how to use it, how much compute, and what benefit, like what percentage of bitrate it has to bring? I think in one of the meetings there was this statement from a fellow from Google, I didn't catch his name, and they found out that a small neural network, like MobileNet V1, with around two million parameters, which is fairly small, this is nothing compared with the typical large networks, would take more silicon area than an entire AV1 hardware decoder.
Like this is the basic unit. Like I like to put it, if you do some classification, MobileNet V1 will get you 70%, 75%, and that will be equivalent, in terms of silicon, to AV1. That is the struggle and that is the limitation of deep neural networks being used in that. On the other hand, like I said earlier, you can still use machine learning, even in real time. The way VMAF and non-reference metrics use machine learning, I think, is the first step to incorporate machine learning in video compression.
I think deep neural networks will come to video compression one day, but right now I think it's hard to justify. And the results, just to put numbers on it again: JVET held a meeting last month and they found that some of the tools would get 11%. And I think that's still too little for the compute.
Mark: Yeah, yeah, yeah. Not enough bang for the buck, you know? I wanna talk about VMAF, but we'll circle back around to it. One of the ideas that I have, I was gonna say theories, but it's not a theory, is that, you know, we are marching on this cadence of every, well it used to be 10 years, that a new video standard would be released. So you think about MPEG2, and then, you know, H264 2003 and then 2013 HEVC, and you know, actually it compressed a little bit with VVC. But with each new standard is like a 10x, or a 20x of complexity.
And obviously, you know, those that are optimizing the codecs, you know, are getting to work very fast. I mean, you look at how quickly AV1 went from being, you know, a laughing stock, you know, like a minute a frame, you know, to now actually in some settings as fast as x264, and still with bitrate savings, you know, so still with the benefits of the codec standard. So this is improving.
But my question to you is, one of the ideas that I have is, that at what point do we need to shift our innovation from developing the next standard to developing more efficient machine learning models and advancing, you know, deep neural networks to get that increased efficiency from the existing standards?
And the obvious advantage that I, because I live largely in the world of, well, largely, exclusively, I'm working with companies, and around platforms and services that are delivering content to millions and tens of millions, and in some cases hundreds of millions of users. And so you always have to think about, you know, how big is my playback ecosystem? Do my customers have devices that can even support this? You know, this new format, this new standard. And it seems to me that we're at a place now where, you know, now HEVC is becoming almost ubiquitous, very quickly anyway, and especially with the upgrade cycles, at least in North America, and I think Europe and many parts of the world. People are on a current generation iPhone, or maybe just a one or two old generation, which all support HEVC.
Do we really need to move, you know, to VVC or some other standard? Or should we be investing, just as an example, in more efficient ways to, you know, to squeeze bits out of HEVC? And this is my question to you, and I'm wondering if you have thoughts on this, if you thought about this very same thing. Because it has the advantage that if I use machine learning, it still is an HEVC compliant bitstream, it's just 30% smaller, you know, or some percent smaller.
(22:02 Build rate control)
Ramzi: I think that relates to the effort that I, with my team, spent for the last three years. And one of the questions we always had is how to use machine learning as a powerful tool, and also trying to solve problems that our customers are facing. And the question is, should we move to a new codec, or should we stay on the same codec? I'll answer it like this. There are features in any codec that you have to build. You have to build rate control no matter whether you have MPEG2 or VVC, you have to build rate control.
So if your rate control is not accurate, is not robust, I think it's not ready for the real world, it's not ready for primetime. So, I think, instead of thinking about the codec, there are building blocks inside the encoder that, no matter what codec, have to be, or could be, codec agnostic. So what I did three years ago is we have rate control, and instead of having versions of rate control for MPEG2, for VVC, HEVC, you name it, let's unify it, let's use this powerful tool that can predict the future. And it was not a free lunch, trust me. Like it took months trying to figure out. But there are a lot of lessons learned there.
So I think one of the lessons, one of the lessons learned in this effort, first of all knowledge in the machine learning, you have to understand how these models work, you have to understand, that is the first lesson. Second lesson is design. Driving a sports car does not make you a great driver. You have to learn how to drive. So that's the second lesson. And I think when you keep in mind that this model has to run real time, it's part of the design. It's not a requirement ad-hoc.
So to answer your question, I think having a new codec is great, probably is future proofing our industry. But we have questions, even with MPEG2, we have questions of rate control, ABR, a lot of interesting questions, motion search, adaptive quantization, perceptual optimization, all these themes could benefit from machine learning. And I think the effort of research has to go in parallel with a future codec. A future codec, I think, is essential, and we might need it, like we needed HEVC when we moved to HDR and 4K.
We might need the AV2 in five, 10 years. But at this point people are watching Super Bowl, so people wanna watch Super Bowl in 4K, and we need to solve that. And nobody will wait, or like the streaming when the quality drops. Nobody likes that. So we need to solve these problems, and like I said, some components are codec agnostic, and you have to address that no matter how advanced the codec is.
Mark: Yeah, interesting. And you touch on this, but you know, I think it's worth even highlighting a bit further is, is that the answer, this is my most favorite, you know, anyone who knows me knows me, know that I, you know, it's a little bit of a joking statement, but it's very true. You know, whenever a video practitioner is asked a question, the answer always is safe to say, it depends. What are you trying to do, you know? So the point though here, in all seriousness, is that, you know, what you point out is that there is an application element.
So, you know, are we talking about a live encoding workflow? You know, is this a, just earlier today I interviewed someone and we were talking only about WebRTC, you know, about RTC applications, and we were talking specifically about how the way that you think about optimizing an encoder and making coding optimization decisions in RTC is very different than for VOD. So that's, you know, one of the points to highlight there. Now Elemental, I believe in all of your encoders, supports a QBR function, right, which is sort of a content adaptive. It's QBR, is that the?
Ramzi: QVBR.
Mark: QVBR, that's right, QVBR. So can you comment, are you using machine learning in QVBR? And can you give any insights into, you know, how that system works?
(27:11 How QVBR system works)
Ramzi: Sounds good. So the idea of QVBR is we identified cases where our old rate control was wasting bits. On the other hand, we identified that not all content is equal perceptually, and not all content needs all those bits, and some content is not getting enough bits. And it comes down to predicting the perceptual value, or the perceptual importance, of that segment, or video, or group of frames. So at the heart of it, it's a rate control function. And what's different from QVBR, to CBR, to VBR, is what the constraints are. The modeling inside it is still QP versus bits. And that is based on rate control, so that's a predictive function, with the feedback loop and all of that.
So we did apply a lot of machine learning inside our rate control. So what machine learning brings into the rate control world is the high adaptability to all scenarios, all contents. So what changed from one function to another is retraining with the new scenario. So that is what we applied in the QVBR. But of course there are some secret sauces in there.
But the global idea is perceptually not all content is the same, and you have to adapt to content. And this is in the trend of content adaptivity. I think Elemental's not the only one. And we won an Emmy, actually, for this, and with many, many other fellows. And I'm so happy to see my fellow, a video compression artist, win the same Emmy. And the idea is, like I said, is you have to adapt to content, and it's a function where you take the perceptual importance and you have to determine the number of bits.
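The QP-versus-bits model with a feedback loop that Ramzi describes can be illustrated with a toy sketch in Python. This is not Elemental's QVBR: the class name ToyRateControl, the rule of thumb that bits roughly halve every 6 QP steps, and the smoothing update are all assumptions made for illustration. In a machine learning-based design, a trained predictor would presumably take the place of the simple running complexity estimate used here.

```python
import math


class ToyRateControl:
    """Toy QP-versus-bits model with a feedback loop (hypothetical, illustrative only)."""

    def __init__(self, complexity=2.0e6, smoothing=0.5, qp_min=10, qp_max=51):
        # Assumed content complexity: the bits a frame would cost at QP 0.
        self.complexity = complexity
        # How strongly each new observation corrects the estimate (0..1).
        self.smoothing = smoothing
        self.qp_min = qp_min
        self.qp_max = qp_max

    def predict_bits(self, qp):
        # Assumption: frame size halves each time QP rises by 6.
        return self.complexity * 2.0 ** (-qp / 6.0)

    def choose_qp(self, target_bits):
        # Invert the model to find the QP expected to hit the bit target,
        # then clamp it to the allowed range.
        qp = 6.0 * math.log2(self.complexity / target_bits)
        return int(min(self.qp_max, max(self.qp_min, round(qp))))

    def update(self, qp, actual_bits):
        # Feedback loop: infer the complexity implied by what the encoder
        # actually produced and move the running estimate toward it.
        observed = actual_bits * 2.0 ** (qp / 6.0)
        self.complexity += self.smoothing * (observed - self.complexity)


rc = ToyRateControl(complexity=64000.0)
qp = rc.choose_qp(target_bits=1000.0)    # model predicts 1000 bits at this QP
rc.update(qp, actual_bits=2000.0)        # frame came out twice as large
next_qp = rc.choose_qp(target_bits=1000.0)  # so the next frame gets a higher QP
```

The point of the sketch is the loop itself: predict, encode, observe, correct. What changes between CBR, VBR, and a quality-driven mode would be the constraints placed on the target, not the QP-versus-bits modeling.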
Mark: So now did you create a new quality metric, or how exactly are you, because presumably you're controlling the QP, at what, the frame level, I guess, or maybe it adjusts. But you need some way to measure qualities so that you're not introducing artifacts, or leaving bits on the table. So what's the mechanism there?
Ramzi: I think, as a definition of quality, you can define quality as the point where you start seeing artifacts. And I always like this example: if you have SD content, like PAL or NTSC, and you code it in MPEG2 and you give it 18 meg, you wouldn't see the difference with AVC at 5 meg, it would look the same.
The key point is defining where the just noticeable difference, or distortion, is. And I think that is the key definition in what we do. So we tune the model, in a sense, to define what is the viable QP for that particular content. And for our adaptivity, I think we went down to the coefficient level. So we do it on a scene basis, on a group of frames or GOP basis, on a frame level basis, on a block basis, and we go even to the coefficient level.
Because we think, even within same frame, the content is not the same. I mean the perceptual importance, like right now we're talking, I think the certain part is we are talking the background. You can steal some bits from here, from here. So we applied those, we applied those concepts into the QVBR.
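The idea of stealing bits from less important regions of a frame can be sketched as per-block QP offsets. This is a minimal illustration, not the QVBR implementation: the function name block_qp_offsets, the importance scores in the range 0 to 1, and the max_offset limit are all hypothetical, and in practice the importance values would come from some perceptual model rather than being supplied by hand.

```python
def block_qp_offsets(importance, max_offset=4):
    """Map per-block importance scores in [0, 1] to integer QP offsets.

    Blocks that matter more than the frame average get a negative offset
    (finer quantization, more bits); blocks that matter less get a positive
    offset (coarser quantization, fewer bits).
    """
    mean = sum(importance) / len(importance)
    offsets = []
    for value in importance:
        offset = round((mean - value) * 2 * max_offset)
        # Keep every offset within the allowed swing around the frame QP.
        offsets.append(max(-max_offset, min(max_offset, offset)))
    return offsets


# Hypothetical frame: a speaker's face (high importance) and background blocks.
offsets = block_qp_offsets([0.9, 0.8, 0.2, 0.1])
frame_qp = 30
block_qps = [frame_qp + o for o in offsets]
```

Because the offsets are centered on the frame's average importance, bits taken from the background are roughly what gets spent on the important blocks.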
Mark: Yeah. Awesome. Yeah, that's super interesting. And I said earlier, I wanted to talk about VMAF, because you specifically pointed out that VMAF is a machine learning-based quality metric. And so I'd love for you to talk just a little bit more about that. I'm quite certain that all of our listeners are familiar with VMAF, and even use it. But maybe you can share a little bit more about what's really going on under the hood, because I wouldn't assume that everyone knows that. So what can you tell us about how VMAF actually works?
(32:42 How VMAF works)
Ramzi: The idea of VMAF, I think, is you want to fuse, or use, different metrics that model different ways to measure quality. As an example, for instance, PSNR is an excellent metric to measure fidelity of the signal, but it has no perceptual value, or importance, baked into it. So what VMAF tries to do is fuse, give some weight to, different metrics, depending on content.
As an example, if you have high textured video, I think PSNR will drop quite a bit, and PSNR will not be reliable. So you will mostly give weights to another metric, SSIM, or VIF. So the trick of, and the genius part of VMAF is, depending on the content, you give different weights to the metric, and the training was done based on a human observer giving a MOS score.
So that's VMAF, and the model will try to determine, for each frame, the weight of each metric. And what is interesting in VMAF is that its simplicity, that it uses these metrics under the hood, is also its shortcoming, because it's the weighted average of these metrics. So if these metrics are not quite reliable enough, VMAF will be the weighted average of that.
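The fusion idea can be sketched in a few lines of Python. This is a simplified stand-in, not VMAF itself: the open-source VMAF fuses features such as VIF, DLM, and a motion measure with a support vector regressor trained on subjective scores, whereas the sketch below fits fixed linear weights for two generic metrics by least squares. The function names and the tiny training set are hypothetical.

```python
import math


def psnr(reference, distorted, peak=255.0):
    """PSNR in dB between two equal-length sequences of pixel values."""
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak * peak / mse)


def fit_fusion_weights(metric_a, metric_b, mos):
    """Least-squares weights (w_a, w_b) so that w_a*a + w_b*b approximates MOS."""
    saa = sum(a * a for a in metric_a)
    sbb = sum(b * b for b in metric_b)
    sab = sum(a * b for a, b in zip(metric_a, metric_b))
    say = sum(a * y for a, y in zip(metric_a, mos))
    sby = sum(b * y for b, y in zip(metric_b, mos))
    det = saa * sbb - sab * sab
    if det == 0:
        raise ValueError("metrics are collinear; weights are not unique")
    return ((say * sbb - sby * sab) / det, (sby * saa - say * sab) / det)


def fuse(weights, a, b):
    """Fused quality score: a weighted sum of the two metric values."""
    return weights[0] * a + weights[1] * b


# Hypothetical training data: two metric scores per clip and a human MOS.
weights = fit_fusion_weights([1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [2.0, 3.0, 5.0])
score = fuse(weights, 1.0, 1.0)
```

The shortcoming Ramzi mentions shows up directly in fuse: the fused score can only be as reliable as the metric values fed into it. A non-linear regressor is one way the effective weighting can vary with content, which the fixed weights in this sketch do not capture.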
Mark: Hmm, interesting. Well, so are robots taking over? You know, I have to tell you, so, Dror Gill, you know, who's the CTO of Beamr, he jokes all the time that we're heading into a world where, you know, robots are gonna be creating our video and robots are gonna be watching the videos, so the humans can go out and play, they can go out and enjoy the sunshine. So anyway, it's a fun joke. But, no, I think it's a really good way to wrap up our conversation is, well, you know, what is, what's the future for machine learning? I think we all can agree that, you know, we are a long ways away from these purely machine learning, AI based codecs. And I know there's some, you know, I'm aware of a couple projects that are really groundbreaking and very academically stimulating. But when you look at the, you know, the real practical requirements for deployment, they're a long ways away. They're not coming anytime soon. But I'm curious from your perspective, and you know, especially as a very large commercial vendor, very important encoding supplier, you know, to the industry, you know, what's the time horizon? What do you see in the future? What's coming soon and what's still 10 years out or longer?
Ramzi: So robots are not taking over. This is my personal view.
Mark: That's good news. In my opinion, in my humble opinion.
(36:24 Robots are not taking over)
Ramzi: I will quote and kind of segue to my answer, I'll quote Michael I. Jordan, not the basketball player, the machine learning scientist, "The revolution of machine learning hasn't happened yet." So it will happen, but not yet. So like I said, video is kind of the trailing industry compared to image. So what I think we will see in the future, is more use of machine learning, but more focused on the perceptual aspect. The vision aspect, the coding tools. I think with VVC, AV1, or HEVC, I think we have solid codecs. What's missing there is how to tune those codecs perceptually.
What I mean by perceptually is if you have film grain, if you have graphics, if you have sport content versus, I think there is room there. And we, at Elemental, and I'm pretty sure everybody in this industry, are seeing there is potential there. So the perceptual element is still missing, or probably it's there, but there is a large room for improvement. So I think machine learning will shine in that area. I think in 10 years, or probably five to 10 years, the deep learning-based codec will start to reemerge when they solve the practical aspects, like standardization, what do you send the network, how do you compress the network?
And one key, and I shout out to the research community, quantization parameter is the key parameter in any video codec. Start with that. Start with the QP. And I think those two letters are the letters I've used most in my last 18 years. So I think research machine learning has to start with those two parameters, two letters, sorry.
Mark: Yeah. Yeah. I couldn't agree more. And I think that's just an awesome way to end, and so wrap up this episode. Well, Ramzi, thank you so much for sharing. You know, this was really, we covered a lot in 30 or 35 minutes here. So maybe this is one of those the listeners will have to listen twice to, one of those episodes, but that'll be great. We'll have you back for sure. And you know, maybe we'll do a part two, or we'll find some other interesting elements to talk about, you know, around video compression, video encoding. It's a very fun topic, so yeah, thank you again for coming on The Video Verse.
Ramzi: Thank you, Mark, it was a pleasure. Thank you, Mark.