In this special interview with Debargha Mukherjee, Principal Engineer at Google, you will gain critical insights into the development of AV1, including three tools that give AV1 a unique advantage in the market. You’ll want to listen to the end where Debargha shares secrets for how to get the most from a new codec standard so that you can be assured that your codec evaluations are as accurate as possible.
Debargha Mukherjee- So I'm Debargha Mukherjee. I'm a Principal Engineer at Google, and we worked on AV1 at Google between 2015 and 2018. The core engineering team included me, Yaowu Xu, Jingning Han, and Zoe Liu, who is currently a co-founder of Visionular.
Mark Donnigan- Yeah, I am really excited about this conversation. I kind of know the history, but I always like to hear it from those who were actually there, rather than second and third hand. Tell us, what is the history of AV1? How did the standard even come to exist?
Intro to the development of AV1.
Debargha- Yeah, so, there was a company called On2 Technologies that was acquired by Google in 2010. And at that time, On2 had a codec called VP8. So the first thing that was done by Google at the time was to take VP8 as is and make it an open-source royalty-free codec. So that was the birth of the WebM project. And it started off with VP8, and VP8, as you may know, is still being used in many real-time video applications, all over the world.
So then, right after VP8 was released, we started working on VP9. And the goal was to release VP9 at the same time as HEVC, around 2012 or 2013. Which we did. So VP9 was competing with HEVC, and it turned out to work pretty well.
YouTube then decided to use VP9 and started streaming VP9 in 2013 onwards. And slowly, over the next two or three years, the number of videos that were available in VP9 that were being consumed daily, grew to billions a day just because YouTube was streaming VP9.
On the back of the success of VP9, what happened was we started looking at VP10, a further generation beyond VP9. But at the same time, there was a lot of interest in the project from companies that also have a big presence in online video delivery, such as Netflix, Amazon, Facebook, and Microsoft.
So what we decided at that time was instead of doing VP10 as part of the WebM project, to create an industry consortium. So those negotiations started in 2015 and by the end of 2015, what was born was an industry consortium called AOM, or Alliance for Open Media.
So then it was decided that VP10 would no longer be released, but instead, whatever we had done in VP10 would be proposed for this AOM codec. And the first AOM codec was to be named AV1. There were two other projects at that time, from Cisco and Mozilla, that were also aiming to produce a highly competitive royalty-free codec. Those were called Thor and Daala.
So the idea was for the Thor and VP10 projects to be combined and to produce one codec that would be part of the AOM umbrella, and that would be called AV1. So the work on AV1 started in earnest in early 2016, and then it was worked on between Google, Cisco, Mozilla, Microsoft, and several other companies, including Netflix and so on. And then the bitstream was finalized in 2018, around the middle of 2018. And that was roughly five years after HEVC and VP9.
And that achieved, at the time, about 30% more compactness (bitrate efficiency) over VP9, which was our original target. At the time AV1 was released, I think among standardized codecs it was the best, and we had pretty high confidence that it was the best codec in terms of compactness. And now, going forward from that, we are trying to get AV1 adopted more widely in the industry, and looking forward to the next generation of an AOM codec in a couple of years' time.
Mark- Here's what I've always wanted to know. How in the world did Google codec engineers, Facebook engineers, Cisco, Mozilla, and all these companies come together and, first of all, agree? Because everybody has their own agendas, right? Their own businesses, their own use cases even, some very complementary, some very different. And then how did you work on this together? Did you literally all get in the same room for a certain period of time? Were you meeting, like, weekly? What did that look like?
Debargha- Yeah, so the process was kind of different from the MPEG process. The MPEG process would be like, there'd be meetings every three months in some exotic place where people would get together for a week, and then they would fight, and after that go for lunch or dinner together, and so on. But in AOM, we were not doing that. Instead, we were doing weekly calls, but then from time to time, say once in six months or a year, we would have an in-person event, where we would get together, and a lot of the pending items, like pending proposals or pending resolutions of experiments, would be discussed and resolved one way or the other over a short period of two days or so. So that is how we did it.
Of course, the processes that we followed at the time were not as structured as an MPEG process would be, but then it was kind of a learning experience because that was the first time we were doing it. For those of you who are familiar with the way the next-gen codec from AOM is being developed, it's much more structured now.
But, that also allowed us to move a lot faster, which was actually good because we wanted to release AV1 within five years of VP9, which we were able to do. Now in terms of working together, of course, when we have proposals from multiple companies in the same area of the code, or the same area of the code base, there would be some amount of competition, which is actually not unhealthy.
It was actually good because we pushed each other and pushed the other proponents to step up their game and make things faster or simpler and all of that. And the one thing is, we were not competing on royalties. That is actually a big difference. It took a lot of the politics out of the game. It was mostly technical discussions and technical reasons driving decisions one way or the other, which was a good thing to have.
AV1 foundation - VP9, VP10, Thor, Daala.
Mark- How was VP9 chosen, or VP10 ultimately as the foundation versus Thor and Daala?
Debargha- At the beginning of the AV1 process, we had a VP10 code base, which was built off of VP9, so the code base was already ready and already performing better than VP9. VP9 was, let's say, equivalent to HEVC in coding efficiency, and this was already getting maybe 10-15% better than that.
Daala was an image codec, not a video codec, so the Daala code base could not have been used right away. And Thor was more for very fast encoding schemes. So for the real-time video conferencing range of the complexity-compactness spectrum, Thor was a good candidate, but it didn't have as many bells and whistles as the VP9 code base to really produce a very high-performance VOD codec. That is why starting from VP9 was a very natural choice.
In the early days of the AV1 process, it was decided that we'd use VP9 with some modifications as the starting point for the code base. Some changes were made, mostly to the transform types and so on. And that was what we called VP9 plus.
VP9 plus was what was used as a starting code base for AV1 development. And then all the other tools that were already there in VP10, and some tools from Thor, and some tools from Mozilla, were gradually proposed one by one, as well as new tools.
Intro to AV1 tools.
Mark- Can you first give some sort of overview of the breadth of tools that were added to AV1? And then let's talk about them.
Debargha- AV1 is a hybrid, motion-compensated video codec, just like any other mainstream codec today. It's not a machine learning-based codec from the newer paradigms that completely throw away what has been done over the last 20, 30 years. It's still a hybrid, motion-compensated codec.
Now, within that framework, you can divide the whole thing up into two or three, or maybe four different areas.
One is definitely prediction, so you have to predict - how well you predict, that's one part, and after you've predicted, you have the prediction residual and you have to compress it and code it very efficiently, that's like transform and encoding of the coefficients, that's another big part of the codec.
Then, after a decoder has reconstructed a frame by doing the inverse quantization, inverse transform, and adding it to the prediction, what you have is a reconstructed frame, but then that reconstructed frame has many artifacts, compression artifacts. So then you need to do a lot of filtering to get rid of the artifacts. And once the filtering processes have been completed, you would want to put them back in your reference buffer pool, so that they can be used for subsequent frames as a good predictor. That's the in-loop filtering part, which is another big component of the codec.
AV1 In-loop filtering pipeline.
If I had to focus on a few areas where we think AV1 really excelled, I would start with the in-loop filtering pipeline, because in AV1 it is quite a bit more sophisticated than what was there in other codecs at the time.
There is this notion of in-built super-resolution, so I'll talk about that a little bit. AV1 allows a mode of operation where a frame can be coded at a lower resolution than the original source.
If the original source is 1080p, maybe we code at a lower resolution. And if we do, then all the predictions that we do, like inter-frame prediction, need to predict from a higher-resolution reference frame. So the prediction process in AV1 is already tuned to do this kind of across-scale prediction. Given any frame, even an inter frame, we can actually do pretty well by coding it at a lower resolution.
Now, once you do that, then we do the transform and all that, and then you get a reconstructed frame, which is at a reduced resolution. And then, at some point, in the in-loop filtering pipeline, we would actually do an upscaling to get it back to the full resolution, before we move the frame back to the reference buffer pool, so that it can be used for subsequent frames. And at the same time, we can also tap it out and also use it as an output frame.
Now, on both sides of this upsampling process that the in-loop filtering does, there can be some other processing.
For instance, if you have coded the video at, let's say, 720p, starting from 1080p, then your reconstructed video is still 720p, and the blocking artifacts that you have to get rid of are at the 720p resolution. So the deblocking that you need to run is at the lower resolution. But after you upscale, you do another level of processing to improve the quality again and recover some of the detail, basically super-resolving rather than just plain upscaling.
So, you can divide the in-loop filtering process into two main parts. One part is applied at the coded resolution. Then you have the upsampling, and then you have another part that is done at the upsampled resolution. That's the overall framework we have in AV1.
AV1 Constrained Directional Enhancement Filter.
Now among the tools used at the coded resolution, we have deblocking, which I just mentioned. And then there's another tool called CDEF, a joint tool between Cisco and Mozilla. It's the Constrained Directional Enhancement Filter. It basically figures out the dominant directionality in a small block and then filters along that direction, with some secondary components orthogonal to it. So it's kind of like a star filter, but oriented based on the dominant direction. That's CDEF.
So you first do deblocking, then apply CDEF. Then, if super-resolution is being used, you upsample; if not, you skip that step. After you have the full-resolution frame, you apply something called loop restoration. The loop restoration tool involves some kind of learning. In AV1, the loop restoration tool has a big component based on the Wiener filter.
The Wiener filter works this way: you look at what you have at the input of that process and at the source that you want to convey, and you find the parameters of a filter, learned on a per-frame basis and signaled as part of the bitstream. The decoder then uses the filter taps it receives from the bitstream and applies the filter. Hopefully, the output of the filter gives you a result that is much closer to the source than what you started out with.
Okay, so that's the Wiener filter, and then there's also a guided filter, which can be used alongside it. For every small block, you can use either the Wiener filter or the guided filter. Now, this loop restoration tool operates post-upsampling, and it has to have some learning elements, so that no matter how much you reduce the resolution by, it can still learn something, because you are sending the parameters in the bitstream.
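As an illustration of the per-frame learning idea, here is a minimal 1-D least-squares sketch, assuming a simple unconstrained FIR filter; the actual AV1 loop restoration uses a constrained separable 2-D Wiener filter with quantized taps, which this does not reproduce:

```python
import numpy as np

# Minimal 1-D sketch of per-frame Wiener filter training: solve for the
# filter taps that map the reconstructed signal as close as possible
# (in the least-squares sense) to the source. The encoder would then
# quantize these taps and signal them in the bitstream.

def _windows(signal, ntaps):
    """Stack the ntaps-wide neighborhood around each sample into rows."""
    half = ntaps // 2
    padded = np.pad(signal, half, mode="edge")
    return np.stack([padded[i:i + len(signal)] for i in range(ntaps)], axis=1)

def train_wiener_taps(recon, source, ntaps=5):
    """Least-squares fit of an ntaps-long filter taking recon toward source."""
    taps, *_ = np.linalg.lstsq(_windows(recon, ntaps), source, rcond=None)
    return taps

def apply_wiener(recon, taps):
    """Apply the learned taps, as a decoder would after reading them."""
    return _windows(recon, len(taps)) @ taps
```

Because the identity filter is one of the candidates the least-squares fit can choose, the filtered output is never further from the source than the unfiltered reconstruction, which is the point of learning the taps per frame.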
So that's the entire pipeline. Several of the tools here had actually not been seen in the video codec community at all. CDEF is a completely new tool. The Wiener filter that we use is a separable Wiener filter, which is much simpler to implement than a non-separable Wiener filter. And the guided filter is a tool that had not been used in any codec in the past.
Guided filters have been used in many computer vision applications in the past, but not in a codec scenario, and the way we used them is also very novel. This is the entire in-loop filtering pipeline for AV1. I think we really pushed the state of the art through AV1.
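Putting the stages together, the ordering described above can be sketched as follows; the stage functions here are placeholders that just record the order of application, not the libaom implementation:

```python
# Sketch of the AV1 in-loop filtering order described in the text.
# Each stage appends its name so the ordering is visible; real AV1
# filters operate on reconstructed pixel data.

def deblock(stages):
    return stages + ["deblock"]            # coded-resolution stage

def cdef(stages):
    return stages + ["cdef"]               # coded-resolution stage

def upsample(stages):
    return stages + ["upsample"]           # only when super-resolution is used

def loop_restoration(stages):
    return stages + ["loop_restoration"]   # upsampled-resolution stage

def in_loop_filter(use_superres):
    """Apply the pipeline stages in the order the text describes."""
    stages = deblock([])
    stages = cdef(stages)
    if use_superres:
        stages = upsample(stages)
    return loop_restoration(stages)
```

For example, `in_loop_filter(True)` yields the four-stage order with upsampling between the coded-resolution and full-resolution parts, and `in_loop_filter(False)` simply skips the upsampling step.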
Mark- This is really the first implementation of machine learning in a codec, right? HEVC did not have any tools that used learning.
Debargha- So, the Wiener filter kind of tools were not actually adopted in HEVC; they are now in VVC, but they were not in HEVC. AV1 is a codec in between, and it has Wiener filter based tools, but that's also just one of the several tools we have in the pipeline.
AV1 tools for WebRTC.
Mark- I know that WebRTC is a major application. In fact, it's really interesting that, generally, a new codec standard is released and it's always VOD that is the first use case because, you know, of course you can, time is not as sensitive. You can throw a bunch of computing at it and, you know, just generally it's an easier use case, right. And yet with AV1, we're seeing the opposite. RTC is really the first use case. So, tell us about some of the RTC tools that are included.
Debargha- Well, I think, you know, in the RTC use case, one of the major drivers is applications like screen sharing. I think AV1 is the first codec where we have screen content coding tools that are also part of the main profile. So anybody who implements hardware for decoding AV1 has to support them. And if you have an AV1-compliant decoder, it becomes much more convenient to support screen content sharing and tools like that.
We have several tools built into AV1 that support screen content sharing. There is a palette mode: whenever you have a screen, or graphics, or a big class of games, you have blocks of an image which only have a few colors.
If it's text, you'd probably have two colors. Actually, two is not quite true, because there is some anti-aliasing being done, so you may have five or six colors; two almost never really happens. But even if it is two to six colors, there are modes to say that we only have that many colors in a block, and then it's much easier to signal that by sending a mask of which pixel belongs to which color, rather than using a traditional transform-coding scheme, which is not really helpful for the very sharp-edged content you see in text or graphics. So palette modes are very useful.
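As a rough sketch of the idea (the function names are illustrative; real AV1 palette coding also predicts the palette from neighbors and entropy-codes the index map):

```python
# Hypothetical sketch of palette mode: a block with only a few distinct
# colors is signaled as a small palette plus a per-pixel index map,
# instead of transform-coded residuals.

def palette_encode(block, max_colors=8):
    """Return (palette, index_map) if the block has few colors, else None."""
    palette = sorted(set(v for row in block for v in row))
    if len(palette) > max_colors:
        return None  # too many colors: fall back to regular coding
    lookup = {color: i for i, color in enumerate(palette)}
    index_map = [[lookup[v] for v in row] for row in block]
    return palette, index_map

def palette_decode(palette, index_map):
    """Reconstruct the block losslessly from the palette and index map."""
    return [[palette[i] for i in row] for row in index_map]
```

Note that this mode is lossless for the block: decoding the index map through the palette reproduces the pixels exactly, which is exactly what sharp text and graphics need.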
Then we also have the intra block copy mode. Intra block copy mode and palette mode, those kinds of tools exist in HEVC, in the screen content extensions, but no one actually implemented them in hardware because they're part of an extension, and it's more costly to implement those extensions.
But with AV1, because they are part of the main profile, it's much more convenient for people to support screen-content-like applications. The intra block copy mode is actually not too easy to support in hardware, for various reasons. But the mode we do have actually works pretty well if you have purely screen content. If you have mixed content, then it's not that useful the way it's done in AV1; we are trying to fix this for the next-gen codec. But in AV1, for pure screen content, the mode works well.
The third tool in AV1 that supports screen content is the transform types. When you have text-like content, or graphics-like content with sharp edges, the traditional transforms, which are really tuned for natural content, often don't work very well. So you don't always want to use a transform.
You can get rid of the transform and just bypass it with what we call the identity transform, which means there's no transform at all, and code the residue just like a regular block of pixels. That actually works much better, because you have fewer pixels to send information for: a lot of areas are the same, then suddenly there are changes, and then everything is the same again, so you only have to signal the changed positions. If, on the other hand, you put a transform on that, the energy gets spread out over many coefficients, which is much more inefficient. So, these are the three main tools that support screen content. And for the RTC use case, this is one of the major applications. We were happy that we were able to keep those tools in the main profile.
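A small numeric illustration of the point, using an orthonormal 1-D DCT-II built by hand (not AV1's integer transforms): a residual where a single pixel changed is one nonzero sample in the pixel domain (identity transform), but its energy spreads across every DCT coefficient.

```python
import numpy as np

# Why bypassing the transform helps sharp screen content: a sparse,
# edge-like residual stays sparse under the identity transform, but a
# DCT spreads it over many coefficients.

def dct2_matrix(n):
    """Orthonormal DCT-II basis: rows are frequencies, columns are samples."""
    k = np.arange(n)
    m = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0] /= np.sqrt(2)
    return m * np.sqrt(2.0 / n)

def count_significant(coeffs, thresh=1e-6):
    """Number of coefficients that would need to be coded."""
    return int(np.sum(np.abs(coeffs) > thresh))

residual = np.zeros(8)
residual[3] = 100.0  # one changed pixel, as in text-like content

dct = dct2_matrix(8)
identity_nonzeros = count_significant(residual)    # stays at 1
dct_nonzeros = count_significant(dct @ residual)   # spreads to all 8
```

With the identity transform there is one position to signal; after the DCT, all eight coefficients carry energy and would need to be coded.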
Mark- Obviously you get bit rate efficiency, with these tools. Is there also some computer processing efficiency, or is it mostly bandwidth?
Debargha- I'd say this is mostly bandwidth. Computational efficiency is a little tricky, because anytime you go from one codec generation to the next, something has to give, and usually what gives is the computational cost. What you've seen over multiple generations of codecs is that anytime a codec is released, people usually focus on the VOD use case, because they can show gains of 10%, 20%, 30%, 40% over the previous-generation codec. But that assumes a very slow encoder, which may be acceptable for the VOD use case, though even for VOD it can be too slow, especially for services like YouTube, which have millions of videos being uploaded every hour, every day.
For Netflix, where you have a curated catalog and the amount of content is smaller, it may still be possible, but coding complexity is still a problem. When you go to the real-time use case, it's a much, much bigger problem. I think the codec community as a whole needs to do a better job of reducing the time between the release of a new codec specification and the time when it becomes ready for use in a real-time scenario. Typically a lot of engineering resources need to be put into that codec, starting from a very heavyweight encoder, to get it to a stage where you can actually use it in a real-time video conferencing scenario.
Mark- Yeah, it's very interesting what you just said because the desired state for almost all of the services and video platforms is one codec to rule them all, you know, rule the world. And, you know, you're right, if you are an SVOD service, then that's maybe more possible because you're only talking file-based, you have a relatively small number of files. And if your content value is high, you probably can afford to put compute cycles behind what it's gonna take to do very high-quality and very efficient encoding.
However, for the rest of us who operate platforms that need to service a wide variety of use cases, what I just heard you say is what we all know: we're in a multi-codec world. Most people look at it purely from a playback perspective, but I heard a different angle in what you're saying.
What you're saying is that maybe in time there will literally be a codec that is built almost solely, or solely, for live, for example, with a very specific set of performance characteristics. And in that case, if I'm operating across a wide variety of use cases, I may have that codec deployed for my RTC, something else for VOD, etc.
AV1 Compound prediction modes.
Debargha- Yeah. The third tool is actually not one tool; it's a collection of several tools. Similar to in-loop filtering, where I didn't talk about just one tool but an entire area of the codec. On the prediction side, I think AV1 went a little overboard, to be honest, with the number of prediction modes that we support. And one of the areas where I think we did particularly well is the compound modes.
So compound predictions are where you combine two different predictors to create a separate predictor. So typically you can have a forward predictor and a backward predictor, and average them. That's the simplest form of a compound predictor. But in AV1, you could actually combine them in multiple ways. You could combine them based on how close the pixels are, so that's one mode. You could combine them based on the distance of the references, of each of the two references to the current frame. You could also combine them using a wedge.
If you consider a block with a straight-line wedge going through it, on one side of the wedge you could use one of the predictors, and on the other side you'd use the other predictor, with some blending in between along the edge. That's the wedge mode. So that's a set of compound inter-inter modes, how we combine two inter predictors. Then we also have combinations of inter and intra.
So for instance, in the wedge case, we could say that one side of the wedge is an intra predictor and the other side is an inter predictor, with some blending in between. That's the inter-intra wedge predictor.
And then you could also have the inter-intra smooth predictor. If you're predicting a block, then starting from the intra boundaries, you gradually blend the weights, so that near the intra edges of the block you weight the intra prediction more, but as you go further away from the edges, you start weighting the inter prediction more. And it proceeds in the direction of the prediction. So there are several inter-intra modes there.
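A rough sketch of the blending ideas just described (plain average, reference-distance weighting, and a wedge mask); the weights and masks here are illustrative, not AV1's actual quantized weight tables:

```python
import numpy as np

# Sketch of compound prediction: two predictors P0 and P1 are combined
# per pixel into a single predictor.

def average_compound(p0, p1):
    """Simplest compound mode: plain average of the two predictors."""
    return (p0 + p1) / 2.0

def distance_compound(p0, p1, d0, d1):
    """Weight each predictor by the temporal distance of its reference:
    the nearer reference (smaller distance) gets the larger weight."""
    w0 = d1 / (d0 + d1)
    return w0 * p0 + (1.0 - w0) * p1

def wedge_compound(p0, p1, mask):
    """Per-pixel blend: mask near 1 picks p0, near 0 picks p1; fractional
    mask values along the wedge edge give the soft blending."""
    return mask * p0 + (1.0 - mask) * p1
```

The same `wedge_compound` shape covers the inter-intra case as well: let one predictor be intra and the other inter, and let the mask either follow a wedge or decay smoothly away from the intra edges.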
In this entire area of compound prediction, we have many different ways of doing it. Separately they bring small gains, but put together as a package, we get a substantial gain. So those are the prediction tools. They are also sort of a foot in the door: they push the hardware to do certain things it was not doing before. And once the die is cast and the hardware can process inter-intra prediction in a better way, that also helps future codecs, by improving those tools and having much better prediction modes that use the same infrastructure. So it offers a good coding gain, but it also sets a standard for the industry going forward.
How to evaluate a new codec standard.
Mark- Amazing, well, thank you for highlighting that. You know I always want to be practical for the listeners. I know that everyone who listens is really gonna appreciate the history you provided and the information about the tools, but with any new codec standard, you really need to get under the hood to exploit it. My observation, even as a semi-technical person, is that decisions are sometimes made a little too rapidly about whether a new standard is effective. Maybe they just didn't take the time to really get under the hood, to learn and explore how to correctly configure and set up the encoder.
So, how would somebody who's listening and saying, yes, I know AV1 is coming, I clearly can feel the momentum – we're even starting to think about it – I need to get my hands on AV1 and start learning how these tools can work in my environment. What's your expert guidance for how to get up to speed on AV1 as an encoding engineer?
Debargha- When AV1 was finalized in 2018, because we were working on a very tight schedule, we hadn't spent much time making it faster. So when AV1 was first finalized, there was a huge camp of anti-AV1 folks who took the software, ran some tests, and found it was 1000 times slower than HEVC. That was probably true for maybe about a month, but within a month that 1000 times became maybe 100 times, which is still unusable. But the talking point stuck, and many people would say, "oh, it's 1000 times slower than HEVC, what's the point?"
But then over the next six months, it became fast enough that we could actually convince YouTube to start using it. And YouTube would not use it unless the complexities were within limits. So then AV1 started being more useful. But, the initial impressions people had, unfortunately, stuck.
So, if someone really wants to explore a new codec, whether it's AV1 or EVC or VVC, or anything else, you have to get the software and run it. But whenever you have difficulty, and you're seeing results that don't match what you were expecting, or what the proponents have been telling you you should get, the best thing is to get in touch with the people who are actually developing the software and ask what you are doing wrong.
I have seen lots of tests after AV1 was released, more after VP9 but also after AV1, that showed comparisons between AV1 and, let's say, the JVET VVC software or the HM software, but using parameters that are completely wrong.
If you use parameters that are wrong for AV1, you'll get really horrible results. For example, they'd inadvertently be using parameters that turn off all out-of-order encoding, and then the results would show AV1 as 30% worse than HM, which is obviously not the right comparison. In cases like that, I would suggest getting in touch with us and sending us your command line, so we can see what's going wrong. And of course, now, four or five years after AV1 was released, the documentation in the code base is much better, so the software itself has a lot more information.
Mark- Yeah, that's a really good suggestion. When you're dealing with a new standard, get in touch with somebody who worked on it. I've worked around standards bodies for a long time, and what I've seen is that everyone's happy to help, because we have a vested interest. We want to see what we worked on succeed. And if there's misinformation, or if we can help guide people onto the right path, we want to do that.
Debargha, thank you so much. This has been an amazing discussion and I am quite confident you're gonna be back, and we will talk again. We'll find another interesting angle and really, I appreciate your time. So thank you.
Debargha- Yeah. Thanks, Mark, for having me. I really enjoyed my time here.