TVV EP 19 - Exploring Versatile Video Coding

In this episode we are joined by Benjamin Bross and Adam Wieckowski, who head the Video Coding Systems Group at the prestigious Fraunhofer HHI research institute. Adam and Benjamin have been central to the development of the latest MPEG video coding standard, Versatile Video Coding. They explain VVC's guiding philosophy, what "Versatile" really means, and what it takes to go through the standardization process. We discuss the challenges of developing the reference codebase for such a complex standard, what they aim to achieve with their open source encoder VVenC, and the novel adaptive streaming techniques that VVC enables. All this, and how ready are AI tools for mainstream codec standards, really?

Welcome to "VideoVerse".

Zoe: Okay, so hi, everybody, and welcome to the VideoVerse. So today, we're going to have a new episode. And first, this is Zoe from Visionular. And then, for this episode, I have Thomas, who joins me from London. So he will be the co-host for this episode. So every episode is special, but this one, I would say, is really special, in that we invited Benjamin and Adam to join us today. I think, mostly, we are going to talk about some video compression and codec stuff. I would like Benjamin and Adam to introduce themselves. And as far as I know, they are joining this episode from Berlin. So I really appreciate this opportunity. The four of us together, we're going to have some good stuff to talk about. And so, now, Benjamin and Adam?

Benjamin: Thanks, Zoe, for the invite. We're very honored. My name is Benjamin, next to me is Adam, and we are heading the video-coding systems group at Fraunhofer HHI, and we focus on implementing video-coding standards to which Fraunhofer is also contributing. That's where I started in HEVC times, where I first met Thomas, also, when he was back at the BBC and we were working on HEVC, and then Adam came in right before VVC.

Adam: Yeah, I came six years back, when we were just getting started with VVC at HHI, and I worked for a few years trying to develop technology for VVC, for the standard itself. And for the last few years, I've been mostly doing implementation projects on VVC.

Benjamin: So it's funny: in standardization you always look for a software basis, and for HEVC, I guess, Thomas, correct me if I'm wrong, the one from Samsung and the BBC was the starting base. And for VVC, it was the software base that Adam was actually working on that we proposed.

Thomas: Oh, cool.

Zoe: Yeah, I just want to mention something. It's very interesting, because you two just mentioned HEVC and VVC, and of course, there's another name for each of these two standards, right? H.265 and H.266. At least, a lot of people say that's a nickname. So I'd love to learn why... I know these have two names because of the two standards organizations behind the scenes, but talk a little bit more about that. HEVC, I think, for me, I only knew that it was finalized in 2013, because that was also the year VP9 was finalized, and VVC was finalized in 2020. I knew that because that's two years after AV1 was finalized, and then people said, "Oh, now VVC has come out, and it has some advantages along that line." So you can tell us more, because you, and Thomas as well, were heavily involved in the development, I would say as the inventors, who finally pushed this standard into being.

Thomas: So perhaps you could talk a bit about what VVC adds on top of HEVC and how they're related. They seem to have quite a similar basic core, but there's lots of things that have been added in the VVC.

[00:03:53 Relation between VVC and HEVC]

Benjamin: Yeah, yeah, maybe to get everyone watching and listening on board. The core of all of them, be it HEVC, VP9, VVC, or AV1, is hybrid video coding: prediction plus transform coding of the prediction residual. But from standard to standard, as we went on from HEVC to VVC, this got more and more generalized, and of course, complicated. And so, a lot of coding tools have been modified or extended, and new coding tools have been added in prediction, transform, and entropy coding. But it's more an evolution than a revolution.

So the basic concept of all these codecs is still the same since H.261 and MPEG-1, and it has evolved, got more and more modes, and for encoders, more decisions to make, sometimes hard decisions. And yeah, for VVC, the most recent one, in which we both have been involved, the focus is on versatility. So it's Versatile Video Coding, and the focus there was mainly to have broad support for a wide range of applications, not just the usual coding efficiency, so saving bits, for example, 50% for the same perceived quality, but also to have specific coding tools well suited to efficiently code HDR 8K videos.

So really high fidelity video, 360 degree video for VR, AR applications. We have good low-latency streaming support with the modified high-level syntax for the system layer in order to facilitate low-latency transport. We worked on reference picture re-sampling, which is a feature that enables resolution switching within a bit stream. We can talk about that later because it's a specific use case with open GOP streaming, multi-resolution adaptive streaming with open GOP. That's very exciting because this is something that people always believed cannot happen, but VVC enables that.

And the other point that is attached to that is that you have most of this covered by a single profile. There's a Main 10 profile. Typically, you have different profiles for different applications. So then, not every decoder needs to implement the whole standard, but just a specific profile. For HEVC, there has been a Main profile, and a Main 10 for 10 bit. The Main profile is 8 bit only. Then, later, came the screen content coding extensions. I know you guys do a lot of gaming and this kind of computer-generated content. That was an extension to HEVC, so a separate profile. But at that time, HEVC was already deployed with the Main 10 profile.

So the decoders in the field could not make use of this nice advanced feature of screen content coding. And for VVC, everything that I mentioned, 8K, screen content coding, efficient streaming, low latency, is all baked into the Main 10 profile; the only additional profile is scalability. So scalability has always been, it's been there.

Zoe: Yeah, it's always been there.

Thomas: Yeah, so I remember some interesting discussions back in the HEVC days. Because discussions about scalability always get quite intense. So I'm very glad to see reference picture re-sampling in VVC. It was something that I tried to get into HEVC. And because it's captured, partly, by the scalability profiles, I think there was a reluctance to do that. But I think, now, there's a recognition that video codecs have to serve multiple applications. It was a problem with HEVC. It was something that AV1, actually, really tried to address and have a simplified profile, but it's really hard for hardware guys. They don't like having to support so many tools and so many kinds of things under one profile. So was it kind of difficult to get all these things in, and to get the software guys and the hardware guys all on board in the profile discussions?

Benjamin: Yeah, but that's a very good point because that... So now, we are a bit late in Berlin, one hour ahead of London, several hours ahead of Pacific time, but we are used to that because we had these standardization meetings going way beyond that hour. This is exactly that kind of discussion. But it's tedious if you develop and always have to convince people and discuss and make compromises. But at the end, I think it's a very good thing and a very strong advantage of this process because you have people that need to implement it. They have a focus on complexity, and that the chip doesn't get too expensive and stays affordable, that you do not have to pay a lot of money to build the chip.

And then, we're coming mostly from academia, having these nice ideas of how to compress, which are super complicated, and this clashes in these discussions. And at the end, they say, "Well, this is the most critical thing. We work on that." And one good example: for VVC, we introduced a neural-network-based intra-prediction. So coming from this hybrid classic coding approach, we tried to do the intra-prediction by a neural network, and that worked great. It saved like 4%. For people not that familiar with current codecs, 4% is quite a lot nowadays.

Zoe: Yes, 4% is a lot. Yes, this is a lot. 25 to 30%, that's a generational difference, and 4% as opposed to 30% is definitely more than 10% of that gain, right? So this is very important.

Benjamin: But this neural network, on a block basis, was really a no-go for implementers. So they gave us a really hard time. And at the end, we tried to cut it down, remove the non-linearities. At the end, we ended up having a matrix multiplication of several pre-trained, learned matrices, with the neighboring samples as an input to the multiplication. So this is how we, then, transformed the neural-network-based coding tool into a more or less simple matrix multiplication with offset for intra-prediction, which is trained, so still data-driven, but you don't need a neural network to implement it. And this is a very good example of how that works. But then, of course, we ended up with 1%. So it's just a quarter of the gain. And still, you could implement it way more easily than your neural network, of course.
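As an illustration of the "matrix multiplication with offset" idea described above (the tool is known in VVC as matrix-based intra prediction, MIP), here is a minimal sketch. It assumes a 4x4 block, a reduced boundary of four averaged samples, and weights stored in units of 1/64. The matrix, the offset, and the fixed-point scale are made-up placeholders for illustration, not the trained matrices or the exact arithmetic from the standard.

```python
# Toy matrix-based intra prediction: boundary samples -> matrix multiply -> offset -> clip.
# The matrix and offset used below are made-up placeholders, NOT the trained VVC matrices.

def downsample_boundary(samples, out_len):
    """Average groups of neighbouring samples down to out_len values."""
    group = len(samples) // out_len
    return [sum(samples[i * group:(i + 1) * group]) // group for i in range(out_len)]

def matrix_intra_predict(top, left, matrix, offset, bit_depth=10):
    """Predict a block as clip(matrix * reduced_boundary + offset)."""
    half = len(matrix[0]) // 2
    boundary = downsample_boundary(top, half) + downsample_boundary(left, half)
    max_val = (1 << bit_depth) - 1
    pred = []
    for row in matrix:
        acc = sum(w * b for w, b in zip(row, boundary))
        # Weights are in 1/64 units here, an arbitrary fixed-point choice for this sketch.
        val = (acc + 32) // 64 + offset
        pred.append(min(max(val, 0), max_val))
    return pred

# Hypothetical already-reconstructed neighbours of a 4x4 block.
top = [400, 404, 408, 412]
left = [400, 396, 392, 388]
# Placeholder matrix: every output sample is the plain average of the 4 boundary inputs.
matrix = [[16, 16, 16, 16] for _ in range(16)]
block = matrix_intra_predict(top, left, matrix, offset=0)
print(block)
```

With a trained matrix, each of the 16 rows would carry different weights, so each predicted sample would be a different data-driven blend of the neighbours; there is no non-linearity and no network to run, only multiply-accumulate operations.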

And I mean, on the other hand, the standard is very versatile, and the name was agreed pretty early. So coming back to your initial question, it was an issue squeezing all of those things into a single profile or a few profiles, but it was a prerequisite of the development to have something that is good for all the problems out there, more or less.

Zoe: Right, you basically just said, "Hey, we have all these applications" that this standard wants to address, right? And just now, you mentioned there's low delay and there's different content, including gaming, that type of screen content. You also mentioned 360, and you put all these different, I would say, new use cases for video into this one single profile, and this should be good for, let's say, the hardware implementation. Because you have the one profile, if they claim they support this profile, then they have to support every single thing within that profile, right?

And this also avoids the situation with HEVC, where there's a screen content extension profile that is not supported by many HEVC players, for example, which actually constrains at least certain usages of that standard. And we can see that VVC, at least from my perspective, took that lesson, which puts VVC in a better spot. You also mentioned... It's a great point. The AI part, at least, I think is a guide for us to go in one direction, showing, "Oh, there's great potential, 4%," but then you really have to get it simplified, otherwise the hardware guys will have their say. They announce that at events, and we rely on them, I think, at least for the standard, right?

When the hardware starts to support the standard, it actually helps build up the ecosystem for this standard. But then, at least, you got 1%. So talk a little bit more. What is the neural network? Why does the hardware not like that? But then, at least, that exploration gave a way of saying, "Okay, there's potential." It guided us to optimize along that way, and we finally got that 1%.

Benjamin: Adam, feel free to chime in, but I'm not a hardware expert. I can only parrot what we have been told during the development. So mainly, I think a neural network is great if you have a huge amount of data, like a picture. You get it in the network and you get it out. So pre-analysis, that's a very good example, look-ahead. All these tasks that are related to what is around the codec, I see as very good tasks for a neural network. But once you're at a block level, you have these...

Thomas: It's very branchy.

Benjamin: The quadtree, you have these ternary branches, binary branches. And then, on these blocks, you have then to execute networks on these different shapes and sizes and have them on a die with all the coefficients of that network that are trained, which consume a lot of memory, and exactly, this is what we've been told. That memory is supposed to be super expensive. So it's a combination of both, of memory and then...

Adam: I mean, one more thing is that the mode we're talking about is an intra mode, right? So you need the neighboring blocks already reconstructed to run your neural network, run your inference. So this creates a big latency, right? If you input an image, get out an image, you can allow yourself a few milliseconds of latency. But if this is really block for block in a serial manner, and you have a few milliseconds of latency to execute your neural network each time, this is too much. And every time, you have to fetch all of those coefficients and stuff, so it really gets too much. It was actually funny seeing this developed, where the colleagues were bringing in this tool, simplified, every meeting. And basically, every meeting there was some new requirement. First, it was the neural network, then they did a matrix multiplication and it was, "All right, guys, you can put it in VVC if you reduce the memory to this, and then to this, and then to this." But we did that, and in the end it works quite well.

Thomas: So I just wanted to ask you a bit about your software implementation experience. So you've developed your own implementation of VVC and HEVC before that, but for standardization, do you think that really helped you understand the complexity arguments, the fact that you'd actually had to implement an encoder, maybe not a decoder, but an encoder at least?

Benjamin: Yep, I can tell a little bit about HEVC, and then Adam has a lot to say about VVC. For HEVC, I just remember that, during standardization, there was this residual quadtree that splits the prediction residual in a quadtree for the transform, and we really had a hard time getting the complexity down in the reference software in order to justify the gain, and there was a lot of discussion around that. Is it justified to have this tree? And then the live encoder: at that time we did an HEVC broadcast live encoder. And for that, it turned out to be super helpful because it's cheaper and way faster to split in that domain and to restrict the prediction blocks and their size. So in an actual encoder, by implementing it for a live scenario, we learned that it's beneficial, and we would've wished to know that when we were still discussing and debating it in standardization. So, in retrospect, it helps, and maybe in the future too. And for VVC, what's the...

Adam: I mean, I came to product development from a different field. So I started in HEVC with the HM software, and for me, back then, it was the fast software compared to the state VVC was in at the time. So Ben was talking about HEVC, how stuff used to be too complex, and for me, it was the easy encoder. So of course, you're getting to VVC, and it used to be bad. You used to have to wait days for your experiments to finish. And the stuff I was initially working on was the partitioner, and partitioning in VVC, with all the new kinds of splits, is actually one of the most bothersome parts for the encoder. So in the end, when we were talking about the partitioning proposals, the discussions were kind of going in the direction of: who has the best way of controlling their partitioning algorithm?

So the arguments that were being brought in the standardization were not about the advantages of your partitions, but rather: how fast can you go through your partitions? Which of course, if you sit down for one or two months more, you're gonna come up with an even better way. So this was interesting to see. I think this was also like a new aspect in VVC, that implementation complexity was being looked at.

Benjamin: But you can see, at least, in partitioning, if you consider that, theoretically, all possible combinations of splits, you will not be able to compute that. So even for reference software and reference implementation, you need to have a fast search for that, and ideally, even shortcuts in there. But then, that's also a good example for, if you want to talk about AI encoding that we tried a lot. In literature, you find a lot of approaches that try to reduce the search space of the partitioning, and Adam also investigated that quite a lot. But since then, he's not, at least, in that codec-control domain, not a big fan of...
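To give a feel for why an exhaustive partitioning search is out of reach, here is a small counting sketch. It deliberately covers quadtree splits only (each block either stays whole or splits into four quadrants); VVC additionally allows binary and ternary splits, so the real search space is larger still. The function name is illustrative.

```python
def count_quadtree_partitions(depth):
    """Number of distinct partitionings of one block when each block may either
    stay whole or split into four quadrants, for at most `depth` split levels."""
    if depth == 0:
        return 1
    # Either no split (1 way), or split and partition each quadrant independently.
    return 1 + count_quadtree_partitions(depth - 1) ** 4

for depth in range(5):
    print(depth, count_quadtree_partitions(depth))
```

Four levels (for example a 64x64 block split down to 4x4) already give on the order of 5 x 10^19 quadtree-only partitionings, and each candidate would still need its own mode decisions, which is why even reference software relies on fast search and early termination.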

Zoe: Because you already mentioned the AI codec. But from the standard point of view, just as we mentioned, right? So you mentioned that there's a neural-network-based approach that, at least, manifested a potential 4% gain, and finally came down to 1% with a lot less complexity, a lot more friendly to hardware implementation. And then, on top of the standard, we talk about encoder optimization. Because there are so many modes, there's already a lot. Actually, back to H.264: I still remember that. I was attending my first conference when I joined Nokia, and my manager told me, "Okay, you have to really tell them that even though H.264 is very complicated, there's potential down there." But now, nobody thinks of 264 as complicated. And now we already have VVC, H.266.

Now, we talk about the optimization of the encoder, as Benjamin mentioned. Because there are so many tools that have been developed, how do we optimize it? And then, we notice the papers addressing that. When you talk about reducing the complexity, it means that you have to make a decision for different cases, and the network guides you to that decision. Following what was just mentioned, Adam may have some thoughts or different opinions regarding deploying a neural network for this kind of encoder optimization, to reduce the complexity.

Adam: So yes, just to step back a bit, you talked about how people used to think of H.264 as complicated. For H.265, the reference software, in the partitioning, actually still does the full search. So there is no early termination in there. So this also shows how much more complex VVC is. Because everyone knows that the reference software of VVC is, let's say, eight times slower than the software of HEVC, even though there already are a multitude of early termination modes in the search algorithm, especially for partitioning.

And now, people sometimes say, "Well, we're gonna standardize this," and then the implementers are gonna take care of it, and they're just gonna implement neural networks, because those can deal with everything. But I've been following the research, I've been following the papers, and what I see is that it's actually a very tough problem. So there isn't yet a really good solution to drive VVC partitioning using neural inference and still produce very good results at a higher speed. There's an interesting comparison that I did with HEVC, I guess.

On an encoder, you have the option to try different partitioning depths, different block sizes, but you don't have to, right? Your encoder can just say, "I'm just gonna do this one block size and not try anything else." So reducing the VVC search space in that manner actually gives you a lot of different working points and creates a convex hull of different working points, at least with regard to partitioning. And I'm just running a comparison with the state of the art, with results from the literature. And not many of the literature results can beat this convex hull. And it's only by a tiny margin, and this also shows you the complexity of the problem. But this is also because the VTM, the VVC reference software, is already heavily optimized with regard to the partitioning algorithm. So you're basically trying to optimize an already optimized search space, right? So it's, like, complex.
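The comparison against a convex hull of working points can be sketched as follows. Each working point is a pair (runtime, bitrate loss), where lower is better on both axes; the lower convex hull of the restricted-search points is the baseline, and a proposed method "beats" it only if it lies below the hull at its own runtime. All numbers below are invented for illustration and are not measurements from VTM or from any paper.

```python
def cross(o, a, b):
    """2D cross product of vectors OA and OB (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def lower_convex_hull(points):
    """Lower hull via Andrew's monotone chain; points are (runtime, loss)."""
    hull = []
    for p in sorted(set(points)):
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull

def hull_loss_at(hull, runtime):
    """Linearly interpolate the hull's loss at a runtime (None outside its range)."""
    for (x0, y0), (x1, y1) in zip(hull, hull[1:]):
        if x1 > x0 and x0 <= runtime <= x1:
            t = (runtime - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    return None

def beats_hull(hull, point):
    """True if the point has lower loss than the hull at the same runtime."""
    ref = hull_loss_at(hull, point[0])
    return ref is not None and point[1] < ref

# Hypothetical working points from restricting the partitioning search:
# (runtime in % of full search, bitrate loss in % versus full search).
restricted = [(20, 12.0), (40, 6.0), (60, 4.5), (80, 1.5), (100, 0.0)]
hull = lower_convex_hull(restricted)
print(hull)
print(beats_hull(hull, (60, 4.0)), beats_hull(hull, (50, 4.5)))
```

In this toy data the point (60, 4.5) falls off the hull because mixing its neighbours does better, and a proposal at (60, 4.0) still does not beat the interpolated hull, which mirrors how a seemingly good result can lose against simple search-space restriction.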

[00:25:27 Neural Network Optimization]

Thomas: Isn't there also kind of a chicken-and-egg problem with neural network optimization, which is you need to generate a ground truth of the actually optimum partitions, but that's enormously complex and you can't generate enough data to correctly train your neural network. So you have a huge kind of bootstrapping problem that you need to make it a bit faster, then optimize again, then a bit faster and a bit faster, and maybe that will take a long time for the people to get to a good neural network solution, 'cause you can't get the data until you're already fast.

Benjamin: Well, that's a very good point. Because probably, these networks have been trained in an already-reduced optimized search space. So they are losing maybe a lot of options that are already excluded. That's a good point.

Adam: And also, it's very hard to model the temporal dependencies. For all intra coding, there are actually solutions that are very good. But if you go into motion-compensated coding, this really becomes a very tough problem.

Zoe: Right, I think, at least, a neural network provides another angle to optimize, first the standard, and then the encoder. And just as was mentioned here, a neural network is first a structure design, and then a very important part of the whole approach is that you have to collect enough training data to train the neural network and get it into a good state in order to have good results. But besides that, you mentioned just now that your team is not only working on the codec standard, but even more on encoder optimization, providing something that, I believe, is going to be open source, and the open-source community will always value that. Even for us: we provide a commercial solution, but we always say that we actually stand on the shoulders of giants to climb up. So we'd like to hear a little bit more about your encoder effort, optimization-wise.

Benjamin: Maybe it's good to start with the motivation, why we did that. Because we had some kind of a core library that we licensed for broadcast appliances. And now, we have an offline encoder and a software decoder that are open source on GitHub. So why did we do that? Since we participated, as a main contributor, in the development of the standard, our main goal is to foster the deployment of the standard, which...

Zoe: Sorry, I don't want to disturb you, but Adam just mentioned the VVC reference software is eight times slower than HEVC's. Right, okay, go ahead.

Benjamin: Yeah, and HEVC reference software was already super slow. It's eight times super slow. But it provides a very good coding efficiency. So that's apart from encoder-

Zoe: That's the benchmark.

Benjamin: Apart from encoder-agnostic stuff, like some adaptive quantization and whatever you can apply to every codec, which you can add on top. So we want to keep that. So we do not want to sacrifice the high coding efficiency of the reference software, but at the same time, we want to have run times for the encoder that are usable, let's say, a couple of frames per second. It's not live, but you can encode a catalog of videos in a decent amount of time. And also, there are a lot of comparisons out there. A lot of people want to compare codecs and standards, and then there's a lot of confusion when they compare codecs and actually say that this is the standard. And we observed that with HEVC: people took the open-source encoder, for example x265, which is widely used, but it's not providing the full potential of HEVC that you would get, for example, with a commercial encoder. But in every comparison, these openly available encoders are used.

Of course, they're available. People can have them at hand. They can use them. And so we thought that it's a bit misrepresenting the potential of the standard, because in these comparisons... So HEVC and x265 are used interchangeably, and it's always hard to teach people the difference between a standard, which is just defining the syntax, and an encoder implementation, which can be super efficient or super bad, or in the middle, or it can be fast but not that efficient. So the encoder is really application-specific, and that's why we wanted, for VVC, coming back to the original motivation, to have something that is openly available for everyone but providing the best efficiency, or keeping the high efficiency, and then providing some trade-off points. When you want to go faster in encoding, you can sacrifice a bit, but you are never less efficient than the HEVC reference software.

So you're always better than the HEVC reference software. This was our motivation too. And so far, we are happy. In all comparisons that people have run, we are always the best-performing open-source encoder, by different margins, depending on the conditions, depending on the content, of course, but at least we are always ranked best. So this was the goal.

Zoe: Right, you make that very clear. So here is HEVC and then there is VVC, right? And then, the HEVC reference software, as you mentioned, if you really want to see the largest possible coding efficiency, you run the reference software, but it's really slow. And you're right, many times x265 has been regarded as representative of the potential of HEVC, but we have to respect that, because x265 as open-source software has actually helped build the ecosystem. When people want to learn at least the potential of HEVC compared to others, for example AVC or some other standard, it actually did show there's potential in the new standard, but not the full potential.

So then, you moved to VVC and said, "Okay, now we have a new standard, the potential is even larger, but then how will people see it? If it's even slower, a lot of people will just say, 'Okay, no, it's very slow,' then how are we going to do that?" So basically, we understand you wanna have a faster, open-source encoder to manifest the potential, but one that's always at least as good as its predecessor, HEVC. So at least you show there's potential, and even more if you can tolerate more, speed-wise.

Adam: So the way to look at it is: how much efficiency can you get for a given runtime, for a given resource utilization? And this is what we're trying to keep. Against any other open-source encoder, if you have the same runtime with the same resources, we want to have better efficiency. That's how we see it. So we want our fastest modes to be at least as fast as, and maybe even faster than, some of the x265 modes. We want to have comparable working points to other encoders, but always keep the efficiency better. So for x265, if you go to the faster modes, you also give up some efficiency, right? That's how you look at the curve. And for a given runtime, we really can keep the efficiency improvement of 40, 50%. I'm very proud of this result.

Zoe: Very proud. So basically, we talk about speed now. There is video quality, then there is coding efficiency in terms of bit rate or file size, and then there's encoding time, or encoding speed, right? So you said, "Hey, let's fix this one, so everything is running at a similar encoding speed." And then quality and bit rate together contribute to the coding efficiency performance. You said, "Okay, when we fix this, then your encoder, manifesting VVC, always shows at least the better result compared to the other baselines."

Thomas: So I was just gonna say, you are trying to show also the actual gain from the syntax itself. If you constrain the complexity, if you think of, say, VVC as being a kind of superset of HEVC, you should always be able to do better than HEVC for the same footprint, because at the very worst, you could just do an HEVC-like implementation inside it, but you've got lots of other options. So you should be able to do better, either faster or better quality, or both.

Adam: But that's a very interesting point. Because we have our presets, and our fastest preset still doesn't allow live applications, but just because of what you mentioned, we decided, for now, not to pursue faster presets, because then you're gonna go into this territory where you basically only have the modes that you would also have with HEVC. So why go there then?

Benjamin: But it's also part of the software structure that limits the speed, because it's still derived from the reference software. But I wanted to say, this is the encoder, but you mentioned the... Related to that, maybe it's good to separate two kinds of encoder tools. One kind are the ones that require a search, so you do a search, and then you signal the result. Or you always apply a certain tool to increase the fidelity, or you do a search that you do not signal, but then the decoder does the same search. We have some small decoder searches in VVC.

Because if there's no signaling, there is no choice to make, right? So the encoder and the decoder do the same.

So this is maybe a very interesting point if people are not that familiar with encoders. So you have things that an encoder can search, and then once it finds the optimum, it signals it, or it just does it without signaling, but then the decoder needs to perform the same search. And this actually has an impact on power consumption and runtime, so you can shift a lot of these encoder searches to decoder searches, but then the decoder takes longer and consumes more power, simply speaking.

Thomas: And there's some really big issues about that now. Because I think, for example, the EU is wanting to constrain the amount of power that televisions can use, and having all these hungry decoder chips is a big problem. So you really care about the complexity of the decoder and the complexity of tools that an encoder can't avoid. These are the most important things to optimize because you can always do a reduced search at an encoder through some clever ideas. But if you can't avoid this complexity, it hits the whole standard and the whole ecosystem.

Adam: So here's the interesting thing about VVC. So this algorithmic complexity, you can see it in the decoder complexity increase, which is around two times that of HEVC, which is actually very well kept in bounds, right, for double the compression efficiency. And we found two sides of it. So we found, for the encoder, that it's actually more efficient to take this additional algorithmic complexity and then reduce the search space. So basically, use those implicit modes instead of trying to search through modes that you have to signal. So this is good for the encoder, but of course, the decoder also has to do the work. So what we also did, we tried to find an encoder setting that would decrease the algorithmic complexity, so the runtime, so the power consumption, on the decoder side, which is funny, because we can keep most of the compression performance, but of course, there has to be a trade-off, and the trade-off is that the encoder needs a little bit more time to calculate this compression. So you can really shift the complexity.

Benjamin: So consider the scenario, you encode a title for video on-demand and it takes twice as long with that special low-decoder energy mode. You do that once, but then it's watched and played and streamed, depending on the popularity of that title, to a lot of devices that consume half the power. So this is the kind of trade off you have to keep in mind also when doing encoding and designing your encoder framework and your application too. Maybe it's, sometimes, a good idea to spend more effort and more complexity on the encoder side in order to save some at the decoder end.
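The trade-off in this scenario is simple arithmetic, sketched below. The "twice as long" and "half the power" ratios come from the example in the conversation; the absolute energy units are hypothetical, and the sketch assumes encoder energy is proportional to encoding time.

```python
def total_energy(encode_energy, decode_energy_per_view, views):
    """Energy for one encode plus all decodes of a title."""
    return encode_energy + decode_energy_per_view * views

def break_even_views(extra_encode_energy, decode_saving_per_view):
    """Number of views after which the extra encoder effort is paid back."""
    return extra_encode_energy / decode_saving_per_view

# Hypothetical units: a normal encode costs 100, one decode (one view) costs 1.0.
normal = {"encode": 100.0, "decode": 1.0}
# Low-decoder-energy mode: encode takes twice as long, decode uses half the power.
low_energy = {"encode": 200.0, "decode": 0.5}

views = 1_000_000
print(total_energy(normal["encode"], normal["decode"], views))
print(total_energy(low_energy["encode"], low_energy["decode"], views))
print(break_even_views(low_energy["encode"] - normal["encode"],
                       normal["decode"] - low_energy["decode"]))
```

Under these assumed numbers the extra encoding effort pays for itself after 200 views, so for a popular on-demand title the total energy is dominated by decoding, while for a one-to-one live call the balance looks very different.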

Zoe: Right, so you basically mentioned that, when you do the encoder optimization, you also take into account the decoder side, its complexity and the decoder search, and there are at least certain choices that the encoder can make, right? And you also mentioned that it depends on the use case: if it's a video-on-demand use case, you encode just once, and then there are a lot of decodes, because people watch one movie millions of times and that movie needs to be encoded only a very few times. But then there are also other use cases like live, for example, where you keep encoding, and at the other end, you keep decoding. And in that sense, when we all talk about energy, the power cost, whether we should go green, the use case also needs to guide how we optimize the encoder.

Adam: But that said, I just want to mention, from the experience with the VVdeC project, with our decoder project, VVC, even in its full complexity, is not all that bad. You can totally do software decoding on mobile devices, with our software, up to HD. So this is a thing that we can do, but you can easily use the full potential of VVC and it's still very reasonable.

Zoe: You already mentioned the maximum is twice. It's only twice as complicated as HEVC, right? But then, the overall coding efficiency gain, at least, you mentioned that it's quite big, 30 to 50%, for VVC compared to HEVC. And I just say... Go ahead.

Adam: These are the savings that we can get from just encoding a sequence and comparing, but with the additional tools that VVC has, we can have additional gains. So one of the things that we actually worked on was to enable continuous prediction, so-called open GOP, in DASH streaming, right? So you have your segments, your chunks, and we developed a method that you can use in VVC because of the RPR, that allows you to not have prediction breaks between the segments. Benjamin, maybe you wanna elaborate on that?

[00:42:39 A method of using VVC for not having prediction breaks]

Zoe: Let's talk about that.

Thomas: So that sounds really interesting.

Benjamin: Thomas, you already proposed that for HEVC, using this part from the scalability feature...

Thomas: Yeah, so the idea, I guess, is the same, that you want to change resolution, maybe, to optimize your quality for a given bit rate, and you maybe then want to move resolution up or down, but you don't want to send a key frame, an IDR frame, and restart the whole bit stream. So you need to be able to predict across resolutions. So that's something that, from the video conferencing side, I've always been interested in, and it's something that AV1 has for video conferencing. So is it, basically, the same idea in VVC?

Benjamin: Yeah, so you have two aspects here. One aspect is that you have that at the encoder. So for example, video conferencing, you have an encoder that adapts the resolution while encoding. Also, if you do, I guess, this adaptive... What's this called? Adaptive resolution, dynamic resolution, whatever you call it. Different vendors have different names for that. So if you have content that does not have a lot of high frequencies, it does not hurt to lower the resolution, it's like this convex hull optimization. Sometimes, you get a bigger bang for the buck if you lower the resolution and spend bits at a lower resolution, if you don't have so much detail. So you can do that, but the encoder is aware of that and does that.

The second thing that we will also showcase at IBC, you can see that in the demo. It's also not the live use case, but it's the adaptive streaming use case, where you have several encoders separate from each other, creating different renditions, for example, for DASH. So one low bit rate, low resolution, medium bit rate, low resolution, medium resolution, high resolution, high resolution, high bit rate. So you have different renditions of rates and resolutions, and then the client, depending on the network conditions in adaptive streaming, can switch to a lower or higher resolution, for example, depending on the network conditions. But then, you have a stream from an encoder, let's say in UHD, and a stream from a separate independent encoder in HD, and then the client starts decoding the pictures in HD, has the HD pictures in the decoded picture buffer, and then switches to the UHD rendition.

And then, the UHD pictures can refer to the HD pictures that are still in the decoded picture buffer because they are up-sampled, using this reference picture resampling. However, if you consider encoder/decoder, there's a drift because the encoder was encoding it using the original resolution, while now the decoder uses the up-sampled resolution. So the decoder uses different data than the encoder used to make these decisions. So there's a drift, and this, depending on the coding tool, can cause artifacts. So it works, but because you have that encoder-decoder drift, it can cause artifacts for a certain set of specific tools that are sensitive to that resolution change. And you can constrain, and there's a way in VVC to signal that, it's called constrained RASL encoding. For the high-level syntax aficionados, they know what the RASL is, it's a specific type of picture at the switching point. And these constrained RASL pictures, you need to constrain that small group of pictures when it switches to not use these tools that are problematic.

And then, you don't have artifacts anymore, because the resolution switch, with upsampling and referencing a different resolution than the encoder used to encode, is not a problem anymore. So this gives you the advantage of making use of previous pictures, so not breaking the prediction at a DASH segment boundary. And at the same time, it also smooths the transition. So if you have a hard cut, no referencing, you can see it jumps sometimes, if you switch resolutions. You see that, in streaming, it gets sharper or it gets blurrier or more blocky at a certain point when it switches. And by using this feature, of course, you need to apply these constraints at the encoder, but these can be signaled, so the decoder or the client knows that it's safe to switch resolutions because they're constrained. Then, you have a very smooth transition. And on top of that, Adam reminded me, you have another advantage when you do not switch.

Adam: Exactly. Because I mean, when you do video streaming, how stable is your network? Usually, fairly stable, right, at a specific level. So I think, most of the time, you do not switch, so you just decode the open GOP bit stream as it was intended to be decoded. Sometimes, your conditions would change, and then you have to change the rendition, and it's only at the changes where we have this special sauce that causes the drift to be constrained. But usually, this actually works without any drift. And I think this is, actually, where the additional gains are coming from. So using this technique, of course, depending on your segment size, so depending on how many switches you have, VVC can save you an additional 10 to 20% BD-rate, just because you don't have to start an independent bit stream in every segment.
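The cases Benjamin and Adam walk through can be summarized in a small sketch of how a client might treat the leading (RASL) pictures at the start of a segment. This is a toy model, not VVdeC or any player's actual logic: the `Segment` class, the `constrained_rasl` field and the returned labels are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Segment:
    rendition: str          # hypothetical label, e.g. "hd" or "uhd"
    width: int
    height: int
    constrained_rasl: bool  # assumed flag: encoder signalled constrained RASL pictures


def leading_picture_handling(previous: Optional[Segment], current: Segment) -> str:
    """Decide what happens to the leading pictures of `current`."""
    if previous is None:
        # Playback starts here: there are no earlier references, so the
        # leading pictures are skipped and decoding begins at the intra picture.
        return "skip"
    if previous.rendition == current.rendition:
        # No switch: the open GOP stream is decoded as intended, no drift.
        return "decode_as_encoded"
    resolution_changed = (previous.width, previous.height) != (current.width, current.height)
    if current.constrained_rasl:
        # Switch with constraints signalled: the drift-sensitive tools are off.
        return ("decode_with_resampled_references" if resolution_changed
                else "decode_with_switched_references")
    # Switch without the constraints: still decodable, but artifacts may appear.
    return "decode_with_drift_risk"


hd = Segment("hd", 1920, 1080, constrained_rasl=True)
uhd = Segment("uhd", 3840, 2160, constrained_rasl=True)
for prev, cur in [(None, hd), (hd, hd), (hd, uhd)]:
    print(leading_picture_handling(prev, cur))
```

In this model the middle case, staying on the same rendition, is the common one, which matches Adam's point that most segment boundaries need no special handling at all.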

Zoe: Right, so in these use cases, basically, because the switch is allowed, there is actually potential there, right? Big potential. And you have the VVC-enabled resolution switch without having to restart the stream, and that saves... We all know that. Otherwise, you have to put a synchronization frame, which is the intra frame. That takes a lot of bits and a lot of bandwidth.

Adam: Just for clarity, you still have to put the intra frame, right, because you want to be able to start the video at that segment. It's only the frames that precede the intra frame, they can use the last references from the previous frame.

Zoe: Got it.

Benjamin: In short, you can say, technically, this is very simple. So the implementation of that, because the encoder does not need to do that. You just run separate encoders, and then the client does it. And if the client is VVC-compliant, it does it, so you don't need to implement it additionally. So here we are back to having one profile solving it all. The biggest challenge is getting the idea that open GOP is not possible with resolution switching out of the minds of the systems people. This is the great challenge there.

Zoe: Okay, so you're basically showing that this is already possible and this is already being implemented by the proposal of the open GOP concept provided by VVC?

Adam: Well, our software, it can produce... We have this refresh type that can produce bit streams that are compatible with that, and I think all the open source software out there, I think GPAC is able to package them into DASH streams that just work. If you have a compliant player, it just works.

Benjamin: So we also have an integration of our decoder in the ExoPlayer for Android, and Bitmovin has shown a demo of that at NAB. And even in that ExoPlayer integration, you can send the bit stream switches with open GOP. It works out of this framework. So it's not just in our controlled environment, it's also interoperable with what's out in, at least, the open-source field and what companies are using. So this was actually great to see, that it is actually versatile. It's also working with other devices and software components.

Zoe: All right, so this is basically enabled by the standard, and then integrated into real use cases, and it is interoperable. So I do have a question before we close this episode, because I believe you, including Thomas here, have worked on standards from originally H.264/AVC, even back to the early stage, then HEVC, followed by VVC. And I believe your team right now is not only trying to implement this open-source encoder to show the full potential of the new standard, you may also be getting involved in the new standard effort, even though, we believe, there has not been an official call for proposals for a new standard. So I think the audience also wants to know, what is driving not only the development of the standards, but also your team? You have all been working on standards, codecs and encoder optimizations. Is it all because there's some mission or value that you try to chase after?

[00:53:09 Mission for codec and encoder optimizations]

Benjamin: Makes sense. I guess that's really the main point because it has, really, a global impact, and I guess, like right now we're using it. I guess, everyone uses it on a daily basis. You try to send text messages, now you always include photos that are moving, live photo you call it, or it's basically video. So it's everywhere. You have these reels, these short stories. So video is really everywhere. And by reducing the amount of data there, even if it's just like a tiny little tool somewhere in a standard, you somehow contribute to something that is used on a global scale. And I guess this is one of the big motivations of everyone here, at least I can speak for our department. And for the development of the codec, since you mentioned our VVenC encoder and also the VVdeC decoder.

Also, the open source, that we put the code on GitHub and have patches for FFmpeg available there, that's also a big motivation, because people and our colleagues see it actually being used, and people who have access to it make very good things out of that, integrate it in systems. So it's really good to see that something has a global impact. Is that also true for your motivations?

Adam: I mean, the thing is codec development, it's a very... You would know, right? It's a very interesting field. There are so many things to explore. Each standard has its own thing, and we're a research institute. And we've capitalized, as a research institute, very much on this development, like doing many publications and trying out new stuff. So in that sense, it's also just an interesting research project, in addition to, of course, being an interesting engineering project, and so on.

Benjamin: And it's also good that they brought the CABAC entropy coder to H.264. So if we have a question in that domain, we can just go next door and ask. They're here next door. So it's really great to have all these people with so much experience in the field right next door. So this is really a great opportunity, also, and it's a lot of fun and a big motivation to work here.

Zoe: Yeah, I think you addressed quite a bit of stuff there. So first, video is everywhere, as mentioned. And then, whatever you do on a daily basis actually has a big impact, and you can also feel it, I think there's a feeling of pride there. And also, you just mentioned next door, there are a lot of brilliant efforts there. And I think even now, it's still quite challenging. I see sometimes the quality, at least, here is not as ideal as we expect. So there are always a lot of things that we can improve.

Benjamin: It's so true.

Zoe: Another thing I actually learned is, sometimes, it's a mission for us, or for your team, to actually show the potential, right? Sometimes, people, either technicians or the audience or the end consumers, they take something for granted. For example, text is always distributed instantly, and video is supposed to always, whatever, just give me the clockwise kind of progress, because video has a big amount of data. So I was also thinking, why is that just taken for granted for video? That it has to take time to distribute and communicate with each other. Why not everywhere? It doesn't have to be that way, even though its volume is small, but there's always a way, or another scheme, that we can resolve it.

So I really feel very honored, and this is a great time for us, all seated together, talking about video and compression. Because with compression, we can all focus on delivering the ideal or best possible video experience to every single user around the world, and then, together, we can feel more motivated to move to the next step. And we really thank you, Ben and Adam, for coming to our episode.

This is also the first time for me, at least, to get into VVC. I see there's a lot of value there. What I want to say is that, at least for technologies, VVC actually opens up a lot more possibilities and potential, and it is worth everybody being aware of that. And thanks to Thomas. And also, I think it was a good time for you to recall some of the old times during the HEVC standard. So we're going to close this episode, and thanks to everyone, this was a great time actually. I really enjoyed it. Thank you.

Thank you.

Thanks a lot.

Benjamin: Thanks for having us. It was really a fun talk.

All right, bye.

Thomas: Bye.

Thank you, bye-bye.
