The VideoVerse

TVV EP 23 - Adventures with AV1 for RTC with Eric Sun

January 08, 2024 Visionular Season 1 Episode 22

In this episode, Eric Sun gives a deep dive into Meta's recent experience deploying AV1 for real-time communications across Meta's applications. He describes what proved to be the big advantages of AV1, where Meta saw the most benefit in terms of user experience, how Meta demonstrated improved engagement from better video quality, and the practical challenges that he and his team surmounted.

Watch on YouTube!

[Announcer] Welcome to "The VideoVerse."

Zoe: Hi, everyone, this is Zoe, and welcome to "The VideoVerse" podcast. I'm the host for this episode, and I have Thomas, our team member, joining me from London. Hi, Thomas.

Thomas: Hi, there.

Zoe: Yeah, as the co-host of this episode. For this one, we've actually invited Eric Sun from Meta's Remote Presence team to join us as a guest speaker. Hi, Eric, would you like to introduce yourself to our audience?

Eric: Okay. Hi, my name's Eric, or Yu-Chen Sun, and I'm on Meta's Remote Presence Video team. I'm mainly working on improving video quality in RTC for the Meta family of apps, including the codecs and the video pipeline in our RTC infrastructure. Before joining Meta, I worked at Alibaba Cloud, where I was working on cloud transcoding. And before that, I was with Qualcomm and MediaTek, where I participated in MPEG standards meetings and developed coding tools that were adopted into the standards to improve codec performance. So that's me. My personal interest is mainly in video quality, the video pipeline, and codecs.

Zoe: Oh, well, thank you for joining us. So I'm just wondering, you're currently with the Meta team doing real-time communication support on the video side. What kind of Meta apps are you and your team currently supporting?

[00:02:13 What Meta apps the team is supporting]

Eric: Okay, we're an infrastructure team, so all the Meta apps that have RTC features likely use our infrastructure, including Messenger, which is the main one. We also have a video calling feature in Instagram. And last year we launched the calling feature in the Facebook main app as well. Before, when you called a friend or received a call from the Facebook main app on a mobile device, the phone would launch a separate app, Messenger, to start the call. But now we've integrated the calling function into the Facebook main app itself, so that also uses our infrastructure.

So those are the most typical use cases for our infra. But in addition, there are a lot of other features using our infrastructure. One interesting application to share: you know, we're focusing on improving the metaverse and the VR and AR products and user experience, and any of those features that need real-time, low-latency video also use our infra. One interesting example is a product we call Remote Desktop. The idea is that when you wear a VR device, you can work in a virtual room: you type on your real keyboard, but the screen is a virtual screen in the virtual world while you're wearing the headset. The communication layer is real-time screen sharing at ultra-low latency from your laptop to your device. That product, Remote Desktop, also uses our infrastructure.

Zoe: Oh, I see. What about WhatsApp, if I may ask? Are you also supporting that?

Eric: Oh, that's interesting, yes. Well, we work closely with WhatsApp, but right now the user space is different; the user IDs are separate. So we work closely, but WhatsApp is the one application using a different infrastructure right now. Yeah, but we work closely.

Zoe: Got it. For example, from our side as users, we notice that Messenger and WhatsApp seem to target different user groups.

Eric: Yeah.

Zoe: And I think there's one reason we invited you. As I mentioned, last Thursday, November 30th, there was a big event organized by Meta, Video @Scale, and a lot of people joined. I watched it; it was a great event, with a lot of good talks from Meta engineers and engineers from other companies sharing their experiences, and you could feel the market trends in video technology. You gave a talk there, so we invited you so we could discuss it a bit. Maybe to start, can you give a brief intro to your presentation from last week's Video @Scale?

[00:06:10 Brief intro about Eric's presentation at Video @Scale]

Eric: Okay, yeah, sure. I gave a presentation at the @Scale event that week. In my presentation, I shared our experience adopting AV1 in our infrastructure. You know, AV1 is a new codec; before this upgrade, we used H.264, and recently we've been working on upgrading our codec to AV1. That's what I presented at the Video @Scale event. In the presentation, I shared some results on the benefits of adopting AV1, as well as some challenges we faced. So maybe I can quickly go over what I presented at that event.

For sure, we see several benefits in the Messenger app when we upgrade to AV1. The first benefit is better video quality, because AV1 has better video compression, so better video quality; that's expected. But I showed some demos in my presentation, video comparisons between AV1 and H.264, and it's very impressive; the quality difference is really large. One thing I especially want to mention is that we compared the quality at different bitrates, for example 100 kbps, and at an extremely low bitrate, like 55 kbps.

Zoe: That's pretty low.

Eric: Yeah, very low. And it's very surprising that at the low bitrate, the quality difference is actually more obvious. That's really good. Before, I expected AV1 to provide better quality at high bitrates, because a lot of the coding tools are optimized for high resolution, like larger block sizes and more block partitions, things like that. So I wasn't sure whether we could get a good quality improvement at low bitrates, especially in an RTC app, where the resolution is usually not as large as VOD or movies.

But I'm happy to see that we actually get a good benefit, a real quality improvement, at low bitrates. My guess at the reason we see the big improvement at low bitrate is that, in addition to the more advanced coding tools, AV1 also has some very useful new functionality. The one tool I want to highlight is what we used to call Reference Picture Resampling; I know that in VP9 or AV1 there's another name for that tool, which I forget, but the idea is that it enables inter prediction between frames with different resolutions. We found that that is extremely important for RTC, especially for the-

Zoe: Is that the reference frame, Reference Frame Resolution Adaptation?

Eric: Yeah, that could be the name. I was with MPEG before, and in MPEG meetings we call it RPR, Reference Picture Resampling. When I talk to the AOM folks, I notice the name in AOM is a little bit different, but the concept is the same.
So the thing is, that tool enables the sender, the encoder, to adapt quickly to the network. When we talk about adapting to the network, two issues are important here: one is changing bandwidth, and the second is packet loss. And to handle changing bandwidth, the sender usually has to adjust the resolution and the frame rate during a call.

But the challenge is that for most codecs, a change in resolution usually means the encoder has to encode a keyframe, and encoding a keyframe introduces a bitrate spike.

Yeah, the keyframe. We're codec people, so we know that a keyframe is usually much larger than a P-frame. We can cap its size, but then there will be some quality inconsistency between the keyframe and the P-frames. So it's a fundamental, challenging issue: how to do rate control between keyframes and inter frames, right?

So that's always a challenge for us. We did some study, and the number of keyframes generated because of bandwidth changes is not trivial in our product, so it's a real issue for us. And more keyframes likely means more video freezes. We know there's a pacer in the middle, so maybe the issue can be mitigated, but having keyframes is still a challenge, especially on low-bandwidth or unstable networks. And back to AV1: that coding tool enables inter prediction between frames with different resolutions, which is super useful for us.

So we can have less overshoot and also better quality, because usually a P-frame's quality is better than a keyframe's, since we can use inter prediction. So that's another benefit we observed from adopting AV1, especially useful for RTC.
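
To make the benefit described here concrete, below is a minimal, illustrative sketch of a sender-side bandwidth-adaptation step. It is not Meta's pipeline; the resolution ladder, bitrate floors, and object names are hypothetical. The point is simply that a codec with cross-resolution inter prediction (the AV1-style reference scaling Eric mentions) can change resolution without forcing a keyframe, while a codec without it has to pay the keyframe cost.

```python
from dataclasses import dataclass

# Illustrative only: hypothetical sender-side adaptation logic, not Meta's code.
# The resolution ladder and bitrate floors below are made-up numbers.
LADDER = [(1280, 720, 1200), (960, 540, 700), (640, 360, 350), (320, 180, 90)]


@dataclass
class EncoderState:
    resolution: tuple = (1280, 720)
    force_keyframe: bool = False


def pick_resolution(bandwidth_kbps):
    """Largest ladder rung whose bitrate floor fits the estimated bandwidth."""
    for w, h, floor_kbps in LADDER:
        if bandwidth_kbps >= floor_kbps:
            return (w, h)
    return LADDER[-1][0], LADDER[-1][1]


def on_bandwidth_update(enc, bandwidth_kbps, cross_res_prediction):
    """React to a new bandwidth estimate from the congestion controller."""
    target = pick_resolution(bandwidth_kbps)
    if target == enc.resolution:
        return
    enc.resolution = target
    # With AV1-style reference scaling, the next frame can still be inter-predicted
    # from the previously coded (differently sized) references: no keyframe, no spike.
    # Without it, the resolution switch restarts prediction and costs a keyframe.
    enc.force_keyframe = not cross_res_prediction
```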

And the third thing we observed is screen sharing. Screen sharing is a very important use case for us; in video calls, people often share their screens. And there are some apps, like the Remote Desktop I just mentioned, that are pure screen sharing.

Zoe: That's right, you mentioned that even with the Quest, right? With VR, you also have that kind of virtual desktop in your view.

Eric: Yeah, that's true. So screen content, screen sharing, is an important scenario for us. We did some study, and even for typical Messenger calls, the traffic from screen sharing is not trivial, which means a lot of users are using screen sharing in our calls, so it's a scenario we care about.

But the challenge with screen sharing is that its keyframes are usually much larger than P-frames, and even larger than the keyframes of normal camera video, because there are a lot of sharp edges from text content in screen sharing. So it's very challenging for us. And with screen sharing, we care about latency, which means that if you have a large keyframe, either you use rate control to restrict the keyframe size or you have to accept some latency.

And the latency isn't just at the start of the call; when there's packet loss and loss recovery doesn't work, you have to retransmit a keyframe, and the delay from a large keyframe is very annoying. So, back to AV1: AV1 has two screen content coding tools, palette mode and intra block copy, which can significantly reduce keyframe size. That's also super useful for our use case. And one thing I feel is worthwhile to mention here: if we compare the two coding tools, palette mode and IBC, usually IBC is better in terms of BD-rate, right?

Zoe: So basically, one is palette mode, the other is Intra Block Coding, IBC for short, right?

Eric: Yeah, thank you. I mean Intra Block Copy, IBC, yeah. So if we compare the PSNR or bitrate saving from those two tools, IBC provides a larger bitrate saving than palette mode. Some rough numbers in my mind: for some text content, IBC can save roughly 40 to 50% of the bitrate, while palette mode can save maybe half of that, like 20% to 30%. So in terms of the absolute number, IBC has the bigger improvement. The straightforward idea would be that if we can only choose one coding tool, maybe we should choose IBC, Intra Block Copy.

But it's interesting that in our experience, we found we actually want to have both.
And if we had to choose only one, personally, from our product experience, we'd prefer palette mode. That's because we've observed that, rather than looking at the absolute numbers, palette mode actually preserves perceptual quality better than IBC. It's by nature: palette mode's design concept is to describe the block with a few major colors and use those major colors to represent the pixels. So by nature, the tool can preserve the sharp edges of text.

Palette mode also introduces some distortion, of course; the distortion from palette mode is that the colors may change, because we quantize the colors to a few major colors instead of the full spectrum. But as long as we have enough major colors, the text edges can be preserved.

Zoe: Can be well-preserved by that, yeah.

Eric: Yeah, well preserved. And that's especially good for text content, because some color distortion is annoying, sure, but compared to blurred edges where some text can't be recognized, users prefer sharp edges. If you have to choose one distortion, we found our users prefer color distortion as long as the edges stay sharp, because they can still recognize the text. So that-

Zoe: So for text content, they prefer sharp edges over preserving the true colors. That's the user feedback, yeah.
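
As a rough illustration of that trade-off, the toy sketch below mimics the spirit of palette mode: it snaps every pixel of a block to one of a few dominant values, so the boundary between text and background stays crisp even though the exact colors may shift. This is only an illustration of the idea, not the AV1 palette-mode syntax or algorithm.

```python
import numpy as np

def palette_like_quantize(block, num_colors=4):
    """Toy palette-style coding for a grayscale block.

    Each pixel is mapped to the nearest entry of a small palette of dominant
    values. Colors can shift (quantization error), but because pixels snap to
    a few flat values, edges between text and background remain sharp.
    """
    # Pick the palette as evenly spaced quantiles of the block's values.
    quantiles = np.linspace(0, 100, num_colors)
    palette = np.percentile(block, quantiles)
    # Map each pixel to the index of its nearest palette entry.
    indices = np.abs(block[..., None] - palette[None, None, :]).argmin(axis=-1)
    return palette[indices], palette, indices


if __name__ == "__main__":
    # Synthetic "text on background" block: dark glyph strokes on a light field.
    block = np.full((8, 8), 235.0)
    block[2:6, 3] = 20.0      # a vertical stroke
    block[4, 3:6] = 20.0      # a horizontal stroke
    recon, palette, idx = palette_like_quantize(block, num_colors=2)
    # The reconstruction uses only palette values, so the stroke edges stay sharp
    # even if the exact gray levels moved slightly.
    print("palette:", palette)
    print(recon.astype(int))
```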

Eric: Yeah, yeah, that's what I want to say. And for AV1, it's perfect that the baseline profile supports both screen content tools. In HEVC, we also have these two coding tools, but the problem for us is that they're in the screen content coding profile, so hardware support may be an issue: a hardware vendor may support the baseline profile, but the screen content coding profile is not always supported by the hardware chip.

And hardware support for AV1 is still ramping up; not all devices support hardware AV1 yet, but I do see more and more hardware vendors announcing AV1 support. The good thing is that those two screen content coding tools are required in the baseline profile. So as more devices get hardware AV1, we can be sure those two coding tools, which are important for RTC, will be supported by the hardware codec by default. But I've been talking a lot, sorry.

Thomas: So one thing I wanted to ask you about, just to bring you back to your initial breakdown of the reasons you investigated AV1, is that you're very interested in low bitrate and resolution adaptation. I remember, in my previous role at Webex when we were doing RTC coding, there was often a debate about what a new codec should do: should it provide improved quality and higher resolution, or should it allow you to reduce the bitrate? How do you see the emphasis between the two? Being able to use a lower bitrate might mean you could use 1080p instead of 720p. Is the emphasis within Meta much more on supporting mobile applications, where you very often have much lower bitrates and more difficult connections?

Eric: Okay, so I think it depends. We do have different user categories; there are high-end users and low-end devices. I may not be able to share the distribution of users, but I can say that we care about both the low end and the high end. We also have a lot of projects focusing on improving the high-end experience; for example, we're pushing high-resolution video calls as well. That's another challenge: to enable high-resolution video calls, we have to make sure the call is capable of using high resolution, so there's a lot of interaction between the video pipeline settings and the BWE, things like that. But in short, yes, we care about both; we do have use cases for the high end.

Zoe: Right, so basically you care about the whole range, right? The high end as well as the low end. Even though you just mentioned that AV1 shows good performance at the lower end, low bitrate and low bandwidth, on the high-end side you also care about quality and push for high-quality RTC communication.

Eric: Yeah, and another thing is that in addition to typical video calling, we also have some special use cases. For example, back to that Remote Desktop case: the network there is the local network, so the network condition is very good, and in that case we definitely need very high-end video because we're targeting the virtual screen. We're even looking at whether it's possible to share a 4K screen with extremely low latency. That's also challenging. Before, we thought, okay, it's a perfect network, so maybe codec efficiency is not that important because the bandwidth is very high, but actually it's still important if we want to push to 4K. And the virtual screen is a two-way interaction where you care about latency: when you're typing, the application cannot accept a large delay. The latency requirement is much stricter than a normal video call.

So in that case-

Thomas: Yeah. So there are quite a few specialist products for this kind of local remote desktop, you know, wireless screens and things, and people usually develop much, much simpler codecs than AV1. So it's interesting to me that you're thinking about AV1 in this application domain, where lots of the competitor products, maybe doing hundreds of megabits per second on a local wireless link, are using something very, very simple. So how do you see that balance? Is complexity going to be a real issue doing 4K screen share with AV1?

[00:26:55 Is complexity going to be a real issue doing 4K screen share with AV1?]

Eric: Okay, yeah, I can understand why people prefer a simple codec. The reason, in my mind, is that if we care about end-to-end latency, there are two contributors: one is the transmission latency from sending the data over the network, but if the codec itself is too complicated, then encoding latency will dominate the end-to-end latency. So you're right, it's a balance between these two.

Zoe: So processing delay as opposed to the network communications delay, yeah.

Eric: Yeah, exactly. So in this case, I foresee that AV1 will become more popular once we have hardware encoder support to deal with the encoding latency issue. But actually, this is a very good example use case for AV1, or for Reference Picture Resampling: although the average bitrate of the local network may be good, there can be bitrate changes in the middle of a call, and again, we care about latency, so we need very quick adaptation to the network. Changing resolution is very important here, but if the resolution change results in a keyframe, there's extra delay. So you see, this local screen sharing is a perfect use case for Reference Picture Resampling.

Thomas: Yeah, so I think for people who don't work in the RTC space, it's really important to understand how painful keyframes are and how much trouble they cause if you're on a video call or doing something because it's not just that you have to send a lot of data very quickly because you can't predict it from what came before, it's that you might send too much data for the network. And so you might have to send it again. So you send your keyframe, and it's too big, and you get more packet loss. So you reduce the size of the keyframe, and you try again, and it's still packet loss, so you keep on going. So conventional codecs can cause huge pain, can't they, in these kind of network scenarios, where it's intrinsically uncertain how much bandwidth you actually have when network conditions are unstable?

Eric: Yeah, that's true. And another perspective is that in RTC, we want to keep end-to-end latency low, so we cannot do much buffering on the receiver side. In traditional VOD applications, I guess the way they deal with keyframe overshoot is with buffering on the receiver side, which mitigates the bitrate spike. In RTC we can't have that, so we care about VBV delay much more than in the VOD use case.

Zoe: And the VBV buffer may have to be smaller in RTC compared to that.
So just now you talked quite a bit about the advantages of deploying AV1 for RTC. But the pros always come along with some cons, right?

So we'd like to hear about that. During the process of deploying AV1, what challenges have you faced?

[00:31:11 Challenges to deploy AV1]

Eric: Yeah, okay. During the development process, we did have some learnings and encountered several issues. The first issue is binary size.

Zoe: Binary size, the binary size of the app, right?

Eric: The app, yeah. Especially for mobile apps, we care about binary size. The problem is that when we upgrade to a new codec, we have to include the codec library in the app, and the library size is a main concern for us. The reason we care about app binary size is that it correlates directly with app update success: if the binary size is too large, users may not be able to update to the new version and have to keep using the outdated version. Binary size also affects things like the app start rate.

And in general, software health metrics such as crash rate and memory usage are usually also related to binary size. So at Meta, every time we include a new feature, binary size is the first thing we have to evaluate. Take libaom as an example: we observed that when we integrated libaom into the app, the binary size was very large, about 600 KB after compression, and before compression it's much larger. Usually the first thing we evaluate is the size after compression, and I'd say 600 KB is really large. For comparison, the binary size of OpenH264 after compression is about 200 to 300 KB.

Zoe: So 200 to 300 KB as opposed to 600 KB; at least double.

Eric: Yeah, it's double. So that's the first thing, and we have several ways to mitigate it. First, if the binary size is too large, at Meta we have a dynamic download framework, which means we don't ship the library with the app; instead, when the user first starts the app, we dynamically download the library, and that can mitigate the issue. Usually that's the first go-to for us when we want to integrate a library with a large binary size. But the challenge is that, in our experiments, dynamic downloading is not perfectly reliable. We did see some download failures, perhaps due to bad networks or device issues, and those download failures impact the user experience.
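
A minimal sketch of the kind of fallback logic being described, with entirely hypothetical function names: offer AV1 only when the dynamically downloaded library is already on the device, and otherwise retry the fetch in the background while the current call negotiates H.264.

```python
# Illustrative sketch of "dynamic download with fallback"; all names are hypothetical.

def codecs_for_call(av1_library_ready, download_av1_library):
    """Return the codec preference list for the next call.

    av1_library_ready: True if the dynamically downloaded AV1 library is present.
    download_av1_library: callable that kicks off (or retries) the download in
    the background; it must not block call setup.
    """
    if av1_library_ready:
        # Offer AV1 first; H.264 stays in the list for negotiation fallback.
        return ["AV1", "H264"]
    # Library not on disk yet (first launch, failed or interrupted download):
    # start or retry the fetch for a future call, but don't make this call wait.
    download_av1_library()
    return ["H264"]
```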

So instead of relying on dynamic downloading, we also want to optimize the binary size itself. We did some investigation into libaom and found some things we can optimize. For example, one thing we observed is that the current implementation stores the quantization matrices, what we call the QM tables, as precomputed data, and they take about 10% of the total library size.

The current quantization matrix implementation in libaom is not perfect, so we did some optimization on it and were able to save half the size, reducing the quantization matrix data from 60 KB to 30 KB. We're thinking of upstreaming that improvement to libaom; it's still being planned, but the point is we have optimizations that can reduce binary size.

And right now we're in the process of developing AV2, right? So Thomas and I, and Google, have a proposal to AOM for AV2: if we can further revise the quantization matrix design, then based on our proposal the binary size overhead can be cut by up to 99%. I mean, we'd only need 1% of the binary size compared to the quantization matrices in AV1.
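
The quantization-matrix discussion is essentially about "store big tables" versus "derive them at run time from something small". The sketch below illustrates only that general space-saving idea, deriving matrices for several sizes from one small stored base matrix; it is not the actual AV1 QM tables and not the AV2 proposal, and all the values are made up.

```python
import numpy as np

# Purely illustrative of "derive instead of store": not the AV1 QM design
# and not the AV2 proposal, just the space-saving idea behind it.

BASE_QM_8x8 = np.array([  # one small stored base matrix (made-up values)
    [16, 16, 17, 18, 20, 22, 25, 28],
    [16, 17, 18, 20, 22, 25, 28, 32],
    [17, 18, 20, 22, 25, 28, 32, 36],
    [18, 20, 22, 25, 28, 32, 36, 41],
    [20, 22, 25, 28, 32, 36, 41, 46],
    [22, 25, 28, 32, 36, 41, 46, 52],
    [25, 28, 32, 36, 41, 46, 52, 58],
    [28, 32, 36, 41, 46, 52, 58, 65],
], dtype=np.float64)


def derive_qm(size):
    """Derive a size x size matrix from the 8x8 base by nearest-neighbour stretching.

    Storing only the base plus this tiny function stands in for storing a
    precomputed table for every transform size.
    """
    idx = (np.arange(size) * 8) // size   # map output rows/cols onto base rows/cols
    return BASE_QM_8x8[np.ix_(idx, idx)]


if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        print(n, derive_qm(n).shape)
```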

Zoe: Oh wow, 1% of the original, a 99% reduction.

Eric: Yeah, a 99% reduction. So that's the binary size challenge. We can also do some system-level or app-layer optimization. For example, in addition to RTC, the app may have other features, such as video messages, which do transcoding on the device and also need a codec, so we can make those features share the same library; that optimizes the overall app binary size. Another way is to use the built-in codec support on the device: for example, if a new Android phone already has the codec support, we can directly use the device's codec instead of adding a new library to our app.

So there are a couple of ways to mitigate binary size, but in short, we care about binary size; it's something we have to think about if we want to design a new codec. From a product perspective, binary size is an issue. And in addition to binary size, memory usage is also an issue.

One thing I can share is that in the first few rounds of our A/B tests, we saw an app crash regression: more crashes in AV1 calls compared with H.264. Digging into that, we found those crashes were out-of-memory crashes. At the beginning, we suspected there might be a memory leak in the AV1 library, but after some investigation, we found that wasn't the case. We suspect that for some users, there may be other background apps consuming memory.

But the issue is that if the codec uses more memory, there's a higher chance of running out of memory and triggering a crash. Which means that increased memory usage, even with no memory leak, will still result in some crash regression from the app's perspective.

Zoe: Yeah. What exactly caused the higher memory consumption for AV1 as opposed to H.264, for example?

Eric: Yeah, I think it's because there are more coding tools and we need more buffers, right? So by nature, we should expect more memory in a software implementation that supports the new coding tools.

But the thing is, it would be great if we could reduce any redundant or unused memory through software optimization.

So that's another challenge for us. And the third one is that AV1 needs more computing resources. That's not surprising because-

Zoe: Right, yeah. I have to say, even though other people have already talked about deploying AV1, and we did something similar previously, Meta is really leading the trend here, especially with UGC and such a huge user base. And you gave a talk saying Meta is deploying AV1 for real-time communications without hardware encoding available, relying on software implementations even on mobile devices. That's a huge message to the community, because many people still have the impression that AV1 is very complicated: there are a lot of coding tools that make it really good, but it's so complex, so how could it run on mobile in real time, right?

So when this topic comes up, people ask, "Yeah, what about complexity? What about computational resources?"

[00:41:43 What about complexity? What about computational resources?]

Eric: Yeah, that's true. During the development process, that was really a challenge for us. I mean, we can select devices; for example, our first step was to enable AV1 only for high-end devices. And you're right, RTC is more challenging than UGC or VOD: in RTC, in addition to decoding, we also need to do real-time encoding. That's the challenge. Group calls are another thing: if we want to support a large group call, 50 or even more participants, then decoding becomes the bottleneck, depending on your UI layout. If you really want to decode a lot of participants in the same UI, decoding becomes challenging.

Zoe: Right. We still want to emphasize this message to our audience, even though we take it for granted: in video encoding and decoding, the encoding side is a lot more complicated than the decoding side. The encoder, in almost all cases, consumes more computational resources and power than the decoder. But as you just mentioned, there's the group call case.

You have only one encode, right, to encode your local video, but you have many incoming streams that need decoding, and when the number of decoding instances increases, decoding also becomes an issue.

Eric: Yeah, that's true. That's the reason I find real-time communication such an interesting topic: we have a lot of scenarios and use cases to deal with, and each use case has its own challenges. In a P2P call, the challenge is encoding, but in a group call, decoding becomes even more challenging. So, going back to adopting AV1 and how its complexity affects RTC in our experience: you're right that for high-end devices we may be able to ship it, but we also want to cover mid-range devices.

So let's go step by step. For high-end devices, we assumed it wouldn't be that challenging, but as I mentioned before, even on high-end devices there were some challenges that surprised us. For example, the memory usage resulting in app crashes was one surprise. To deal with that, the way we mitigate the problem is that we designed a codec selection mechanism, because we found that memory usage is related to resolution, right?

So the way we mitigate the problem is that we use AV1 only at the lower and mid resolutions, and for the higher resolutions, we fall back to H.264. That mitigates the memory regression, the crash regression.

And in general, a hardware codec is preferred for high resolution because of power consumption. So personally, I also believe it's a good idea not to use software AV1 for high resolution until hardware codec support becomes available. So that's one thing-

Zoe: Yeah, we know that Apple's new phones have started to support AV1 hardware decoding. It will take a while, but the coverage of AV1 hardware decoding support will get larger and larger down the road.

Eric: Yeah, and we are also looking for hardware encoding support, yeah.

Thomas: That usually takes a lot longer.

Eric: Yeah, that's true. Especially in our experience, in the first few releases of a hardware encoder, the implementation is usually optimized for offline transcoding, not for RTC. It may take several iterations for a hardware vendor to optimize their implementation for RTC.

Thomas: Yeah. And also, hardware encoders for RTC can have a real problem with the very challenging content and rate control that you have at very low latency. Basically, it seems to be a fire-and-forget kind of rate control: you throw a frame at the encoder, hope it comes out the right size, and it makes some adjustment. That's very difficult to apply to RTC, I think.

Zoe: And sometimes we may have to do some re-encoding because it's not quite accurate, but that also impacts the delay. So we just hope hardware support will come down the road. But at this moment, it's fascinating for us to listen to how you tackle this. You mentioned there are some optimizations you can do for AV1, and that in some cases you actively switch, falling back from AV1 to H.264, so you actually maintain two codec formats inside one app.

Thomas: So would you do that dynamically in a call if things change: if the resolution changes, or the bitrate changes, or maybe if the local power consumption gets too high? Are these the sorts of things you do?

Zoe: Oh, okay, so basically, to Tom's point, it's not only adapting to the network condition but also adapting to the local power consumption, for the switching.

Eric: Yep, that's true. We actually did that, and we found it very important. One experience I want to share: I just introduced the mechanism of codec selection based on resolution, and we also have the list of high-end devices. So as long as we have a high-end device and that mechanism selecting the codec based on resolution, in most cases we can get a reliable AV1 call.

But we found some issues on some high-end devices: freeze regressions, even some crashes, some bad user feedback. After investigation, we found that although a device is nominally a reliable high-end device, it may sometimes still have issues encoding AV1. For example, there may be a background app running on that high-end phone, draining the battery or using a lot of CPU, and the device overheats, so it has to reduce its clock frequency.

In that case, the device may not be able to encode in real time during certain periods of the call. In my presentation, I showed some data we collected from Messenger, encoding latency data, and what was surprising is that for some devices, the encoding latency is really close to the input frame interval, which means those devices likely have an issue with real-time AV1 encoding. So this relates to what Thomas suggested: we've developed a mechanism that, in addition to resolution, takes more inputs from device health measurements.

For example, encoding latency is one input, and power consumption or even battery level could be another good one. So my learning is that if we want to increase AV1 coverage and have reliable AV1 calls on mobile devices, this kind of codec selection, considering the device condition to pick the best codec and settings, is very important.
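
A rough sketch of the kind of selection policy being described, with hypothetical thresholds and signal names: prefer AV1 only when the resolution is modest and the device looks healthy (recent encode times comfortably inside the frame interval, battery and thermal state OK), and otherwise fall back to H.264.

```python
# Illustrative selection policy; thresholds and signal names are hypothetical.

def choose_codec(width, height, frame_interval_ms, recent_encode_ms,
                 battery_level, thermally_throttled, has_hw_av1_encoder):
    """Pick a codec for the next segment of the call from device-health signals."""
    if has_hw_av1_encoder:
        return "AV1"                  # hardware encode sidesteps the CPU cost
    if thermally_throttled or battery_level < 0.15:
        return "H264"                 # protect the device and the talk time
    if width * height > 1280 * 720:
        return "H264"                 # software AV1 reserved for low/mid resolutions
    # Require headroom: encoding must comfortably fit the frame interval,
    # every frame, not just on average.
    if recent_encode_ms > 0.7 * frame_interval_ms:
        return "H264"
    return "AV1"
```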

Thomas: I was just going to remark. I mean, it's a general thing. I find that as codecs get more and more advanced and more and more complex, you can still make them very, very fast. But there's more variation between the fastest and the slowest. The worst case complexity becomes harder to manage because there are so many tools. So you have to put a lot of effort in an implementation into that kind of thing. For RTC, you need to hit deadlines. You can't just be fast enough on average. You actually have to be fast enough all the time, every frame.

Zoe: Yes, for every single frame, right? You have to be there. The granularity is pretty small; you can't just say, "I have a sliding window, I'm okay, I input 30 frames per second and output 30 frames per second on average." You really have to make sure every single frame finishes within those 33 milliseconds, for example, to make it truly real time for every single frame.

Eric: Yeah, that's true. And I want to mention another challenge we're thinking about: okay, now we have a good codec selection mechanism, but there's a penalty, an overhead, from the codec switch. You can imagine that once we switch codecs, the encoder needs to encode a new keyframe, and in general we don't want keyframes, with all the issues we discussed. So what I'm thinking these days is that the reason we have to switch between codecs is encoding complexity, right? So I feel it could be a good idea for the AV1 encoder to have an ultra-low-complexity mode; the best scenario would be that this ultra-low-complexity preset has encoding complexity similar to an H.264 encoder-

Then, if we have that mode, when a device doesn't have enough computing power, instead of switching to H.264 we can switch to that ultra-low-complexity preset and avoid the codec switch. One thing is that for such an ultra-low-complexity preset, the coding efficiency may not be good, because to get there we may have to disable a lot of the good coding tools in AV1; syntax-wise, though, we'd still have low syntax overhead. It's even possible that the compression efficiency of this ultra-low-complexity mode is worse than a normal H.264 preset. But the mode is still valuable, because with it we can avoid encoding those keyframes. So from the pure compression perspective the efficiency may be worse, but from the system perspective, because of fewer keyframes, the overall performance may improve.
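
Continuing the same idea in sketch form: when the encoder stops keeping up, first climb to a faster, lower-complexity AV1 preset, and only switch codecs, which costs a keyframe, as a last resort. The preset scale and field names here are placeholders, not any specific encoder's API.

```python
# Illustrative: prefer raising the AV1 speed preset over switching codecs,
# because a codec switch forces a keyframe. Preset values are placeholders.

FASTEST_PRESET = 10   # hypothetical "ultra-low-complexity" end of the scale


def adapt_to_cpu(enc, recent_encode_ms, frame_interval_ms):
    """Called periodically with the recent per-frame encode time; enc is a dict."""
    overloaded = recent_encode_ms > 0.9 * frame_interval_ms
    if not overloaded:
        return enc
    if enc["codec"] == "AV1" and enc["preset"] < FASTEST_PRESET:
        # Stay in AV1 but trade coding tools / efficiency for speed.
        # The bitstream continues, so no keyframe is needed.
        enc["preset"] += 1
    else:
        # Already at the floor: switch codecs and accept one keyframe.
        enc["codec"] = "H264"
        enc["force_keyframe"] = True
    return enc
```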

Zoe: Right, yeah. Considering the time limit for our episode, I still want to bring up the one question everybody asks about deploying AV1 on mobile devices. I know we've already touched on it, but people wonder about the power consumption and the battery life when you deploy AV1. What about heating up? What about the user experience? You get better quality at lower bandwidth, but what if the device heats up and consumes too much power, which shortens the battery life and makes the eventual experience worse? How do you look at that? I know Meta really cares about user experience.

[00:55:54 The challenge involves excessive power consumption, leading to shortened battery life]

Eric: Yeah, that's true. I think power and battery regression is something we have to accept anyway if we want a better, more advanced codec, because there's more encoding complexity. We did some investigation into the power impact on the app. The standards committees have done a lot of studies on how much encoding time increases between a new codec such as AV1 and H.264, and those numbers are available in the standards studies and a lot of research papers.

But what we care about here is the power impact from the app's perspective, because encoding is just one part of the app, and likely not the major part; most of the power goes to things like the display and network transmission. But to get a number, we measured the power increase of an AV1 call on a Pixel phone. How we did that: we took a phone, removed the battery, and connected the phone to a power meter.

So the phone is powered by the power meter. Then we start a call and measure how much power the phone consumes using the meter. In our test on the Pixel phone, we found the power increase of AV1 compared to H.264 is about 4% to 11%, and the number depends on which encoding preset the phone is using. But in general, a 4% to 11% power increase, which is not a trivial number.

Zoe: You mentioned the 4% is the overall?

Eric: Overall.

Zoe: Oh, it's not just comparing the power consumed by the H.264 encoder?

Eric: No, it's the overall app.

Thomas: So the majority of the power consumption comes from the app and keeping the phone alive, those kinds of things.

Eric: Yeah, and actually the majority of the power is from the display; if you change the display settings, you'll see the power consumption change a lot. But the number I presented is the overall power. As I mentioned, we-

Zoe: Switching from H.264 to AV1 will incur about 4% more power consumption.

Eric: Yeah, and that power number is measured for the whole phone, so if any background app is running, it also contributes to the number; the power is for the phone. But in our local test, we made sure no background app was running; it's possible there were some threads running in the OS, which we don't have control over. Still, the comparison is fair: the AV1 call and the H.264 call use exactly the same settings, and we keep the phone as clean as possible, with no background apps.

And in that case, we see the whole-phone power consumption increase by 4% to 11% between AV1 and H.264. That number relates directly to battery life: how long a call you can make before you run out of battery. At Meta, this number is very important, because we've learned that if a new feature regresses battery consumption by 1%, the talk time will likely regress by about the same amount.

So power consumption, battery usage, is directly related to user talk time. That's why we care about power consumption very much. But luckily, the improved quality and the reduced freezes also increased the talk time. So what we need is to find a-

Zoe: It's a compromise.

Eric: Yeah, there are pros and cons. What we need is to find a good trade-off between power and the quality and freeze improvements. That's why we invested in a lot of mechanisms, like the codec switch and the device list, to find a way to filter for phones capable of using AV1. All of this effort is about finding a good trade-off to enable AV1.
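
Since talk time scales roughly inversely with whole-phone power draw, the measured 4% to 11% power increase translates to roughly a 4% to 10% shorter call on a full charge. The back-of-the-envelope below shows the arithmetic; no device-specific numbers are assumed.

```python
def talk_time_change(power_increase):
    """Relative talk-time change if whole-phone power draw rises by `power_increase`.

    Talk time is roughly battery_energy / power_draw, so a relative power
    increase p shrinks talk time by a factor of 1 / (1 + p).
    """
    return 1.0 / (1.0 + power_increase) - 1.0


if __name__ == "__main__":
    for p in (0.01, 0.04, 0.11):
        print(f"{p:.0%} more power -> {talk_time_change(p):+.1%} talk time")
    # 1% more power  -> about -1.0% talk time (the rule of thumb mentioned above)
    # 4% more power  -> about -3.8%
    # 11% more power -> about -9.9%
```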

Thomas: So do you have other metrics that you use to evaluate the success of a codec? You talked about talk time. Are there other engagement metrics that are applicable?

Eric: Yes, we do. Our top priority is engagement, actually, like talk time. And also, when you try a Messenger call, you'll see that after you end the call, we have a user survey where you can give a rating; the user rating is another metric we care about. One interesting thing I presented is that when shipping AV1, we do see user ratings improve.

So that's strong evidence that AV1 does improve our user experience. But that's a top-line user experience metric; in addition, we also have technical metrics to measure quality. For example, freezes are one, and another is a PSNR metric, which I'd like to take some time to discuss.

Zoe: Yeah, definitely. Exactly; I think during the talk you also mentioned that because this is real-time communication, you want some metrics to actually quantify the experience.

[01:03:34 Reliable quality metrics in RTC pipeline]

Eric: Yeah, yes. I really want to discuss this with you two, because you're video experts, and especially Thomas, you're very familiar with RTC. One challenge we have is that in the current RTC industry, to the best of my knowledge, there's no reliable quality metric in the RTC pipeline. Take WebRTC as an example: WebRTC reports the QP and the resolution, and that's more of a quality approximation, because QP is a quantization level; it may work for comparing quality within the same codec.

But if you want to compare quality between different codecs, QP doesn't work, because the quantization design is different, not to mention that different codecs have different loop filters, which also contribute to the quality difference. And another challenge for us is that in an RTC call, as I mentioned, we change the resolution during the call. So it's very tricky: with WebRTC, when we compare two calls, we get two pairs of numbers for resolution and QP, but the question is, how do you compare them?

Thomas: Yeah. It's basically impossible. I remember we went through this when I was with Webex, and there were lots of issues, because during the pandemic there was big competition between the different video conferencing vendors: was Teams better, was Zoom better, was Webex better, in terms of video quality? You would get reports comparing them, but it's incredibly difficult, because you'd try to take, say, PSNR measurements, but you have to sync up the frames between the output and the input, and there were all kinds of very strange methodologies people tried.

I think the short answer is that we really need to develop a good reference-free metric that doesn't look at the original video. I think that's the way to go for RTC; that would be my opinion, because the user doesn't have the reference. And if you're changing the resolution the whole time, it's very difficult to define a good reference-based metric, though there have been some attempts. And when you have different codecs, you have different artifacts as well, which may score better or worse with different codecs.

So I think it would be good if there were an industry standard for a single-ended metric. There are tools that I'm sure the video conferencing providers use and experiment with, but to get a single number, I think it needs to be on the received video. And there are lots of other things: intra frames, for instance, affect your perception of video quality. Even if the average PSNR is really high, the worst-case video may be what dominates the user experience: if users get one or two seconds of terrible, glitchy video, they think the whole call is bad, even if the PSNR is really high overall.

Zoe: I think this topic has been around for a while; we all know there's a standards organization, the Video Quality Experts Group, which has existed for a long time and is still persistently pursuing how to measure video quality. And as I think both of you already mentioned, there's freezing: video isn't just per-frame quality, video is a three-dimensional signal with a temporal dimension, and sometimes the video freezes, which is one of the worst experiences a user can have.

So how do you measure that? We're also interested in learning what scheme Meta has currently adopted to measure the user experience in terms of video quality, or the video experience overall.

Eric: Yeah, so I'm proposing a method. How do we measure video quality right now? We developed a system. First of all, I agree; at the beginning I thought about non-reference metrics, because they're more reliable in a sense: we don't need the source frames, and we can compute them on the receiver side. Especially since the sender may send high resolution, but the receiver, depending on the UI, may actually render it at a smaller resolution, and that's how the user really sees the video. So a receiver-side metric makes more sense to me, especially a non-reference one.

But the challenge is that for non-reference metrics, I personally feel there's still a gap; it's still an open research topic. Also, in the pipeline, complexity is a concern, because as I mentioned, more complexity means less battery and less talk time, and we don't want to reduce talk time just to calculate a metric, right? So that's a challenge.

So right now, at Meta, I'm trying to promote a new metric on the sender side, which is a PSNR metric. When we implement PSNR on the sender side, the first challenge is that we have the source video, but we may not have the decoded video; we have the encoded bitstream, but not the decoded frames. So it's not trivial to do the comparison, because we don't have the decoded video.

The second challenge is that we scale the resolution on the sender side according to the network, and those different scales make the PSNR calculation more challenging. So what I proposed in my presentation is a scheme where we estimate the distortion from scaling and the distortion from encoding separately. For the encoding part, we modify the encoder to report the distortion during encoding. For the scaler, we have a simple mechanism to estimate the scaling distortion; a straightforward implementation would be, when you scale down, to scale back up and compare against the source to see how much distortion the scaling introduced.

We have some optimizations to reduce the complexity, but once we have the scaling distortion and the encoding distortion, we can combine them and calculate the final PSNR. There's a detail here: at the beginning, we assumed the two distortions were independent, so we could simply add them, but we found there's some correlation, which isn't surprising. So we modified the PSNR calculation to account for that correlation by introducing a correlation term.

And in our experiments, decoupling those two distortion calculations and combining them gives a final PSNR that is very close to the ground-truth PSNR, where ground truth means we actually decode the video, upscale it back, and calculate PSNR between the decoded frame at the original resolution and the source. The numbers are very close. We also did some optimization, so right now, on a Pixel phone, the sender-side PSNR calculation takes only about 700 microseconds. We've integrated that into our pipeline.
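
Below is a simplified sketch of the estimation scheme as described here: the scaler estimates its own distortion by down- then up-scaling the source, the encoder reports its encoding distortion, and the two are combined (plus a correlation term) into one sender-side PSNR without decoding the bitstream. This is a reconstruction from the conversation, not Meta's implementation; in particular, how the correlation term and the scale of the encoder-reported distortion are handled is only assumed here.

```python
import numpy as np

# Sketch of the sender-side PSNR idea as described above; not Meta's implementation.
# Assumes 8-bit luma frames as numpy arrays.

def mse(a, b):
    return float(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2))


def scaling_distortion(source, downscale, upscale):
    """Estimate the distortion introduced by the scaler alone: scale the source
    down to the encode resolution, scale it back up, and compare to the original."""
    return mse(source, upscale(downscale(source)))


def sender_side_psnr(source, downscale, upscale, encoder_reported_mse,
                     correlation_term=0.0, peak=255.0):
    """Combine scaler distortion and encoder-reported distortion into one PSNR.

    encoder_reported_mse: distortion the encoder measures internally (assumed
    here to be expressed at source scale). correlation_term accounts for the
    two distortions not being fully independent; 0.0 is the naive "just add
    them" approximation.
    """
    total_mse = scaling_distortion(source, downscale, upscale) \
        + encoder_reported_mse + correlation_term
    total_mse = max(total_mse, 1e-10)
    return 10.0 * np.log10(peak * peak / total_mse)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
    down = lambda f: f[::2, ::2]                         # naive 2x decimation
    up = lambda f: np.repeat(np.repeat(f, 2, 0), 2, 1)   # naive 2x replication
    # encoder_reported_mse is a made-up placeholder for this demo.
    print(sender_side_psnr(src, down, up, encoder_reported_mse=20.0))
```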

So one thing I want to hear from Thomas is whether this could be a good candidate for the industry. For example, right now we're thinking of contributing this implementation to, let's say, WebRTC. And I'm wondering, from your experience working in the RTC industry, what you think of this proposal.


Thomas: Well, yeah, I think it would be great to have something that's codec-agnostic. It would be interesting to try to correlate it across different codecs. I think it's much better than what's reported now, so that would be an improvement, and it's quite an ingenious solution to avoid having to decode and upscale, because that's a lot of complexity. I did want to ask, though, whether this is very much optimized for camera video. You gave the example of Intra Block Copy versus palette mode, where in PSNR terms you could get a gigantic saving with Intra Block Copy, but subjectively palette still looks better, I guess because Intra Block Copy is copying low-level artifacts to lots of places in the picture. And generally, I see that PSNR can be very uncorrelated with subjective quality for desktop video, whereas for camera video the correlation can be quite good. So is it your experience that it's best for camera video, or how do you see it?

Eric: You mean the PSNR or you mean the coding tool?

Thomas: I mean, how accurate do you think this tool would be if you had desktop video for measuring the fidelity generally?

Eric: Okay, you mean how accurate the proposed PSNR is for measuring camera video versus-

Thomas: I mean, how well does it correlate with subjective quality? I think we have a good understanding of that for camera video, but sometimes it correlates much less well. I've seen some very good screen content video where the PSNR is like 30 or 33 dB and it looks fine, and then you see something at 45 dB and it doesn't look good. So there's huge variation with desktop video, I guess.

Zoe: Right, I think Thomas is referring to what you mentioned earlier: there are two very efficient AV1 screen content coding tools, one is palette mode and the other is IBC. And you mentioned that palette mode preserves the text edges fairly well and gives good subjective quality, while IBC may generate a good PSNR even when the subjective quality is not as good. So the question is, if we use PSNR as a metric, would it really correlate well with subjective quality, specifically for desktop screen content? That's the question, in your experience.

Eric: Okay, I see the question. The short answer is, I don't know. We're still struggling with how to get a good quality metric, and right now PSNR is the only solution we have. But I understand your point: I can foresee that for screen content, PSNR may not be the best measurement. I don't have a good solution in mind yet.

Thomas: Yeah, I wonder whether SSIM might also be a possibility; the encoder could calculate that as well. I don't know how well you can predict it across scales, but maybe you can.

Eric: Yeah, I think that's an interesting suggestion. I thought about SSIM before as well, but again, the issue is complexity. For offline comparison, we can use SSIM or even something like FSIM, more advanced metrics, but in the product we're looking for an in-product metric, and right now we already have a complexity challenge just implementing PSNR, which is why we had to do this decoupling of the distortions. When I propose PSNR, people always ask me, "We know PSNR has its limits, so why not consider something like VMAF?" But the thing is, I know PSNR is not the best metric and has a lot of limits; here, though, we're looking for a metric we can implement in the RTC pipeline, because we care about the in-product experience, so the metric computation has to be very light and low complexity.

Thomas: Yeah. But actually, proposing it to WebRTC does sound like an interesting idea, because if you think of something like an SFU, a video switch: if it had the PSNR, it could monitor it for all the streams and suggest that an encoder reduce or increase its resolution, because the video is easy or hard for the resolution it's chosen and the bitrate it's been allocated. So it could allow a lot more intelligence on the switch; it could be really useful.

Zoe: Right, so it's not only a gauge of quality: if you really integrate it into the real-time communication pipeline, the metric can also be leveraged as a guideline, a hint, for some of the encoding strategies. So we've been talking not only about real-time encoding but also about real-time quality metrics, which have to be real time, light, low power, and use little computation and memory. The real-time setting adds another layer of constraints, which brings up the potential topics we're discussing here.

Eric: Yeah, and I really like Thomas's comment, because the resolution decision is also an important use case for Meta. People keep tuning the resolution decision, making it more aggressive or more conservative, but the question we always have is, which is better?

If you encode at a larger resolution, you have to compress harder, versus a smaller resolution where the per-pixel encoding quality is better. And as someone mentioned, that should be content-dependent. But from WebRTC, if we only have QP and resolution, we don't have a clear signal to guide how we optimize the resolution decision. So that's another reason we're proposing this in-product PSNR metric.

Zoe: Right, yeah. I have to respect the time. We've been talking mainly about how AV1 has already been deployed at Meta for real-time communication on mobile. Eric, thank you so much; you shared with us the good things and the potential, as well as the challenges everyone brings up. And in the second part, we talked about quality metrics, because you want to quantify quality, and it also gives good guidance to help make decisions while encoding video in real time.

I learned quite a bit, as always, so thank you so much. We look forward to seeing more delivered from the Meta side. And as you mentioned during the episode, there's a new codec being developed, the successor to AV1, referred to at this moment as AV2 by AOM, and there are things that can be done in AV2 to serve real-time communication use cases even better. So I really appreciate the time, and thank you, Eric, for coming to our podcast. Would you like to give some closing words?


Eric: Thank you. I enjoyed the discussion.

Thomas: Thank you very much.

What Meta apps the team is supporting
Brief intro about Eric's presentation at Video @Scale
Is complexity going to be a real issue doing 4K screen share with AV1?
Challenges to deploy AV1
What about complexity? What about computational resources?
The challenge involves excessive power consumption, leading to shortened battery life
Reliable quality metrics in RTC pipeline