TVV EP 04 - Thomas Davies - AV1 for WebRTC

Thomas Davies is a Principal Codec Engineer at Visionular working on all aspects of video codec development and optimization with a special focus on RTC (real-time communications) use cases. Thomas developed the first real-time AV1 HD software video codec, for Cisco Webex, writing video enhancement algorithms, video analysis, real-time pre-filtering, and post-filtering for Webex clients and infrastructure. Thomas is a key contributor to the AV1 video codec standards development and his contributions include key codec technologies, such as entropy coding design, quantization, error resilience, and bitstream syntax.

In this interview, we dive into a discussion about how RTC encoding is different from file-based streaming or live broadcast streaming. Thomas provides key insights into the main challenges around how to deal with lossy networks, low latency, low bitrates, and difficult rate control issues. We touched on methods for maintaining AV sync in software-based clients, how to optimize for minimal CPU usage, what to do with intra frames, SVC, and more. This is an episode packed with knowledge. You will want to stay to the end.

Thomas Davies- I'm Thomas Davies. I'm a tech lead at Visionular. I work on video codecs and video processing. I've been in the industry for a good few years, working on RTC applications, broadcast, and so on.

Mark Donnigan- Welcome to the show, Thomas. It's really awesome to have you on Inside the Videoverse. RTC is a hot topic these days. So Thomas, maybe with that, before we start talking about video coding for RTC, give us a quick overview. You are working in this space as a video codec developer and engineer. What has changed, and how has the industry and technology shifted over the last 10 years?

What has changed in RTC?

Thomas- The thing that's changed most obviously has been the enormous scale in recent years, you know, with the pandemic. And therefore, the complete reliance that people have had on using video conferencing. And that's been enabled, behind the scenes, really by the cloud revolution. So, being able to deploy video conferencing services in the cloud has enabled that very rapid increase in scale.

So, there is scale, and then there's the ability to meet that scale really by moving things out of on-premise solutions in companies, data centers to publicly accessible data centers. So that's been a huge change. But the other thing that has changed has been simply the technology under the hood. The video codecs themselves have got much more advanced.

Conferencing for a long time has been stuck on H.264 and, more recently, VP8. But now, new codecs like AV1, and to some extent, HEVC, have come into the picture. And those have revolutionized what you can do with video conferencing in terms of quality.

Mark- Yeah, it's definitely something that I have observed. And so in this interview, we'll focus on video coding for RTC. It's a very unique challenge. With RTC, of course, we are, by very definition, talking about real-time. And real-time not like two or three seconds latency, but like 100, 200, 300 milliseconds latency. And that's end-to-end.

What does being real-time mean?

Thomas- So, real-time means a couple of things. It means being fast and being efficient. And for video conferencing, the encoders, in particular, are really fast compared with those in other applications. But it also means hitting deadlines. And hitting many deadlines per second, one every 30th of a second. And that's really tough to do when the whole thrust of video coding techniques is about working out where to spend your effort. But in RTC, you can't spend lots of effort on one special frame and use that to predict all the others. You have to spread your effort around.

So you have to hit those deadlines and still produce good video quality. And I think another aspect is that you have this, not just very low latency, but you have lossy networks because you don't have time to deliver a frame again. Once it's been sent out, it's gone, you have to make the best of it. So you have to be able to cope in a lossy environment. And that's partly a system feature to have that resilience. And there are system techniques that you can use to achieve that. But it's also about allowing your encoder to be configured to recover from losses.

And I think the last thing I'd add is a great deal of adaptivity because networks constantly change. And the reason why you are getting loss is that there's congestion. And one of your responses to loss is that you might need to change your resolution, you might need to change your bit rate. So you need to turn that round on a dime. You need to adapt very quickly, in a way that looks seamless to the receiver and does not cause disruption.

Scalable Video Coding and Rate Control.

Mark- I know that scalable video codecs, SVC, have been talked about in other standards for years. And yet AV1 is the first where it's been implemented in the baseline. Is that correct?

Thomas- Well, it was in VP9 as well that you could do scalable coding. I mean, what's interesting about AV1 is that it's resolution-agnostic. You set a maximum resolution, and then you can send any picture size you like, so long as it doesn't exceed that. So, you can already adapt resolution, and you don't need to send a keyframe. So that's a useful feature that, among other codecs, has only been added in VVC, which adds resolution adaptation. But it's natural, when you have a resolution-agnostic encoder, to do scalability. Because it's just predicting between different resolutions and partitioning off those resolution layers for different receivers, it's a very flexible system, yeah.
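To illustrate the idea of partitioning resolution layers for different receivers, here is a minimal sketch of how a server might choose how many layers to forward. It is not taken from any AV1 or WebRTC library; the function name `highest_decodable_layer` and the example bit rates are hypothetical, and it assumes each layer depends on all the layers below it.

```python
def highest_decodable_layer(layer_bitrates_bps, receiver_bandwidth_bps):
    """Pick the highest layer a receiver can be sent.

    layer_bitrates_bps lists the bit rate of each layer, base layer first.
    Assumption: a layer is only useful if every lower layer is also sent,
    so the cost of layer i is the sum of layers 0..i.
    Returns the layer index, or -1 if even the base layer does not fit.
    """
    total = 0
    chosen = -1
    for index, rate in enumerate(layer_bitrates_bps):
        total += rate
        if total > receiver_bandwidth_bps:
            break
        chosen = index
    return chosen


# Hypothetical three-layer stream: 150 kbps base plus two enhancement layers.
layers = [150_000, 350_000, 1_000_000]
print(highest_decodable_layer(layers, 600_000))  # base + first enhancement fit
```

A real selective forwarding server would also need a switching point before moving a receiver up a layer, which this sketch does not model.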

Mark- One of the things that come to mind when we think about this speed, I like what you said. It's like, every 30th of a second, that frame's gotta go out. And if it doesn't quite make it or if there's a problem, well, we'll try to do better on the next one -- this is a unique challenge compared to the world a lot of us come from, especially in VOD. But even broadcast live, you can adjust your latencies. You know, a football game, even though it's becoming less desirable to be delayed, can be delayed 15 seconds. And so you can operate your encoders differently. So, what's the impact on the rate control? And talk to us about some of the things that you have to think about as a practitioner.

Thomas- Yeah, so when you study rate control, you open a textbook on video coding, look at your leaky buffer model, leaky bucket model, and so on. The thing is, over the internet, and especially for RTC, almost every part of that is semi-fictional. So, you are trying to communicate with the far end. And you have a certain bit rate that you're trying to hit. But that may not represent the real bandwidth that's available to you in the network because you're competing for bandwidth with other services, usually TCP-based services with greedy algorithms trying to steal your bandwidth. So, you may have an instantaneous bit rate that's fluctuating quite a lot. So that doesn't fit with the usual models that people have. And that speaks to the adaptation that you need.
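For readers who have not met the textbook model, the leaky bucket can be sketched in a few lines. This is an illustration of the idealized model only, under the assumption of a constant channel rate, which is exactly the assumption Thomas says breaks down on the internet. The name `simulate_leaky_bucket` and its parameters are hypothetical.

```python
def simulate_leaky_bucket(frame_bits, bitrate_bps, fps, capacity_bits):
    """Track buffer fullness for a textbook leaky bucket.

    Each frame adds its bits to the bucket, and the bucket then drains at
    the (assumed constant) channel rate for one frame interval.
    Returns the fullness after each frame and the indices of frames that
    pushed the bucket over its capacity.
    """
    drain_per_frame = bitrate_bps / fps
    fullness = 0.0
    history = []
    overflows = []
    for index, bits in enumerate(frame_bits):
        fullness += bits
        if fullness > capacity_bits:
            overflows.append(index)
        fullness = max(0.0, fullness - drain_per_frame)
        history.append(fullness)
    return history, overflows


# 300 kbps at 30 fps drains 10,000 bits per frame interval.
sizes = [10_000, 10_000, 10_000, 60_000, 10_000, 10_000]
history, overflows = simulate_leaky_bucket(sizes, 300_000, 30, 40_000)
print(history, overflows)
```

In this example the single large frame leaves the bucket elevated for every following frame, because the later frames only just match the drain rate and never pay the excess back.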

RTC Coding and Latency.

The other thing that is peculiar to RTC is that you don't get a chance to see the future. So, you are proceeding as best to stay within your rate control limits, but the content may change completely. And you have to cope with that. Whereas in a VOD case, you get a scene change, you can code that a couple of times to work out exactly how best to allocate bits. But you have less chance to do that. There are some things that you can do. The other thing is that latency does not just come from coding time. It comes largely from the size of your buffer and the variation in the size of your frame. So you have to keep your frames reasonably similar sizes so that they will not take a long time to receive and then cause display issues and latency at the far end.

But on the internet, you can probably get away with sending a few larger frames from time to time, so long as you don't send too many, because you are in statistically multiplexed large buffers on routers and switches and so on, most of the time. But even so, you do have to keep the frame sizes pretty close to constant as much as you can to reduce that latency.
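The link between frame size and latency can be made concrete with a little arithmetic. The sketch below assumes a single bottleneck with a fixed rate and ignores queuing and propagation delay; `frame_transmission_ms` is a hypothetical helper, not part of any library.

```python
def frame_transmission_ms(frame_bytes, bitrate_bps):
    """Time in milliseconds to push one frame through a link of the given rate."""
    return frame_bytes * 8 * 1000 / bitrate_bps


bitrate = 1_000_000                    # assumed 1 Mbps bottleneck
average_frame = bitrate / 30 / 8       # bytes per frame at 30 fps
print(frame_transmission_ms(average_frame, bitrate))      # about one frame interval
print(frame_transmission_ms(5 * average_frame, bitrate))  # five frame intervals
```

Under these assumptions an average-sized frame takes about one frame interval to arrive, while a frame five times larger takes about five intervals, which is why a few oversized frames can be tolerated but a steady run of them cannot.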

Mark- From a video coding perspective, how does this affect, like, sync so that all of the users have a similarly good experience? What do you have to think about?

Thomas- Yeah, so I think a lot of sync problems go away the less latency that you have, obviously. So you start to see these sync things occurring when latencies get high. Yeah, more than a couple of hundred milliseconds. More than 250, something like that. Usually when that happens, an individual receiver may well struggle to synchronize audio and video.

Now, generally, on a client, audio is actually the most important thing because you want to hear people. Video is important, but it doesn't convey meaning in the same way. So, you're prioritizing audio in the client. And you are therefore trying to sync the video to the audio rather than the other way around. So you want to preserve the audio as much as possible, and then the video will get synced to the audio. Typically, you don't tend to share audio from multiple users, except that, increasingly now, in a conference scenario, you may want to not have entirely switched audio. You might want to mix them on a server. And then, you have a synchronization issue because you want to make sure that you don't delay things too much, but you will have to delay some things in order to make sure that they're reasonably in sync.

You are mixing because you want to try and get a better ambiance and so that people can interrupt better so that you can hear them starting to interrupt. But if you have an entirely switched experience, then you don't tend to have too many of those kinds of problems on a server. But you need some kind of algorithm to pick the most appropriate audio, usually based on loudness or something like that.
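A loudness-based choice of audio for a switched conference might look something like the sketch below. The hysteresis margin is an assumption added here to avoid rapid flipping between speakers; it is not something described in the interview, and `pick_active_speaker` is a hypothetical name.

```python
def pick_active_speaker(loudness_db, current=None, hysteresis_db=3.0):
    """Choose which participant's audio to forward.

    loudness_db maps participant id to a recent loudness estimate in dB.
    The current speaker is kept unless someone else is louder by at least
    hysteresis_db (an assumed margin, chosen arbitrarily for illustration).
    """
    if not loudness_db:
        return current
    loudest = max(loudness_db, key=loudness_db.get)
    if current in loudness_db and loudness_db[loudest] - loudness_db[current] < hysteresis_db:
        return current
    return loudest


print(pick_active_speaker({"alice": -30.0, "bob": -20.0}))                   # bob
print(pick_active_speaker({"alice": -21.0, "bob": -20.0}, current="alice"))  # alice
```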

Mark- As a video engineer, how much are you getting involved in audio? Or is there usually someone else on the team or working on the product who's thinking about the audio aspects?

Thomas- Yeah, so the audio focus is much more on getting those packets through and making sure that the audio latency, end-to-end, is as low as possible. That audio gets prioritized correctly, that audio metadata is preserved, like loudness and so on. And then there are issues like audio scale.

So, suppose you have a transcoding server for video conferencing. In that case, there are similar transcoding services for audio, where you want to take in telephony calls, standard PSTN calls, and transcode them to low-bit-rate standards like AAC or Opus. And those have unique technical challenges because you can do that at enormous scale with thousands and thousands of calls on a server, but to do that well is really an interesting problem in high-performance computing, to be able to run that many individual processes without being completely destroyed by the threading issues and the process communication issues that you have. So that's a kind of interesting specialism within the audio world.

High Performing Encoders for Low to Mid Mobile Devices.

Mark- How do you approach, and how do you think about designing a codec, or optimizing a codec, optimizing a solution, that might be deployed on a lower or mid-end mobile device? So ARM and, I don't know, four cores, or six cores, however many cores are available. Obviously, you can't use them all. And then you have someone else who's on a MacBook Pro. And so, what does that look like? And what are some of the challenges that you have to think about around CPU usage? That's a significant constraint.

Thomas- It is a significant constraint. I think people starting out maybe tend to assume, well, real-time means we run at 30 FPS. And in the real world, if you tried to do that, your customers' machines, consumers' machines, would just seize up because they wanna do other things as well. They wanna have their video on, they wanna have their share on. So you might only have a fraction of their CPU available. And they might also be doing things that are compute-intensive. So, they might be sharing a screen where they're playing a video in that screen.

So you are effectively transcoding something from YouTube or TikTok or something, and they are sending that down the screen. And that's consuming significant resources. So, you do need to be able to adapt your CPU. But also, you just need to use not very much, and less than you might think from trying to achieve your 30 FPS. You actually have to achieve higher speeds than that.
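The point about needing to run faster than 30 FPS follows from simple arithmetic. The sketch below assumes encoding throughput scales linearly with the share of CPU the encoder is allowed to use, which is a simplification; `required_encoder_fps` is a hypothetical helper.

```python
def required_encoder_fps(target_fps, cpu_share):
    """Speed the encoder must reach on a fully available CPU.

    cpu_share is the fraction of the CPU the encoder may use (0 < share <= 1).
    Assumes throughput scales linearly with CPU share.
    """
    if not 0 < cpu_share <= 1:
        raise ValueError("cpu_share must be in (0, 1]")
    return target_fps / cpu_share


# If the encoder may only use a quarter of the CPU, it has to be capable
# of 120 FPS flat out in order to deliver 30 FPS in practice.
print(required_encoder_fps(30, 0.25))  # 120.0
```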

Then there is also the issue of worst-case frame time. So, it's not enough just to be fast enough on average; you have to hit your deadline. So you have to tackle your worst-case frame time. And when the content can change completely, that can be very challenging. You know, you have effectively a scene change to deal with. You have to produce some kind of video in that time. And that will mean, from time to time, the quality may need to dip in a way that it can't do on VOD. But you just have to accept that, and you want to minimize it. So you need to have many speed levels and to be able to adapt between them. And you need those speed levels to be optimal, to give you the best quality you can for the CPU that you have. So you're always in the space of trade-offs.
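One way to picture adapting between speed levels is a small controller driven by measured encode times. This is a sketch under stated assumptions, not how any particular encoder does it: the class name, the window length, and the `headroom` threshold are all hypothetical. It speeds up immediately on a missed deadline and only slows down again after a full window in which the worst frame was comfortably inside the budget.

```python
from collections import deque


class SpeedLevelController:
    """Level 0 is the slowest, highest-quality preset; higher levels are faster."""

    def __init__(self, budget_ms, num_levels, window=30, headroom=0.6):
        self.budget_ms = budget_ms
        self.num_levels = num_levels
        self.headroom = headroom            # assumed safety margin
        self.level = 0
        self.recent = deque(maxlen=window)  # recent encode times in ms

    def update(self, encode_ms):
        """Record one frame's encode time and return the level for the next frame."""
        self.recent.append(encode_ms)
        worst = max(self.recent)
        if encode_ms > self.budget_ms and self.level < self.num_levels - 1:
            # Missed the deadline: move to a faster level straight away.
            self.level += 1
            self.recent.clear()
        elif (len(self.recent) == self.recent.maxlen
              and worst < self.headroom * self.budget_ms
              and self.level > 0):
            # Worst case over a full window is well inside the budget.
            self.level -= 1
            self.recent.clear()
        return self.level


controller = SpeedLevelController(budget_ms=10, num_levels=3, window=3)
print([controller.update(t) for t in [12, 15, 20, 4, 4, 4]])
```

Basing the slow-down decision on the worst frame in the window, rather than the average, reflects the point above that being fast enough on average is not sufficient.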

So, by contrast to, say, a VOD scenario, where you are trying to give the very best quality video because it's going to be consumed many times, in RTC, the video is disposable. It's usually consumed once. So on VOD, you're trying to get as close to perfection as you can, but then keep the costs reasonable, you know, within some kind of budget. Whereas with RTC, you're always in the world of trade-offs. What's the best quality I can do with the CPU I have now, which is a kind of different perspective.

Intra-frames and RTC Coding.

Mark- You know, in the VOD world, of course, intra-frames are our friend, right? But it seems to me that, in the RTC world, intra-frames are a challenge or a problem. So, how do they get used?

Thomas- They get used in a few ways. So, you need to start with an intra-frame because you have to start with something. And therefore, if you need to restart, you need to get another keyframe. And the kind of reason you might need to restart is because you've had loss that you can't recover from. So, there are various strategies to provide recovery mechanisms so that you don't have to send a keyframe. The other thing that you can do is to try to conceal the quality of your keyframe, maybe send something that is not such high quality, but try to recover quickly over time. What you can't afford to do is send the perfect I-Frame, you know, that is entirely pristine, that might take up a second's worth of bandwidth. Because not only then will it take a long time to deliver, but if you're trying to recover from loss, it might get damaged in transit. It may lose some packets. And so you will have to send it again. So if you were sending a super large intra-frame, you don't wanna send it five times and get many seconds of loss.
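The risk with a very large intra-frame can be estimated with a simple model. The sketch below assumes packet losses are independent and that there is no retransmission or forward error correction, so the frame is only usable if every packet arrives. The 1200-byte payload size and the function name are illustrative assumptions.

```python
import math


def frame_delivery_probability(frame_bytes, packet_loss_rate, payload_bytes=1200):
    """Probability that every packet of a frame arrives.

    Assumes independent packet loss and no recovery mechanism.
    """
    packets = max(1, math.ceil(frame_bytes / payload_bytes))
    return (1.0 - packet_loss_rate) ** packets


# At 2% loss, compare a modest keyframe with a very large one.
print(frame_delivery_probability(6_000, 0.02))    # 5 packets
print(frame_delivery_probability(120_000, 0.02))  # 100 packets
```

Under these assumptions the chance of a clean delivery falls exponentially with the number of packets, which is why a smaller keyframe that recovers quality over the following frames can be the better trade.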

Mark- In AV1, there's this notion of an S frame. Explain what an S frame is because I'm not sure everyone is even aware of the S frame. Is it helpful to us in RTC?

Thomas- The idea of an S frame is that it allows you to switch layers. So it gives you something that you can parse because it uses a special resilient mode that gives you a parsable frame. You can always decode it, except possibly for predicting from previous frames. Now, that means that it also resets the buffer, the reference frame buffer. And that's particularly useful if you want to switch up to a higher layer if you are doing scalable video coding. So, it gives you a switching point.

Usually, when you want to go to a higher bit rate as a receiver, you can ask a server to send the next few layers. And you know that you can decode them because you have this S frame. Now, there is another feature present in previous codecs, like H.264, which is a kind of intra-refresh, GDR, gradual decoder refresh, where you can start with a stripe of intra-blocks and move them across the frame. And if you constrain your motion vectors, a receiver can begin decoding. But it can't produce a displayable output until that GDR has swept right across the frame. Now, in AV1, that would be difficult without an S frame because lots of data in an AV1 frame depends on previous data. But an S frame allows you to refresh things. So it should be possible to have a GDR-like approach using S frames for people to join streams where they don't get a keyframe. Or to switch streams, for example.
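The sweeping intra stripe of a gradual decoder refresh can be sketched as a schedule of block columns. This only computes where the stripe sits in each frame; it does not model the motion vector restrictions or any AV1 or H.264 syntax, and `gdr_intra_columns` is a hypothetical name.

```python
def gdr_intra_columns(frame_index, width_in_blocks, refresh_period):
    """Block columns coded as intra in the given frame.

    The stripe moves left to right and covers the whole frame width once
    every refresh_period frames.
    """
    step = frame_index % refresh_period
    start = step * width_in_blocks // refresh_period
    end = (step + 1) * width_in_blocks // refresh_period
    return range(start, end)


# A frame 10 blocks wide, refreshed over 4 frames.
for frame in range(4):
    print(frame, list(gdr_intra_columns(frame, 10, 4)))
```

A receiver joining at the start of a sweep would have every column refreshed after `refresh_period` frames, which corresponds to the point at which a displayable picture becomes available.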

Mark- And these features can be built using a baseline implementation? Or is there something special required on the decoder?

Thomas- Yeah, so a decoder would need to be robust to not having a keyframe. So it would have to not fall over if it doesn't have reference frames in its buffer. So long as it gets a sequence header, it starts decoding, and then it would also need some signaling at a higher layer or in some special metadata in the bit stream to tell it when it should be able to display. So there's a little bit of metadata needed, and there's some robustness in the decoder not to fall over and die if it doesn't have all the reference frames.

AV1 SCC Tool.

Mark- Inside AV1, I know that there's something called screen content coding. There's a screen content coding tool that is quite powerful. You've done a lot of work on it at your former company and, of course, at Visionular. I'm not sure the audience knows about screen content coding. What is that tool, and why is it special?

Thomas- Well, I think one thing that's special about it is it's built into the main profile. So you can always use it. And I think this is one thing that stymied previous codecs, where there were tools. For example, in HEVC, they were in a different profile. And they are powerful, but they're not needed all the time. And they do have some implications, particularly for hardware complexity, which is why they had been put in different profiles before.

So, these tools include intra-block copy, which allows you, within an intra-frame, or a keyframe, rather, to copy blocks from one area to another. So you're effectively doing motion compensation but within the same frame. Then there's palette coding. And that allows you to express a block as a list of colors. You identify pixels by what color they are. So you describe a certain number of colors. Then what's also useful is having a very wide range of transforms.

It turns out that the statistics of screen content, of PowerPoints and spreadsheets and that kind of thing, are very different from those of ordinary video. To code it efficiently, different transforms are needed to de-correlate the data optimally. And those all make quite a bit of difference. But the other thing I would say to you is that to produce a really good screen content encoder, you have to optimize how you approach the encoding process. And that's some of the secret sauce that you can apply. So, it's not just the tools; it's applying them correctly.
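The idea behind palette coding, expressing a block as a short list of colors plus a per-pixel index, can be shown with a toy example. This is a conceptual sketch only and does not follow the AV1 bitstream; the function names and the `max_colors` limit are illustrative assumptions.

```python
def palette_encode(block, max_colors=8):
    """Represent a block as (palette, index_map).

    block is a list of rows of hashable pixel values, for example RGB tuples.
    Returns None if the block uses more than max_colors distinct colors,
    in which case an encoder would fall back to another coding tool.
    """
    palette = []
    index_of = {}
    index_map = []
    for row in block:
        out_row = []
        for pixel in row:
            if pixel not in index_of:
                if len(palette) == max_colors:
                    return None
                index_of[pixel] = len(palette)
                palette.append(pixel)
            out_row.append(index_of[pixel])
        index_map.append(out_row)
    return palette, index_map


def palette_decode(palette, index_map):
    """Rebuild the block by looking each index up in the palette."""
    return [[palette[i] for i in row] for row in index_map]


white, black = (255, 255, 255), (0, 0, 0)
block = [[white, black], [black, white]]
palette, index_map = palette_encode(block)
print(palette, index_map)
```

Text and user-interface content often uses only a handful of exact colors per block, so a small index map of this kind can be a much more compact description than transform coefficients.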

Mark- Thank you, Thomas, for coming on and sharing all of your experience. It's an exciting time to be in video.

Thomas- It certainly is. Thanks very much for having me. It's been great fun, thank you.
