In part 2 of our sitdown with Vimeo's principal codec engineer Thomas Daede, he shares how AV1 is making 10-bit HDR the new standard for all online video. He also shares his views on the best way to evaluate the quality of an encoder, including the ever-debated VMAF scoring method. Is it the best approach, or are there other scoring systems that should be included?
Watch the full video version.
Learn more about Visionular and get more information on AV1.
Nathan: Welcome back to the VideoVerse. This is actually part two of a two-part interview that we did with Thomas Daede of Vimeo. So if you haven’t gotten a chance to listen or watch the first one, go back and check that out. Otherwise, we’re gonna jump right in with Thomas. There’s something as a term I wanna throw out, and I have a feeling I can just say it, even though I kind of know what it is, but I want you guys to explain. T.35 metadata. Talk to me about this and the significance in AV1, particularly with HDR.
Thomas: Yeah, so one of the big things we did with AV1 that was an improvement over, say, VP9, which is one of the codecs it’s based on, is we added a lot of support for in-bitstream metadata signaling. Before this, we could signal some of this metadata in the container, but that kind of didn’t match up with what other people were doing with other codecs, and it was an easy way for that data to get lost. So we explicitly added features to put all sorts of color and HDR metadata into the AV1 bitstream itself. One of those is simple static metadata. So we can now encode things like color mastering information. For example, for plain old HDR10, we can encode all of the parameters it needs into the bitstream. But we also added T.35 metadata support. T.35 is a previously standardized form of metadata that can be included with any video bitstream; it’s actually codec-independent. It lets you basically insert any sort of metadata, but one very relevant piece of metadata that you can insert is dynamic HDR metadata. And there’s a couple of different standards for this. One, for example, is HDR10+, which adds extra T.35 messages along with frames. There’s actually a standard for implementing this with AV1. If you search for the HDR10+ AV1 spec, you can find it. So it also allows us to do dynamic metadata in the AV1 bitstream for HDR.
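To make the idea of a codec-independent T.35 message a little more concrete, here is a minimal Python sketch of splitting one into its leading identifiers and payload. The identifier values and the 16-bit layout of the two provider fields are assumptions based on how HDR10+ is commonly described and should be checked against the HDR10+ AV1 spec; the names `T35Message`, `parse_t35`, and `looks_like_hdr10_plus` are hypothetical.

```python
from dataclasses import dataclass

# Assumed HDR10+ identifiers (verify against the HDR10+ AV1 spec before use).
HDR10_PLUS_COUNTRY_CODE = 0xB5
HDR10_PLUS_PROVIDER_CODE = 0x003C
HDR10_PLUS_PROVIDER_ORIENTED_CODE = 0x0001


@dataclass
class T35Message:
    country_code: int
    provider_code: int
    provider_oriented_code: int
    payload: bytes


def parse_t35(data: bytes) -> T35Message:
    """Split a raw T.35 message into identifiers and the remaining payload."""
    if len(data) < 1:
        raise ValueError("empty T.35 message")
    country_code = data[0]
    pos = 1
    if country_code == 0xFF:
        pos += 1  # skip the country code extension byte
    if len(data) < pos + 4:
        raise ValueError("T.35 message too short for provider codes")
    # Assumption: two big-endian 16-bit provider fields follow the country code.
    provider_code = int.from_bytes(data[pos:pos + 2], "big")
    provider_oriented_code = int.from_bytes(data[pos + 2:pos + 4], "big")
    return T35Message(country_code, provider_code,
                      provider_oriented_code, data[pos + 4:])


def looks_like_hdr10_plus(msg: T35Message) -> bool:
    """Check the identifiers against the assumed HDR10+ values."""
    return (msg.country_code == HDR10_PLUS_COUNTRY_CODE
            and msg.provider_code == HDR10_PLUS_PROVIDER_CODE
            and msg.provider_oriented_code == HDR10_PLUS_PROVIDER_ORIENTED_CODE)
```

The point of the sketch is that nothing in it depends on which codec carried the bytes: the same parser would apply whether the message came out of an AV1 or an HEVC stream.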
Nathan: So, mere mortal brain here interpreting: does this mean that potentially you have kind of a universal HDR information, or maybe not universal, but you’d have something that would support HDR10 and Dolby Vision, and you wouldn’t have to worry about which version of HDR your Apple TV supports versus your television versus… Is that the direction this is going in?
Thomas: We’re not quite there yet, unfortunately. I’d love for that to be the case. Maybe sometime in the future. What we do have is that the T.35 metadata is codec-independent. So it can work with AV1, it can work with older legacy codecs, it can work with whatever else. And that data, once it’s taken out of the codec, can then be sent as metadata directly to the TV over HDMI, if the TV needs it. And that is standard. What is not standard yet is the T.35 metadata itself; there are a couple of variants still, unfortunately. So HDR10+ is one, Dolby Vision is another. And those are still incompatible. You could include both with your stream. You could also rely on graceful degradation: HDR10+, for example, will gracefully degrade to HDR10. If the TV doesn’t support the dynamic metadata, it will just use the static metadata. And you can design it such that it degrades in a way that’s still acceptable and viewable. It’s just not as nice-looking. But we still, unfortunately, do have multiple HDR standards with that T.35 metadata. And so you will have to pick one if you want to use it with AV1.
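The graceful degradation described here can be sketched as a simple selection step on the playback side. This is only an illustration under assumed names: `StreamMetadata`, `choose_metadata`, and the format strings are hypothetical and do not come from any real player API.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class StreamMetadata:
    # Static HDR10 mastering info, if present (contents are illustrative).
    static_hdr10: Optional[dict] = None
    # Dynamic metadata keyed by format name, e.g. "hdr10plus" (hypothetical keys).
    dynamic: Dict[str, bytes] = field(default_factory=dict)


def choose_metadata(stream: StreamMetadata,
                    display_formats: Set[str]) -> Tuple[str, object]:
    """Pick the richest metadata the display understands, falling back step by step."""
    # Prefer any dynamic format that both the stream and the display support.
    for name, payload in stream.dynamic.items():
        if name in display_formats:
            return name, payload
    # Otherwise fall back to static HDR10 metadata if the display can use it.
    if stream.static_hdr10 is not None and "hdr10" in display_formats:
        return "hdr10", stream.static_hdr10
    # Last resort: no HDR metadata is usable, so the player must tone map to SDR.
    return "sdr", None
```

A stream carrying HDR10+ plus static metadata would then play as HDR10+ on a display that supports it and as plain HDR10 on one that does not.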
Nathan: Okay, we’re getting closer, but we’re not quite there yet.
Zoe: Well, theoretically, the decoder’s main job with this kind of metadata is just to extract the metadata info from the bitstream and then pass it along. It passes it over to the final renderer, and the renderer, based on that metadata, does the necessary transformation and conversion before the video is finally rendered.
Thomas: Exactly. For example, if you’re using the dav1d software decoder for AV1, there’s actually an API where you just say, please give me the T.35 metadata for this frame. And it basically pulls it back out. So from dav1d you get the final video in terms of the YUV frame buffer, and you get the T.35 metadata attached to it. And you might either send that directly to the display device, or, if you’re doing some software processing, maybe you’re tone mapping it to SDR or doing something else, you’ll pass both of those to your tone mapper. And then you do your final steps there of color conversion.
Zoe: Yeah, then I have a follow-up question. I really just want to confirm with you. For example, say I have the same T.35 metadata inserted in an HEVC bitstream and also in an AV1 bitstream. One decoder is the HEVC decoder, the other is the AV1 decoder, and they are both able to extract that metadata. Suppose the metadata inserted in the two different standard bitstreams is identical. Then the decoder will decode the bitstream into the pixel domain, the final image to be rendered. And then the player will handle the final rendering, together with the metadata info. So at this stage, do you see anything related to the standard, or is it actually codec standard agnostic?
Thomas: That part is codec standard agnostic. So the only thing that needs to be standardized, especially in AV1, is basically how we insert the T.35 messages, and AV1 is very flexible. So, for example, if you look at the HDR10+ in AV1 mapping, it has to specify how those T.35 messages are attached to frames, because AV1 has things like hidden frames that can be used as references for other frames. There’s a little bit of detail to specify which frames you want to attach the HDR metadata to, and how it gets extracted. But once it’s extracted, once you get those bits out, then that’s purely codec-independent. If you have an HEVC stream or an AV1 stream and you get that HDR10+ metadata, the processing from that point onwards is exactly the same.
AV1 is ready for Dolby Vision
Zoe: Yeah, then because sometimes we get customers asking whether AV1 supports Dolby Vision. In my understanding, technology-wise, an AV1 decoder is able to extract, for example, Dolby Vision metadata info from the bitstream. That is within the scope of the AV1 standard, so AV1 already supports that. And then once the metadata is extracted, for example in a Dolby Vision format, it is sent to the final renderer, and that part is codec agnostic. So can we tell users that currently AV1 already supports Dolby Vision? Or is this something beyond the technology that determines it?
Thomas: Yeah, AV1 is technically ready for, for example, Dolby Vision. It has all the features you need for that. Because Dolby Vision is Dolby’s format, they would probably want to specify, here is a Dolby Vision in AV1 mapping, kind of like there is one for HDR10+. So I mean, because it’s their standard, I can’t say it officially supports Dolby Vision, for example, because I think that’s up to Dolby to say. But if they wanted to do it at any time, all the pieces are ready there. They would just have to write a very simple mapping, and it will work.
Zoe: Got it. That’s very clear, so we can mention that on the AV1 side, the AOM side, it is ready to support Dolby Vision. And for the remaining part, we really hope that there could be some collaborative effort down the road to finally let users enjoy Dolby Vision more widely, for example, including with AV1.
Nathan: Yeah, that’s exciting. Thomas, you’ve mentioned that you’re currently, obviously, at Vimeo, and you’ve been working on the encoder and encoding development there. Can you talk to us a little bit, as much as you’re able to, about what Vimeo is doing in regards to AV1? And are there any unique challenges that a huge platform like Vimeo has to deal with that might be different from others?
Vimeo and AV1
Thomas: Yeah, so there’s a couple things. One is that Vimeo is primarily a user-uploaded video platform. There’s business accounts and the like too, there’s all sorts of other businesses, but it’s primarily kind of self-serve: you upload a video, and it appears. We have a lot of videos that get lots of views, and quite a lot of videos that get very few views. So if you compare it to something like Netflix, which has a relatively small video collection, and most of those videos get many, many views, we have many more videos that get proportionally fewer views. So there is some difficulty there, in that our transcode costs relative to views are different from Netflix’s.
So one thing we have to do is make sure we can transcode AV1 in a reasonably cost-effective way, and make sure that we still give a big benefit to our customers that get AV1. We currently use the rav1e open-source AV1 encoder to do our own encodes. You don’t get them on every video. They’re currently on, for example, our staff pick videos and other videos with high view counts that we know we’re basically gonna get a return on. We’re looking at increasing that, though. For example, by also potentially doing AV1 HDR, where we can not only improve the compression, but also give HDR to more customers that wouldn’t otherwise be able to see it at all, for example, in web browsers.
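The trade-off between transcode cost and view count can be sketched as a break-even calculation. This is a toy model, not Vimeo’s actual logic: the function names and all numbers are hypothetical, and costs are expressed in arbitrary integer units to keep the arithmetic exact.

```python
def break_even_views(transcode_cost: int, savings_per_view: int) -> int:
    """Smallest number of views at which delivery savings cover the encode cost."""
    if transcode_cost < 0 or savings_per_view <= 0:
        raise ValueError("cost must be >= 0 and savings per view must be > 0")
    # Integer ceiling division: views needed so that views * savings >= cost.
    return -(-transcode_cost // savings_per_view)


def av1_encode_pays_off(expected_views: int, transcode_cost: int,
                        savings_per_view: int) -> bool:
    """True when expected delivery savings exceed the one-time transcode cost."""
    return expected_views * savings_per_view > transcode_cost


if __name__ == "__main__":
    # Hypothetical numbers: an AV1 encode costs 2000 units and each view
    # saves 3 units of bandwidth compared with the existing encode.
    print(break_even_views(2000, 3))
    print(av1_encode_pays_off(500, 2000, 3))
    print(av1_encode_pays_off(1000, 2000, 3))
```

Under a model like this, a catalog with many rarely viewed videos would only justify the more expensive encode for the titles expected to clear the break-even view count.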
Zoe: Really, yeah, potentially, the technology will benefit the users.
Nathan: Yeah, for sure. And I’m imagining on both ends, because more and more phones are shooting HDR and uploading HDR source footage. All of a sudden, it becomes even more valuable to keep that full HDR stream from glass to glass.
Thomas: Exactly. So phones, like iPhones now, for example, often upload HDR by default. People will often record in HDR without even realizing they’re doing it. And so it’s increasingly important for us, when a user uploads an HDR video, to make it look just like they saw it on their device. That means, for example, playing it back either in HDR or, if it’s viewed on an SDR device, we want to tone map it well. So we also have our own tone mapping, so that even if we’re generating an SDR stream, we’ll do all the appropriate tone mapping based on their metadata to make sure it looks similar to how they would see it on their own SDR screen.
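As a rough illustration of what tone mapping an HDR signal to SDR involves, the sketch below decodes a PQ (SMPTE ST 2084) signal value to absolute luminance and then compresses it with an extended Reinhard curve. This is not Vimeo’s tone mapper: the choice of curve, the 100-nit SDR white, and the 1000-nit peak are assumptions made for the example, and it handles luminance only.

```python
# PQ (SMPTE ST 2084) constants.
M1 = 2610 / 16384
M2 = 2523 / 4096 * 128
C1 = 3424 / 4096
C2 = 2413 / 4096 * 32
C3 = 2392 / 4096 * 32


def pq_to_nits(signal: float) -> float:
    """Convert a normalized PQ signal in [0, 1] to luminance in nits (cd/m^2)."""
    e = min(max(signal, 0.0), 1.0) ** (1.0 / M2)
    return 10000.0 * (max(e - C1, 0.0) / (C2 - C3 * e)) ** (1.0 / M1)


def tone_map(nits: float, sdr_white: float = 100.0,
             hdr_peak: float = 1000.0) -> float:
    """Map HDR luminance to a linear SDR value in [0, 1] (extended Reinhard).

    hdr_peak would normally come from the stream's metadata; here it is an
    assumed default. Luminance above hdr_peak is clipped.
    """
    x = min(max(nits, 0.0), hdr_peak) / sdr_white
    w = hdr_peak / sdr_white  # input level that should map to SDR white
    return x * (1.0 + x / (w * w)) / (1.0 + x)


if __name__ == "__main__":
    for signal in (0.0, 0.25, 0.5, 0.75, 1.0):
        nits = pq_to_nits(signal)
        print(f"PQ {signal:.2f} -> {nits:8.2f} nits -> SDR {tone_map(nits):.3f}")
```

Using the mastering peak from the metadata as `hdr_peak` is what lets a curve like this keep highlights from clipping while leaving darker regions close to their original level.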
Nathan: Very cool. Slightly different topic. Something that always comes up when we’re evaluating video quality, encode quality, when we’re getting into the nitty gritty is of course, VMAF. It’s always a topic that comes up. Can you give us kind of a quick high-level view of what VMAF is for anybody that doesn’t know what it is? And then talk to us a little bit about how you guys are using it. I know it’s got a couple new tricks. Things that it’s doing for us and maybe why it’s important. That was a lot, I know. But if you can start us off with kind of what is VMAF and the importance of it.
VMAF and its importance
Thomas: So VMAF is one of many attempts at an objective visual quality metric, so.
Nathan: I like how you worded that.
Thomas: Yes. So basically, the problem for the longest time is how do you look at a video and tell that it’s actually good? Like, how do you give it a measure of its quality? So there’s been many attempts at this. The oldest and still most commonly used is PSNR, where we just compare the original video to the compressed video. We determine the mean squared error across each pixel of the image, average that together, and it gives us basically an overall number. But the downside is that it’s very inaccurate. It’s a very easy number to compute, and it’s repeatable, but it doesn’t really match at all what viewers experience. Viewers don’t see raw pixel number differences. They see whole objects and textures and walls and things, which will make PSNR give very misleading answers about how good a video actually looks.
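The PSNR computation described here is short enough to write out. The following Python sketch assumes 8-bit samples (peak value 255) and treats a frame as a flat list of pixel values; the function names are illustrative.

```python
import math
from typing import Sequence


def mse(reference: Sequence[float], distorted: Sequence[float]) -> float:
    """Mean squared error between two equally sized lists of pixel values."""
    if len(reference) != len(distorted) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)


def psnr(reference: Sequence[float], distorted: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite when the inputs are identical."""
    err = mse(reference, distorted)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


if __name__ == "__main__":
    original = [100, 100, 100, 100]
    compressed = [101, 99, 101, 99]  # every pixel is off by one
    print(round(psnr(original, compressed), 2))
```

Because the score depends only on per-pixel differences, two distortions with the same mean squared error get the same PSNR no matter how differently they look to a viewer.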
So there’s been many attempts to make a better metric. And the other thing about PSNR is that it doesn’t take into account any sort of motion. You’re looking at each video frame as if it were a separate image. So VMAF is one of the most recent attempts at this. VMAF basically works by taking several different metrics together, including one metric that’s actually based on motion. So it compares previous to next frames, and it combines all those scores together with a machine-learned model in order to produce a final score. And it’s one of the best performing metrics. For example, you get a bunch of humans in a room, and you get them to rate the quality of a video from, say, one to five: does it look terrible, is it perfectly watchable, etc. And if you correlate that to the VMAF score generated, it has a very strong correlation. So VMAF is currently the best tool that we have for this.
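Checking how well a metric tracks human ratings comes down to a correlation between two lists of numbers. Here is a minimal Python sketch using the Pearson coefficient; the ratings and scores in the usage section are made up purely for illustration and are not real subjective test data.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical data: mean viewer ratings (1 to 5) for five clips and the
    # scores an objective metric gave the same clips.
    viewer_ratings = [1.2, 2.1, 3.0, 4.2, 4.8]
    metric_scores = [22.0, 41.0, 63.0, 80.0, 95.0]
    print(pearson(viewer_ratings, metric_scores))
```

A value close to 1 means the metric ranks and spaces the clips much like the viewers did; rank-based coefficients are also commonly reported alongside it.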
Zoe: Well, I’m always saying that video has a third dimension compared to images, and a lot of times the third dimension is easily ignored. It’s not just a slide show, because of the time domain. So a lot of the time, we need to consider the quality along time. It’s mainly the consistency frame after frame. And so per-frame quality alone often cannot really represent the whole picture, quality-wise.
Thomas: Exactly. For example, I don’t know if you’ve ever joined a live stream where you’ll see the quality pulse down enormously. What happens is that, for example, if you have a fixed bit rate and you send a keyframe or an I-frame, it has to send a lot. There’s no prediction available, so it’s much harder to compress, but it still has to fit in the same number of bits as the previous frames. So you’ll see this awful quality pulsing: quality drops, and then it comes back, and then drops and comes back. And PSNR sees that a little bit. But the pulsing is a much more offensive visual issue than PSNR can see. VMAF can see that, because it compares previous frames to next frames. The motion metric inside VMAF is still very simple, and it’s one of several parameters that go into the final machine-learned model. I think there’s a lot of research and improvement that can be done for a future version of VMAF, for example, that could massively improve that even further. We’re still only at the beginning of doing motion analysis for the purposes of video quality measurement.
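To give a sense of how simple a temporal feature can be, the sketch below computes the mean absolute difference between consecutive frames. This is an illustrative stand-in, not the motion feature as implemented in libvmaf; frames are represented as flat lists of luma values and the function name is hypothetical.

```python
from typing import List, Sequence


def frame_differences(frames: Sequence[Sequence[float]]) -> List[float]:
    """Mean absolute luma difference between each frame and the one before it."""
    scores = []
    for previous, current in zip(frames, frames[1:]):
        if len(previous) != len(current) or not current:
            raise ValueError("frames must be non-empty and the same size")
        total = sum(abs(c - p) for p, c in zip(previous, current))
        scores.append(total / len(current))
    return scores


if __name__ == "__main__":
    # Three tiny 4-pixel "frames": a jump in brightness, then no change.
    frames = [[10, 10, 10, 10], [30, 30, 30, 30], [30, 30, 30, 30]]
    print(frame_differences(frames))
```

A per-frame metric such as PSNR never looks at this kind of frame-to-frame signal, which is why it has no way to register changes over time.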
Zoe: Yeah, because, for example, Thomas, you also mentioned your lab’s contribution to the open-source community, and now we also have VMAF, which is also open source, as you just brought up. And regarding the quality drop you mentioned just now, one analogy I’ve been told is that it’s just like acupuncture. Every needle may feel fine, but then suddenly there’s one that really hurts a lot. And that’s the one that stays deep in your memory, and you forget all the others. It’s similar when you watch a video: if you sometimes see a sudden quality drop, you will not have a good enough impression of the quality overall.
Thomas: Yup. Yeah, having the VMAF open-source implementation is great. For example, we use that implementation at Vimeo. I’m sure many other people do. It has an FFmpeg filter if you wanna use it inside the FFmpeg command line. Actually, it is called libvmaf, but it also has implementations of other metrics as well. So if you wanna do PSNR, if you want to do SSIM, if you want to do MS-SSIM, it has implementations of those as well. So it’s a handy all-in-one library you can use with the same API and get quite a number of metrics out of it. So if computing VMAF is too slow for you, you could consider, for example, using MS-SSIM, which is not as good a metric but much faster to compute, if you want to get an enormous number of measurements on a wide variety of videos. So it’s a really nice project to have.
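For readers who want to try the FFmpeg filter mentioned here, the sketch below builds a typical command line in Python. It assumes an FFmpeg build compiled with libvmaf support; the file names are placeholders, the helper `build_vmaf_command` is hypothetical, and the available filter options vary between FFmpeg versions, so check the documentation for your build.

```python
import shlex
from typing import List


def build_vmaf_command(distorted: str, reference: str,
                       log_path: str = "vmaf.json") -> List[str]:
    """Build an FFmpeg argument list that scores `distorted` against `reference`.

    The libvmaf filter takes the distorted video as its first input and the
    reference as its second, and writes per-frame scores to `log_path`.
    """
    return [
        "ffmpeg",
        "-i", distorted,
        "-i", reference,
        "-lavfi", f"libvmaf=log_path={log_path}:log_fmt=json",
        "-f", "null", "-",  # discard the video output; only the scores matter
    ]


if __name__ == "__main__":
    command = build_vmaf_command("encoded.mp4", "source.mp4")
    # Print a copy-pasteable shell command. To run it from Python instead,
    # pass the list to subprocess.run(command, check=True).
    print(shlex.join(command))
```

Both inputs need the same resolution and frame rate for the comparison to be meaningful, so a scaled encode usually has to be resized back to the reference size first.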
Zoe: And I’m now aware. Thank you for bringing that up. I thought libvmaf was just doing VMAF.
Thomas: Nope, it does all sorts of things. Actually, when we test in AV1, when we test video codecs and video features, we use VMAF as the implementation of all our metrics. So we have metrics that aren’t VMAF, but for those we also use VMAF’s implementation.
That’s actually a little bit important, because some of these metrics were defined in what are now very old papers. And so either they had just a definition of the metric written in the paper, or they might have had a MATLAB implementation. But the problem was that when people reimplemented these metrics in their own code bases, there were some disagreements and differences in how they implemented them. For example, the original version of SSIM was only made to work on 512 by 512 pixel black and white images. And so when people adapted it to bigger images and color images, they decided to do slightly different things. And so if you use different implementations of a metric, especially for the older ones that were less well-defined, you might get different numbers. Having everyone use the same reference implementation avoids that issue, because we know that it’s always gonna be VMAF’s implementation when we compare tools.
Zoe: Got it. Thank you for this, kind of like Nathan said. I got to learn.
Nathan: It sounds like VMAF is, before this conversation, I just kind of thought VMAF was kind of a standard static thing, but it sounds like it’s very dynamic. It’s growing, they’re developing, it’s changing. It’s much like AV1: people are contributing to it now, and it’s constantly changing to adapt to all of our standards. Is that fair to assume?
Thomas: Yup. When I say VMAF, inside VMAF there are actually multiple machine-learned models you can pick. When people just say VMAF unqualified, there’s a particular machine-learned model, the VMAF 0.6.1 model, that they’re usually referring to. But there have actually been several improvements since then. After the very first model was made, people figured out you could actually cheat VMAF. You could do some operations like sharpening your picture and some other simple operations that made your VMAF score go up, but they didn’t actually make the video look better in any way. That’s an indication that when they initially trained VMAF, they didn’t train it on a big enough set that included these sorts of distortions. So they actually retrained it, and they made a model called VMAF NEG, with additional distortions included in the training set, and that avoided a lot of these initial blind spots in the model, where you could cheat it in ways that clearly didn’t correspond with real video, real user experiences. That’s actually not
Yeah. That’s not just VMAF’s problem. VMAF failed to include some sorts of distortions in their video set, and therefore it had blind spots, but it’s an issue that happens with every objective video quality metric. For example, PSNR, as I mentioned, is the worst one. PSNR isn’t a machine-learned model at all, but there are still many things you can do that make PSNR go up and make video quality get worse. The most classic example is blurring. If you apply a very strong denoising or blurring filter, well, PSNR generally hates noise. So you can smooth out your image until it looks like an abstract painting, and PSNR will tell you it looks awesome as a result, but it will actually look terrible. Or, I mean, if you really like the abstract painting look, it can look good, I guess.
Nathan: Yeah, sure. If you’re going for that look.
Zoe: I have one final follow-up question for you, Thomas. If you went back to the time when you were trying to get something up in the air, for example, and doing the video communication for it, what kind of technology would you choose at this moment?
AV1 is the chosen technology
Thomas: Oh, I would use AV1, absolutely. But yeah, the great thing is that we do have better radios now. At the time, I was using an 802.15.4 radio, which, if you’re not familiar, has very little bandwidth. It’s like 100 kilobits, barely better than dial-up. And I could use that exact same radio nowadays, and you can squeeze a totally watchable AV1 stream through that amount of bandwidth, which is pretty incredible. I’d probably jump to a better radio nowadays, but for AV1, I would actually consider either a software or a hardware encoder, because there are plenty of ARM chips that have fast enough CPUs that they can do a software encode of AV1 in real time. And there are also hardware vendors now that have hardware AV1 chips. I’d have to do a comparison between those for the real-time use case. And I think both of those would be totally viable for an FPV platform.
Nathan: Man, this has been fascinating. Super fun to kind of pick your brain and hear what you’re working on. You’re on the front lines, dealing with real-life content here. And so thank you so much, Thomas, for joining us and letting us just hear what’s going on. Thank you for educating us and sharing with us so many of these really cool different things that are going on. And I hope we get to have you on here again. Best of luck to you in the open-source community and in the FPV world as well.