The VideoVerse

TVV EP 25 - Jill Boyce: Video Codec Pioneer

June 11, 2024 Visionular Season 1 Episode 25

Jill Boyce, Distinguished Scientist at Nokia, is one of the most significant and long-standing contributors to MPEG compression standards. In this episode, we speak with her about her work and experiences, from MPEG-2 through to VVC and beyond. She explains how focusing on what really matters to the users of standards is central to her work, and how having a background in real implementations has been invaluable to her. What does the video world of tomorrow hold for us?


Welcome to The VideoVerse.

Zoe: Okay, hi everyone. Welcome back to The VideoVerse Podcast. For this episode, I'm your host, Zoe Liu from Visionular, and I have Thomas from our team, based in London, joining me as co-host. Hi.

Thomas: Hi, there.

Zoe: Yes. And for this episode, it's a great honor for me to invite Jill Boyce. We have been friends, and I've known Jill for a long, long while, and I'm very happy that recently we've had a lot more in-depth discussions. So I'd like Jill to introduce herself, and then we can go from there. Jill?

Jill: Yeah. All right, thanks, Zoe. I've known Zoe and Thomas for a number of years. I've had a long career in the video area doing standardization, actually going back to MPEG-2, which was the first standard I worked on.

Zoe: MPEG-2, okay.

Jill: Yeah, I actually participated in MPEG-2 standardization a little bit, then H.263, H.264 AVC, SVC (the scalable extension), HEVC, VVC. I was also involved in AV1 standardization a little bit at Intel. I recently joined Nokia in February, after a couple of years at a startup. I was the CEO of a startup called Merse that was working on converting pictures and videos into 3D video from regular video. I found doing a startup was really hard. It was really challenging, and I give a lot of credit to Zoe for what you've managed to do at Visionular. So I left Merse, but the company is actually still continuing; my co-founder Bazil is still making progress there. But I decided it wasn't for me and decided to get a job again.

But if we talk about my whole career, like I said, it goes back to MPEG-2. I've had involvement in standardization, in product development, and in research generally. Standardization has generally not been my full job; it's only been part of it. Before Merse, I was at Intel for a number of years, in the group that built the video hardware that goes inside Intel's chips, integrated in the CPUs or GPUs. My day job was really about selecting features and making the high-level architecture decisions for what would go inside those chips.

And prior to that I was at Vidyo, V-I-D-Y-O, with the Y, which did video conferencing technology. I was leading what was called the algorithms team, doing algorithm development and software implementation in the software video conferencing products.

Thomas: So you've straddled both software encoders and hardware encoders. What would you say are the different challenges of those applications?

[00:03:50 The different challenges of software and hardware encoders]

Jill: It's not even just encoders but decoders as well, right? Encoders and decoders, pre- and post-processing too. But between hardware and software, the number one biggest thing is the timescale. Hardware is just such a long cycle that you're making decisions well in advance of when the products will actually be used, and in fact in advance of what customers are asking for. You can't wait until a customer asks you for a feature. I used to say I'd have to pull out my crystal ball and predict what customers were going to want, so that we could start working on a hardware implementation that wouldn't reach customers for three years or more.

And that's one of the reasons I always felt doing both product work and standardization was really useful: the standards body is the best glimpse into the future that you have. You know what's coming, and even beyond what comes out, knowing which companies are interested in particular areas, who's there caring about a topic, gives you a lot of information about how important something is going to be later.

Thomas: Yeah, and having that insight on how hard it actually is to implement X, Y or Z is actually quite a rare commodity in standardization sometimes. You can get people from a pure R&D background who know in theory that something may or may not be complex, but some things are much harder in reality.

Jill: Yeah, and I think having exposure on both the software and hardware sides is really useful in that regard, because sometimes what's hardest in hardware and what's hardest in software are not the same thing. Hardware is all about the worst case: worst-case timing and how fast you can do operations, even if you're willing to throw more hardware at it. What is parallelizable, what is inherently serial, some of those issues. Whereas software often cares more about averages than the worst case, and it's inherently more serial. There's some amount of parallelization, but it's more serial than in hardware.

Thomas: Yeah, and one major difference is that there are particular pinch points in software: supporting 10-bit video is suddenly a lot more expensive in software, but it's only two extra bits in hardware. It's not that bad.

Jill: Software doesn't have a 10 bit data type, it has a 16 bit data type.

Zoe: Right. So, talking about software and hardware, I want to go back to you, Jill, because you are now at Nokia, right? Still leading, or at least engaged in, standardization. But you also mentioned you started all the way back with MPEG-2. We just talked about how for hardware, and even for standardization, you have to look forward; whether or not you have a crystal ball, somebody like you at least has vision into the future. So, standing at this point, we'd like to learn two things.

One is what you are engaged in now, what kind of standardization work at this moment. The second one: I'm curious, looking back to when MPEG-2 was built, how do you see it from today's perspective? At this moment we all know H.264 AVC has become ubiquitous and MPEG-2 is gradually becoming history, but we still have customers using MPEG-2, so it's definitely one of the most influential standards in history. So we'd like to learn what you are up to today and, looking back to when MPEG-2 was published all those years ago, why it was able to make such a big impact.

[00:08:15 How MPEG-2 made such a big impact]

Jill: Okay, so I'll start with the more current stuff, actually. It's been fun going back to the JVET/MPEG world after a couple of years away; when I was doing the startup, it wasn't my focus to participate, so I've had more acronyms to learn, and a new set of coding tools. JVET now is working on ECM, toward the next-generation standard after VVC. That will presumably be H.267, but there are two efforts now: ECM and neural network video coding. The other thing people might not be as aware of is that there's actually a lot of activity going on in SEI messages for VVC.

So even though the coding tool work is all primarily on the future standard, VVC has this companion standard called VSEI, which is where the SEI messages live. These carry optional information to enable specific use cases: HDR, scalability, multi-view, all kinds of use cases. There's continued activity on SEI messages, and a new version of VSEI will come out at some point, probably in the next year. At the last meeting, a JVET/MPEG meeting in late April, I actually ended up chairing a lot of sessions on high-level syntax, which is stuff I used to do a lot in the past, and after being away for a few years it was fun to step back into a familiar role.

I had contributions myself as well. For example, I had a contribution for region packing: if your application only cares about certain regions of interest inside your video, the encoder could extract those specific regions, pack them together into a much smaller video frame, code that, and then send information that would allow a decoder to know where those regions were, and optionally send information to reconstruct a target picture showing where each region had come from. It's just an example; there are certain use cases, especially where machine consumption rather than human vision is the target, where this is useful.
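As a rough sketch of the idea just described (not the actual SEI syntax, and all names here are illustrative), an encoder-side packer might crop the regions of interest, tile them into a smaller frame, and emit per-region metadata so a receiver can place each region back at its original position:

```python
def pack_regions(frame, regions):
    """Crop each (x, y, w, h) region from a 2D list of samples and stack
    the crops vertically into a smaller packed frame. Returns the packed
    frame plus placement metadata (the role the SEI message would play)."""
    packed, metadata = [], []
    offset = 0
    for (x, y, w, h) in regions:
        crop = [row[x:x + w] for row in frame[y:y + h]]
        packed.extend(crop)  # rows may be ragged if widths differ
        metadata.append({"src": (x, y), "size": (w, h), "packed_row": offset})
        offset += h
    return packed, metadata

def unpack_regions(packed, metadata, out_w, out_h):
    """Reconstruct a target picture, placing each decoded region back at
    its original position; unpacked areas stay at a fill value of 0."""
    out = [[0] * out_w for _ in range(out_h)]
    for m in metadata:
        (x, y), (w, h), r = m["src"], m["size"], m["packed_row"]
        for dy in range(h):
            out[y + dy][x:x + w] = packed[r + dy][:w]
    return out
```

The packed frame has only as many rows as the regions need, so it is cheaper to code, while the metadata preserves enough information for optional reconstruction.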

Zoe: Yeah, and this is on top of VVC, right? The VVC core is already there, but they have left room, with these SEI messages and high-level syntax, for flexibility, so that things can be added later and still be specified at the standard level.

Jill: Right, exactly. The VVC core coding tool standard is efficient for lots of different things. It also has high-level syntax within VVC itself, and I've done a lot of work on high-level syntax in terms of enabling different types of applications. When I was at Vidyo, we were doing scalable HEVC as well as scalable AVC, SVC. At Intel we cared a lot about making sure a standard could address a wide range of applications, because we were building hardware and we wanted it to be targeted and used in lots of different places. So being able to address a wide range of applications has always been an area of interest for me.

So there's high-level syntax inside the core codec itself, allowing encoders to turn tools off and on, enable a particular coding tool, have multiple layers for scalability, et cetera, and that's all part of the standard itself. The SEI messages prior to VVC were in the same standard document; HEVC had its SEI messages in the main spec. But for VVC, the choice was made to put them in a separate document, this VSEI document, so that any standard, future or even past, could refer to those SEI messages instead of them having to be rewritten and defined for every single one. But anyway, the point I'm trying to make is that what I'm working on now at Nokia is actually a little different for me, in that standardization is my main focus; I'm in a group that is more standardization-focused.

This is actually the first time I've been on such a large team focused on standards. More often I've been in a product group or research group where people were working on all kinds of things, and standards was just a little piece of what we did, or relatively few of us on the team were working on it. But Nokia has a larger team, participating not just in JVET activities but in other MPEG activities as well. So it's been fun to have had both, and I do find I bring an interesting perspective from having had more product experience than some of the people who have been focused purely on standardization.

Thomas: Yeah, I think one thing that some of our viewers might not be aware of is how intensive the standardization process is. At the height of standardization you might have a meeting taking seven straight days of 10-plus hours-

Jill: More than seven. There were times it was even like nine days or something. I think it's like eight days now is what they're scheduled for.

Thomas: And then you might have an ITU meeting before that and ad hoc group meetings and so on so you're away for a long time. You're working very, very hard in the basement of the ITU tower or something like that. So it's kind of a whole lifestyle. Is that something that you really thrive on? Do you enjoy that intensity?

Jill: Well, it's funny, I've had different phases in my career: I would do it for a while, then go away for a bit, and then come back a little more revitalized. I have not continuously been to every single meeting since I started going. I have young adult kids now, but I had kids while I was working, and there were times when I would stop going for a while when my kids were really young before coming back.

But I do feel that when you're there at the meeting, that level of energy is suddenly higher. I probably couldn't push myself as hard at home for that number of hours, but when you're there and you have these deadlines, it is very intense. Something I've gained from having participated in standardization over time is a good sense of where you're aiming to get to. For a particular meeting cycle, a particular meeting or several of them, here's a goal we need to reach by the end of this meeting, or by the end of a year or two years. What do we need to be doing now to get ready to do those things?

I've served for many years as an ITU-T Associate Rapporteur. People might not quite understand: JVET has two parent bodies, MPEG and the ITU-T, that come together to form JVET. The recent H.-whatever numbers are ITU numbers, as opposed to the MPEG numbers; the MPEG ones are a little more random. I was Associate Rapporteur of Study Group 16 Question 6, which is VCEG, the parent body on the ITU-T side.

And when it was an ITU meeting where actual standards were going to be consented, which is the term ITU uses, we'd have to have all these documents ready at the end of the meeting. So at the very beginning of the meeting it would be: okay, what are all the documents, and let's get people assigned to make sure everyone is working toward meeting this goal. Because you can't, the day before, say, "Oh, where are the documents?" So people would not only be attending the meeting and listening to contributions and feedback, they'd be doing their homework on the side, preparing the documents for standardization.

Thomas: And the discussions themselves can often lead to you doing experiments in real time alongside the process. I guess that's a bit more difficult with VVC, with it taking longer to simulate things.

Jill: Yeah, the experiments take longer, but it still sometimes happens that there are these quick experiments. And there are definitely cases where multiple companies have proposed something fairly similar, and the group decides to take option A from this company, option B from that company, and option C from another, and then those three companies have to get together, work out how the options would all fit together, and prepare documents. There's a lot of that kind of activity. If you look at the JVET documents, there's an online registry, open to the public, anybody can get to them, and you can look at the document numbers and when they were submitted. When you start getting to the high numbers, you know these were things done during the meeting, based on the activity of the meeting.

There's a deadline about a week in advance of the meeting, and contributions are supposed to be in before then, but if you see a bunch of contributions being registered during the meeting, you know they're in response to activities that have gone on during the meeting itself. But let's go back to Zoe's earlier question about the MPEG-2 days. Literally at the very beginning, at the very first meeting I went to, you had to actually bring paper copies of your contribution, a whole stack of them.

Zoe: That's very different from what you just described.

Jill: You had to bring something like 50 copies of your paper contribution. And things like common test conditions were not defined, bit exactness was not a requirement, so things were a little looser, and as time went on the working methods just got better and more rigorous. I really do believe in the JVET way. JVET had various earlier incarnations, like JVT and JCT-VC; there was a lot of stupid politics between the two parent bodies, MPEG and ITU-T, mostly on the MPEG side, that led to these teams being renamed, but they've been joint teams. H.264 was the first really jointly done standard. H.263 and MPEG-4 Part 2 had a little bit of joint activity, but H.264 AVC was the first true joint effort between them from the start.

Although "from the start" is maybe a bit of a misnomer, because ITU-T was actually doing pre-efforts before that. But I feel like the H.264 AVC era is when we really started fine-tuning the methodologies. Things like common test conditions we now just take as accepted and expected, but that's just so important, right? In the MPEG-2 days, people would basically cherry-pick the test sequences they brought. It's like: I have this thing to propose, and here are some sequences, which maybe you don't even have access to, that show how great it is. You could not do as much of an A-versus-B comparison between proposals with a similar purpose.

So having the common test conditions, and everybody working from shared information, matters; you're always free to provide additional data on top. Years ago I worked on a tool called weighted prediction, which was specifically good with fade-ins and fade-outs, although it's actually useful for other purposes as well. This came out of personal observation: I was a very early adopter of the TiVo DVR, and I'd be watching, and any time there was a fade-in or fade-out scene, I would just go, ugh. And my husband would say, "What? What are you saying?" And I'm like, "Did you see that? Did you see the blockiness during this fade? It just looks horrible." The prediction from one frame to another was just terrible.

But anyway, the point is that in that case I specifically made fade-in and fade-out sequences, brought data for them, and said, "Look, I tested on the common conditions with the tool disabled, it does no harm, and here I have this extra information where it was better." You're always free, if you're targeting something specific, to bring your own test data. But it's so important, when people are working on coding efficiency, that you have an apples-to-apples comparison and everything is tested on the same content.
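To make the fade example concrete, here is a small sketch of explicit weighted prediction in the style of H.264/AVC: the reference sample is scaled by a weight, rounded, shifted down, offset, and clipped. The particular weight and offset values below are illustrative, not taken from any real bitstream.

```python
def weighted_pred(ref_sample, weight, offset, log_wd=6, bit_depth=8):
    """Explicit weighted prediction, H.264/AVC style: scale the reference
    sample, round, shift by log_wd, add an offset, and clip to range.
    During a fade-out, an encoder can pick weight < (1 << log_wd) so the
    prediction tracks the dimming reference frame."""
    rounding = 1 << (log_wd - 1) if log_wd > 0 else 0
    pred = ((weight * ref_sample + rounding) >> log_wd) + offset
    return max(0, min(pred, (1 << bit_depth) - 1))

# A fade to black at half brightness: weight 32 with log_wd 6 halves the
# reference sample, which plain motion compensation cannot model.
assert weighted_pred(200, 32, 0) == 100
assert weighted_pred(200, 64, 0) == 200  # weight 64 is the identity
```

Without the weight, the predictor would guess 200 for a sample that has faded to 100, and the large residual shows up as the blockiness described above.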

Thomas: Yeah, and that puts a lot of emphasis on getting the right sequences in your common test conditions and having a whole effort to collate video. It's always really difficult to agree on that and really difficult to get people to contribute video because often the content providers have all kinds of copyright issues that prevent them from giving things. So back when I was working for the BBC, actually we couldn't persuade the BBC to give up any real content so we had to shoot research content to try to contribute to the process.

Jill: Yeah, that's always hard to do. When I was at Vidyo, we shot some sequences, because we were working on video conferencing, and we contributed some things. I'm in one of the sequences, the four-people sequence. But it is always an issue. Another thing is the danger that you're tuning too much to those specific sequences. That's actually why, at the end of the standards process, they always do a subjective test with new content that was not part of the original test set. And there are always special categories, like screen content. There's activity now with gaming content. Gaming content is screen content in a way, but it's very different from a PowerPoint presentation.

Right? Fast-moving action, a lot of contrast, sharp edges, et cetera. So in JVET right now, for ECM, they are trying to define a test set specifically for gaming content.

Zoe: Right, so this growth of content categories also represents how video technology has been applied to more and more fields, and how more and more different use cases bring requests back to standardization, right? Because of the new use cases, like you also mentioned, there's multi-view, and there are these modern media experiences, like 3D spatial video and 360 video, coming to the table.

Jill: There are so many use cases, and with people who are newer to standardization, you sometimes have to beat them over the head with the fact that we're defining a bitstream format and a decoder; encoders are free and flexible, and you're defining things to work with an encoder that hasn't been invented yet. It's funny, I've sometimes had cases, especially outside JVET on some of the MPEG-specific topics, with people who have less JVET experience, where they'll say, "Well, the test model encoder doesn't do this." And I'm like, you're not defining the decoder that way. You're not limited to the test model software encoder; it's any possible future encoder that has not been invented yet. Just assume this bitstream appeared by magic. We have no idea where this bitstream came from; can we properly define the decoder operation, so that an encoder will know exactly what the decoder will do?

Thomas: Yeah, that's a particular issue with the high-level syntax aspect of things because actually there's a vast amount of capability that's embodied in that syntax that you cannot possibly test during the standardization process.

[00:26:55 A particular issue with high-level syntax]

Jill: Yeah, right. I've actually been heavily involved in conformance test bitstream generation. At Intel, for example, I ended up chairing the conformance work for VVC. It's so important in hardware to get bitstreams that exercise as much freedom and flexibility as possible, and not just what the reference software itself did. So there was this whole effort; again, this is about looking ahead to the future, what am I going to need? I tried to get involved very early and say: everybody who had a tool adopted, you are responsible for creating a conformance bitstream that exercises those capabilities. I knew that in past standards the conformance effort had kicked off a bit too late, and by the time they did it, people had moved on to new projects and new areas, people had left the companies where they'd worked on something, and the expertise to truly know the flexibility of that tool was lost.

So we learned, and in VVC we started that effort super quickly and really set the expectation: if you or someone has brought a coding tool, it is your obligation to provide conformance bitstreams that exercise the broad range of capabilities of the tool. With coding tools that's possible. With high-level syntax it's even harder, in particular because there's not the same normative output required.

Right, so let me distinguish different types of high-level syntax. For the syntax that turns tools on and off, the part that's normative to the standard, yes, that absolutely needs to be done, and we would force people to create bitstreams that might be totally inefficient and totally stupid. No encoder would choose to randomly select every intra prediction direction, yet people would have to modify an encoder to do nonsensical random things just to make sure the capability was exercised. And in that exercise we would always find bugs in the reference software, so it's just so much better to get that stuff done earlier.
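The "nonsensical random encoder" idea above can be sketched as a mode-decision hook: rather than picking the rate-distortion-optimal mode, a conformance-stream encoder picks coding modes at random so that, over enough blocks, every decoder path is exercised. The function and mode count here are illustrative, not any real reference-software API.

```python
import random

def stress_mode_decider(allowed_modes, rng=random.Random(42)):
    """Conformance-stream encoder hook (illustrative): instead of
    rate-distortion optimization, pick each coding mode uniformly at
    random so all decoder code paths eventually get exercised."""
    return rng.choice(allowed_modes)

# Over many blocks, every intra direction appears at least once, even
# though no rational encoder would choose modes this way.
modes = list(range(35))  # e.g. the 35 HEVC intra prediction modes
chosen = {stress_mode_decider(modes) for _ in range(5000)}
assert chosen == set(modes)
```

A fixed seed keeps the generated stream reproducible, which matters when the same stress bitstream has to be regenerated after a software fix.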

Thomas: There could also be bugs in the specification itself, in the sense that, especially when you combine these things in unusual ways, what's been put into the spec is not really what was intended.

Jill: I would say I found on the JVET side, at least by the time we got to HEVC, it was more likely a software error than a spec error, simply because we were very diligent about it, and there was this assumption that when in doubt, the spec is right and the software has to be changed to match the spec. When I participated in AOM, it was the other way around: the assumption was that the software was right, not the spec, and so sometimes the spec was not precise or clear enough.

Thomas: Yeah, I mean the spec came quite late in that process. I think in the current AVM process we're producing things on a much more consistent basis, updating the spec from the contributions, but it is still a less formal process than what JVET has followed.

Jill: But again, that has changed over time within JVET. Now there's this expectation that the contributor brings the spec text, whereas at one point, even in the AVC days, so much was left to the editors. Someone would just generally describe the tool, and the editors would have to figure things out. Then we started being more strict: give us your syntax, and we need the semantics and the decoding process as well. The more you put the burden on the contributor, the better off you are: it makes for better contributions, because they've had to really think through the implications, and it's also so much easier for the group to make a decision, because you really understand what is being proposed to be changed. And it's hard, actually, because sometimes there are encoder-only changes, or even if it's a decoder change, it's really the encoder change that's making the difference in the results.

But it's only the decoder part that goes in the spec, right? That's what goes in the actual standard. So if I'm chairing, sometimes you have to ask that clarifying question: okay, this is what's in the spec, right? You're adding this syntax element, you're making this change to the decoding process, but this other stuff you're talking about, you did in the encoder. Well, that's just the software encoder, right? That can change. You need to do that encoder work to prove there's a benefit; otherwise it's not clear that there is one, and if there's no benefit, why would we adopt the change?

Zoe: True. We've talked quite a bit, but going back to the question I raised: from early on up to now, the standards have come to represent more and more different use cases, the technology has kept evolving, and on top of that, so has the way the standards are made. Just now, Jill, you mentioned that previously there wasn't even a common test set. Now it's taken for granted, and nobody would even think about working on a standard without one. Everybody takes it for granted, but it wasn't there at the very beginning.

And the second point: you mentioned that when you develop a tool, you want the bitstream conformance mechanism in place at a very early stage so you can find issues early, and you try to dig out the full potential of the tools that are part of the standard. You also mentioned that because the standard only standardizes the decoder behavior and the syntax, you cannot manifest the potential of the new tools without an encoder. With all of this, the process seems to be getting more and more mature, just from the procedural point of view.

But we can all see the technology is booming. You mentioned that what you're involved in now is not only ECM; you're also doing the neural network work. And you mentioned that on top of the existing standard, VVC, which was finalized in 2020, almost four years ago, there's still work ongoing. So I'm curious, taking neural networks specifically, since everybody is talking about AI: how is AI actually influencing standardization right now? Will the traditional standards suddenly go away and neural networks take over? It doesn't seem like it; we've had other speakers talk about that too. So what's your view at this moment, given you're involved in three different work streams just for standards?

[00:35:05 How AI influences standardization]

Jill: Well, I mean, it's interesting. Right now, at least in JVET, ECM is kind of the more traditional, non-neural-network tools, and then this neural network video coding is using the neural network tools. And it's interesting in that with the neural networks, the decoder side is a lot more complex, at least if you measure using the traditional calculations of complexity, than the traditional kind of tools are. But on the encoder side, it isn't necessarily so much more complex. People were telling me that the neural network video coding software was actually running faster on the encoding when you were just doing the inference part. If you weren't doing the training, just inference, it could run faster, because the ECM just has so many tools and options.

So there's no question that neural networks are being used and will continue to be used for encoding algorithms. You hear a number of companies talking about products with it; I think you guys have even talked about that. But will there be things in the standard? Well, there already is an SEI message to carry information about a neural network post-filter. It's uncertain whether the next standard, H.267, will specifically have standard-essential neural network tools that the decoder is required to do. My best guess is that there'll be a profile that has neural network tools and there'll be profiles that don't, to allow some kind of freedom for both.

So what would a pure neural network codec specification look like? Because these are often implemented in floating point with thousands or millions or billions of weights, so what are you even going to specify?

Jill: Yeah, well, in JVET there's actually this software library that allows you to do a bit-exact fixed-point neural network implementation that can then be used for strict comparisons, but it runs slow. I think the whole bit exactness on the decoder is one of the real challenges of the neural network approach, because they tend to be floating point and tend not to be repeatable, so how do you define that? And if you start having neural network processing of things, it starts affecting quality in different ways. So I think going forward there needs to be a little bit more use of different types of quality metrics that consider subjective quality; the traditional metrics we're using now are much more about fidelity to the original sequence.

But when you do other things, film grain is an example: you've modeled the film grain and added it afterwards, so a metric of fidelity to the original is no longer the appropriate measure if you're going to measure film grain quality. It's more about the look and feel, how much texture there is. So I think for some of these neural network things, we're going to have to have different quality metrics that can really ask: does it look reasonable? Does it look right?

Do people have the right number of fingers?

Jill: Yeah, the right number of fingers, or does the face look like a face? Does the water look like water? Does the grass look like grass, or does it look totally smooth? There's actually an effort going on in JVET called the generative face video SEI. I haven't been involved in it myself, but I find it fascinating. If you want to code video that was AI generated, you can do it very efficiently because it's known that it was generated by a model, but they're using some different metrics for measuring the quality, more about how the amount of texture compares generally and less about fidelity to something, because the thing you'd even be comparing it to was itself artificially created. So the full-reference quality metrics, the PSNR/MSE-based ones, even SSIM, are not necessarily the right thing in all cases.
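[Editor's note: the full-reference fidelity idea Jill describes is easy to make concrete. The sketch below is purely illustrative and is not from any JVET software; the pixel lists are invented. It shows why PSNR rewards pixel-exact fidelity: two signals with the same "look" but different pixels still score a finite PSNR, while only an identical copy scores infinity.]

```python
import math

def psnr(reference, distorted, max_val=255):
    """Full-reference PSNR between two equal-length pixel lists.

    Like MSE and SSIM, this measures fidelity to the reference,
    which is exactly the property that can be misleading for
    synthesized content such as film grain or generated faces.
    """
    if len(reference) != len(distorted):
        raise ValueError("inputs must be the same length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return float("inf")  # bit-exact copy: "perfect" fidelity
    return 10 * math.log10(max_val ** 2 / mse)

ref = [100, 120, 130, 140]
gen = [102, 118, 133, 137]  # similar brightness and texture, different pixels
print(round(psnr(ref, gen), 2))
```

A generative reconstruction could look subjectively fine while this number stays modest, which is the gap the new metrics discussed above are trying to close.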

Yeah, and it throws up some interesting questions if you're not even trying to be pixel-by-pixel accurate. If the motion is not exactly the same, if the texture is not exactly the same but it's somehow statistically very similar, there are kind of philosophical issues. In security footage, when you want to verify that something happened or didn't happen, or whether the football crossed the line, if the motion was modeled in a certain way, these kinds of things become much more involved.

Jill: Yeah, that's why I feel like you define these coding standards with a toolbox, and you pick and choose what's appropriate for your particular use case. So maybe a surveillance video would not choose to use certain tools that a TikTok video might; they just might have very different specific requirements. We try to make the standards future proof and keep the core underlying thing similar, so that, especially for all the stuff that's really touching the pixels a lot, you can have a hardware implementation that's going to be efficient, and then a bit of a toolbox on top so you can target specific applications based on their specific needs.

Zoe: Yeah, I was actually pleasantly surprised to hear you make a connection between film grain synthesis and the neural network coding, especially involving GenAI. Because exactly as you mentioned, the synthesized grain could be pixel-by-pixel different from the original yet look similar. So the fundamental issue you just touched on is validation: what kind of quality metric do we have to develop on top of all these new things? Human eyes don't compare grain pixel by pixel; they just see whether it looks like film grain, and the size and strength of the grain are what make it feel different to human eyes. So for all these new things, what kind of quality metrics need to be developed seems very critical.

Jill: Yeah, it's super critical. I'm not an expert specifically in that, although another SEI message that I proposed at the last meeting, and which is in the technologies under consideration document, is to be able to signal a quality metric for individual pictures in band with the content. So it could be carried in band, and it has some defined preset metrics that we all know, like PSNR or MSE or SSIM. But it also has the ability to carry a user-defined quality metric, something not specifically standardized and specified by JVET, where you can give it a name or a tag URI to indicate where it's been defined. So it allows the encoder to indicate which frames are better than others, especially when you talk about human eyes versus machine purposes.

In certain cases, for machine purposes, knowing that this particular picture was higher quality in some way might mean it's the one you choose to do your machine learning task on. If I want to do facial recognition, this is the hint: of all these frames, this is the one that's going to be better quality, the one you might want to choose to run your facial recognition on.
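[Editor's note: as a rough illustration of the per-picture quality signaling Jill describes, here is a hypothetical sketch. None of the field names, preset indices, or tag URIs below come from the actual JVET proposal; they are invented to show the idea of preset versus user-defined metrics and a consumer picking the best-signaled frame.]

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative preset table; real assignments would come from the spec.
PRESET_METRICS = {0: "PSNR", 1: "MSE", 2: "SSIM"}

@dataclass
class QualityMetricMessage:
    metric_id: int                  # preset index, or e.g. 255 for user-defined
    value: float                    # metric value signaled for this picture
    user_tag: Optional[str] = None  # tag URI identifying a non-standard metric

    def metric_name(self) -> str:
        if self.metric_id in PRESET_METRICS:
            return PRESET_METRICS[self.metric_id]
        return self.user_tag or "unknown"

# A machine-vision consumer could pick the frame signaled as highest
# quality to run, say, face recognition on:
frames = [QualityMetricMessage(0, 38.2),
          QualityMetricMessage(0, 41.7),
          QualityMetricMessage(0, 36.9)]
best = max(range(len(frames)), key=lambda i: frames[i].value)
print(best)  # index of the best-signaled frame
```

The user-defined path is what would let a proprietary perceptual metric travel in band without JVET having to standardize the metric itself.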

Zoe: And I also learned quite a bit about SEI from this. It really lets the existing standard open up new options and new opportunities, bringing them back into the standard even though it's already finalized. Leaving the SEI mechanism in place makes it scalable and able to extend to new requests.

Jill: Yeah, it was interesting you mentioned the film grain stuff, because way back when I worked at Technicolor, I worked on film grain, actually with Cristina Gomila, and we made the original proposal to the joint team back then to standardize film grain. It wasn't adopted at the time. It took a little while, but eventually there was an SEI message, and it eventually became part of the Blu-ray and HD DVD specs. It's interesting to see how film grain has continued to move along and, on the AOM side with AV1, become a mandatory feature. It's just such a potential savings, right? Grain is so noisy, and our codecs don't like noisy. Our video codecs like correlation, right?

Thomas: Yeah, so there can be huge savings. I think the pain point that we experienced with AV1 and film grain, though it will work itself out eventually, is making sure that people actually do it. The risk if you're a VOD producer is that you can save half or two-thirds of your bits by getting rid of the film grain and synthesizing it, but if TVs don't support it, or support it slightly differently or ambiguously, then many of your viewers may get a bad impression. So people need to do it.

Jill: Yeah, you're afraid of what happens if it doesn't get done. It'll look worse than if I had left the grain in; it won't have the look and feel that was desired by the content creator.

Thomas: Yeah, so there's then this kind of negotiation with hardware vendors, because they're the guys you really need to persuade to do it. I remember in AV1 there were these discussions about having two conformance points. On the hardware side, they wanted the two conformance points so they had the option not to do it, and they wanted to wait, I might be putting words in their mouth, but wait until the market had really spoken that it was needed. So you always have this dance around these kinds of issues: is it really needed? Okay, then we'll really do it.

Jill: Yeah, but it's always a little chicken-and-egg. Will the hardware do it? Well, if the hardware doesn't do it, the encoders still have to include the grain, and then the hardware folks will say, "Well, I don't have to do it. All the bitstreams have grain in them."
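[Editor's note: Jill's point that codecs like correlation and grain resists compression can be demonstrated with a quick experiment. Here zlib stands in for an entropy coder, and the signal, its period, and the grain amplitude are all made up for illustration.]

```python
import random
import zlib

random.seed(0)  # deterministic "grain"

# A smooth, correlated "image row": a short repeating ramp,
# trivially compressible by any entropy coder.
smooth = bytes(128 + int(20 * (i % 50) / 50) for i in range(10000))

# The same row with pseudo-random "film grain" added: the grain has
# essentially no correlation, so it resists compression.
grainy = bytes(max(0, min(255, b + random.randint(-8, 8))) for b in smooth)

smooth_size = len(zlib.compress(smooth, 9))
grainy_size = len(zlib.compress(grainy, 9))
print(smooth_size < grainy_size)  # the grain costs many more bits
```

The grainy signal compresses to many times the size of the smooth one, which is exactly why modeling the grain, stripping it before encoding, and re-synthesizing it at the decoder saves so many bits.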

Zoe: I have been really influenced by all these discussions, and I want to mention one point, since we've already spent almost 50 minutes talking about all the experience and technology you've described. Back to your own experience: everybody always asks the same thing. You are female, Jill, and we are both part of the women in streaming media community, so I want to bring up one thing from talking to you. When you mentioned you were doing those fade-ins and fade-outs, you couldn't bear those artifacts, and that's why you got weighted prediction in there. And just now you also mentioned looking at the grain. In the way you pursued these, I can really see the patience there.

My question is about how you work now. You're in a standards group where everybody is doing standards, but you also bring your product and application experience, so you have your own unique things there. And you are female, right? I don't want to talk in general about the challenges of being female, a lot of people have addressed that, but still: how do you keep the passion and your influence on others?

[00:49:26 How do you keep the passion and influence others?]

Jill: You know, it is interesting. I guess I just find working on new things always very exciting. Even though I've been involved with coding tools and this and that, I always get a little more excited by something that's enabling a new feature or a new experience rather than a 0.01% coding gain that you only see in the numbers. Whether it's film grain or fade-ins and fade-outs, or the work I did with 360 video and scalability, I've personally always been very interested in new use cases, new things that are more visible and can be seen rather than hidden down in the numbers.

But that stuff is super important, and I don't want to in any way diminish it; I really respect the people who do it. I just don't get as excited by sitting there doing the software implementation and testing every little possible thing: what if I do this little intra prediction direction, or this specific one, or whatever. I get more excited by the higher-level things and the use cases. And I actually get really frustrated that so many people in the group don't even look at the videos. I think it comes down to people who have only done standards and not products. With products, you need your video to look good, right? That's what the end users are experiencing. The video has to look good; whether or not your PSNR is better is not the same question as whether your video looks better.

So I find that aspect exciting: am I actually influencing a human? Am I creating a new use case? Am I enabling some new usage that we hadn't anticipated? Years ago there was a set of slides that Gary Sullivan and, I think, Aaron Gill had put together about standardization philosophy, and it said something like: if you're really lucky, the standard that you're defining here will be used, and it'll be used in ways you never imagined right now.

And that was in the AVC days, and I think it's undoubtedly true, right? AVC is being used in ways, and for use cases and applications, that we would never have imagined back at that time. So the thing that gets me excited and wanting to come back to work every day is thinking about how I can enable these different actual use cases that will actually affect people or affect products. It's not just about sitting in the corner doing the math, writing the program, and running the test. It's about enabling something that people will use.

Zoe: Yeah, you basically mentioned two aspects, and maybe they're fundamentally very close to the same thing. First, innovations make you feel excited, but more importantly, what you have done actually influences end users, people's daily experience, right? That's why you said video is ultimately not for machines but for human eyes, and it only matters when human eyes find the videos pleasant, not just when the numbers look better.

Jill: Can I touch on one thing we didn't get to, the whole being-a-woman-in-the-field topic? I do want to at least touch on that. I feel like being a woman in a male-dominated field like video coding is a double-edged sword, because you're just more noticed, right? You're more remembered. So if you do something bad, everyone remembers it, and if you do something good, everyone remembers it. I do think that, on balance, that has perhaps been a benefit for me. I can't tell you how many times I'll meet somebody and they'll be like, "Oh, nice to see you, Jill," and I'm like, oh, I can't remember who you are. And it's like, oh, we met six years ago at this conference. And I'm like, yes, I was there at that conference. Maybe I'm just not good enough at remembering names and faces, but that standing out more, I guess because I'm more unusual, makes it easier to remember me purely from demographics, you know?

Zoe: I completely agree about the double-edged sword. Sometimes I feel like, okay, when you're doing well, people think you are exceptional. And when you're doing something not as good, they may think, okay, this aligns with what they expected from a female leader. It's like adding another data point to their existing evidence.

So based on what you just said about innovations and changing use cases, to wrap up this episode, we want you to look a little bit into the future. You're doing standardization, and we always say we need a bit of a crystal ball, but as you've mentioned, our vision is based on a deep understanding of what is already happening. With so many use cases, we'd like to hear what you think will happen in video. There are new experiences, like the 360 and multi-view you mentioned, even generated videos. Usage will definitely drive the technologies, but on the other hand, technology becomes the enabler of new usages. What do you think the next few years hold?

[00:55:50 Video world of tomorrow]

Jill: I think, for sure, there's going to be a next generation of video coding standard, and probably on both sides, right? I know AOM is working on AV2, and for JVET there's definitely going to be a next generation standard, and it's definitely going to be more coding efficient than before. We always try to make the standards address a wide range of use cases, and I think this one will too. Even with VVC, we had "versatile" in the name because we were aiming at that wider variety of use cases. In the H.264/AVC days, there were extensions done later on, and some of them weren't even originally intended for AVC; scalable video coding wasn't originally meant to be attached to AVC, it was a whole, completely separate effort. Then by the HEVC days it was, oh well, a lot of these extensions will be done as extensions to HEVC instead of as a completely different standard.

When we got to VVC, we right away thought, okay, we know we need to support scalability and multi-view. And then there were the range extensions for higher bit depths and things like that. So for H.267, I'll call it that for lack of another name, I think machine vision is one of the things we're more aware of, and gaming content too. Maybe it'll just be high-level syntax changes, maybe there will be specific coding tools that are good for those, which might be in a main profile or in a different profile. There's always this huge fight: everybody wants exactly what they want, but no more than that, in the main profile, because the main profile is the one that gets widely implemented, and if something is in a different profile that isn't implemented in hardware, you can't use the tool.

So I would expect there'll continue to be fights about what goes in the main profile versus a separate profile, like I said with the neural network type stuff. I can specifically say that gaming content, even generative AI type content, and machine vision consumption are things that are going to drive it. The whole requirements activity is actually just getting kicked off: at the April meeting there was a first input contribution dealing with requirements for the next generation standard.

So we'll find out more in the coming meetings; it was specifically said, let's start encouraging input contributions to discuss requirements. But I think I can say there will be a broader range. We generally don't throw away the use cases of the past: we still need to support video conferencing and low-latency applications, and we still need to consider broadcast and streaming. The set of use cases doesn't tend to go away; it just adds on. So there'll be discussion about those requirements, a focus on some of them, and then, if we're lucky, a standard will get developed that will be used in ways we could not even imagine right now.

Zoe: All right. At least for myself, I really want to invite you back down the road to hear about all of this, especially the requirements definition you mentioned, because you want the requirements to capture what's going to happen next and whether it will become something with even larger potential. So we'd love to hear what happens.

Jill: Yeah, and I'll specifically say to anyone who's watching or listening to the podcast: contact me if you have requirements that you think are new ones that should be addressed. I'd love to hear from people who are working on products. Now that I'm in a group that's more standards-focused and I don't have direct product groups in my own company to talk to as much, I really want to be speaking to a wide variety of people in a variety of industries, saying, "I want to understand your requirements to make sure the new standard can address them."

Zoe: All right, we will definitely highlight at least this one: Jill has a call for a wider set of requirements, so feel free to contact her. All right.

Okay, and we're here to wrap up this episode. I'm grateful to have you here, Jill, and I hope that down the road we have more discussions like this and can invite you back, thank you.

Jill: All right, thanks very much.

Thomas: Thanks very much.

Jill: Nice to talk to you guys.

Zoe: Nice to talk to you. And thank you to everybody who is listening to this episode. Remember that Jill is waiting to hear about more requirements; the video field is definitely going to see more exciting times ahead. Thank you, everyone.

The different challenges of software encoders and hardware encoders
How did MPEG-2 have such a big influence?
A particular issue with the high-level syntax aspect of things
How AI influences standardization
How do you keep the passion and influence others?
Video world of tomorrow