The VideoVerse

TVV Ep 14 - Codec for AI, Codec of AI, Codec with AI

March 01, 2023 Visionular Season 1 Episode 14

In this episode, Professor Maggie Zhu and her student Zhihao Duan describe the state of the art in AI codecs and their diverse applications. AI compression can be applied to video in three ways: augmenting conventional video codecs, completely replacing a conventional codec, and compressing the AI models used for various video applications. They discuss how the proliferation of lightweight video-capable devices is driving new applications and changing the balance between encoders and decoders, and how their novel architecture can beat the state of the art. Watch or listen to see where the next generation of technology is taking us!

Watch the full video version.

Learn more about Visionular and get more information on AV1.

[Announcer] Welcome to "The VideoVerse."

Zoe: Hi, this is Zoe. Welcome back, everyone, to our podcast, "The VideoVerse." This is going to be a very special episode, different in many ways from previous episodes, because this is the first time we have invited academics to join our podcast to talk about video technologies. I have been really looking forward to this episode. First I'll introduce my co-host, Thomas. He will be familiar to many of you because he has already appeared in one of our earlier episodes.

So today we have Professor Maggie from Purdue University at West Lafayette, and we also have her student, Zhihao, joining us. It's an interesting episode in another sense too: Professor Maggie joins us from the East Coast of the United States, while her student, Zhihao, is currently visiting Shanghai, China, so he is joining us on Beijing time. Actually, we need to congratulate Zhihao, because he just got married. My co-host, Thomas, joins us from London, and I'm here in the Bay Area on Pacific time.

So right now we're really widely spread. I'm here in the morning, London is into the afternoon, the East Coast is already well into a bright morning, and it's late evening Beijing time. It really shows how globally distributed our podcast is. Today we're going to talk again about video compression, and specifically about AI applied to video compression. But first, I'd like to hand over to our guests to introduce themselves. So Maggie...


Maggie: Thank you, Zoe, for the introduction. So hi everyone, I'm Maggie Zhu. So I'm an associate professor with the Elmore Family School of Electrical and Computer Engineering at Purdue University, the West Lafayette campus. Very nice meeting everyone in this format. It is my first time, but I really look forward to it.

So our group mainly works on research topics related to image processing, computer vision, video compression, as well as digital health. And the topic that we would like to discuss today is a little bit special. It is more of an emerging area, known as coding for machines. The idea here is that, traditionally, we think about multimedia signals, not just videos. It can be speech, audio, images, or even point clouds, light fields; these types of signals have mainly been acquired, processed and compressed for the purpose of human use.

However, an increasing majority of internet connections and traffic is now dedicated to machine-to-machine frameworks. So how do we address this increased data communication across the network? If it is for the purpose of automated machine analysis, then how do we think about this new problem, and how do we think about it in terms of video compression? So this will be the topic that we're going to discuss today. I'll let Zhihao introduce himself now.

Zhihao: Yeah, so, hi everyone, this is Zhihao. I'm a third-year PhD student in Professor Zhu's group. As Professor Zhu said, we work on various things. I work around coding plus AI, in two ways. We have coding with AI, where we can use AI techniques to improve coding performance, and we have coding for AI, where we design coding algorithms specifically for AI algorithms, not for humans. So this is my research scope.

Zoe: Well, thanks; you are still on your honeymoon. I'd like to add a little bit more, 'cause we invited Maggie and Zhihao, and we need to congratulate them: they just won the Best Algorithm Paper Award at the WACV conference, which finished in early January of this year. I think they can talk about it a little bit, because the paper actually caught our attention. The paper, as our guests introduced, is about applying AI technologies to video compression and image compression at large.

We also have had episodes talking about how AI can be applied to video codecs, 'cause right now machine learning is being applied everywhere; especially recently there's a wave of talk about GPT. AI and machine learning can be applied to almost every aspect, but image and video compression, as I think our guests have also touched on, is a relatively complicated procedure. Prior guests have also told us there's a lot of research ongoing, but a lot of people are hesitant about whether machine learning or AI could ever be applied, or would ever show advantages, compared to the traditional 2D-transform-plus-motion-compensation framework.

We're talking deep tech here. Basically, image and video compression has a traditional framework that has been applied for all these many years, I think, from the very oldest codec standards to the most recent ones. For example, we have H.264/AVC, then H.265/HEVC, and now VVC, a.k.a. H.266. And on another trend line, for open media, we have AV1. All these video codec standards produce different formats, but underneath they actually follow a very similar framework. Image compression is similar: there's a transform applied to the images, and block-based prediction is applied. So there's a traditional framework that has been leveraged there; the question is how AI, or machine learning, can be applied to it.

I think Zhihao just mentioned a very important idea: image or video compression is sometimes not just about compressing images to be seen by human eyes, but can also serve the ultimate goal of images being consumed by machines. For example, I think, Zhihao, you can talk a little bit about the title of your paper and the main theme of the paper that just won the Best Paper Award. You actually also have one more paper that made the top five as a Best Paper finalist at December's conference, the Picture Coding Symposium, PCS for short. I believe PCS is right now one of the most prestigious video and image compression conferences in academia as well as in industry.

So can you talk a little bit about the best paper work that you've done? Also, you just mentioned that machine learning is being applied to image coding for the sake of images being seen by machines instead of human eyes.


[00:09:00 Coding for AI]

Zhihao: Yeah, so we have two papers published in the last two months. As we said, we have two scopes: one is coding for AI, and another is coding with AI. Maybe let me talk about them one by one, so let's start with coding for AI. In this problem, we want to design a coding algorithm that is optimized for AI algorithms to run on the bitstream, not for humans to look at the reconstruction.

If we look at all of the traditional codecs, for images we have JPEG, JPEG 2000, BPG, et cetera. For videos we have H.264, H.265, AV1 and VP9, et cetera. All of them compress videos into as few bits as possible, and they want to reconstruct the image or video pixels to their original form with good visual quality. This has been the norm for the past 20 years, but now many videos and images, I believe much of the data on the internet, are transmitted not for humans to look at, but for machines to analyze.

For example, let's say we have a drone, an unmanned aerial vehicle, a UAV. It is operating in the sky and there is a camera on it; it captures some video and we want to do, let's say, object recognition on it. The way it works is that it captures video and needs to transmit the video to the backend. It may be an edge server or it may be a cloud server. We want to do this because running an AI algorithm on that machine, sorry, on that UAV, is expensive in terms of power. So we really want to compress the video into bits and transmit it to the backend for analysis. And on the backend, there's no human looking at that video. There's only an AI algorithm analyzing it.

So in this case we really don't care about the reconstruction PSNR, the metric we always use in video compression. We just care about the AI algorithm's accuracy on the backend. Yeah, so this is what we call the coding-for-AI problem, and we have one paper on this. The second scope we care about is coding with AI.
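To make that evaluation idea concrete, here is a minimal sketch, not the paper's method: it uses Pillow's JPEG encoder as a stand-in codec and a hypothetical `run_detector` function standing in for the backend task model. The point is that the quality axis becomes task accuracy rather than PSNR.

```python
import io
from PIL import Image

def rate_accuracy_point(image: Image.Image, quality: int, run_detector):
    # Encode at a given quality and count the bits actually transmitted.
    buf = io.BytesIO()
    image.save(buf, format="JPEG", quality=quality)
    bits = buf.tell() * 8
    # The backend never shows this to a human; it only runs the analysis model.
    decoded = Image.open(io.BytesIO(buf.getvalue()))
    accuracy = run_detector(decoded)  # hypothetical task model, e.g. an object recognizer
    return bits, accuracy

# Sweeping `quality` traces out a rate-accuracy curve, the coding-for-AI
# analogue of the usual rate-PSNR (rate-distortion) curve.
```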

So as Zoe said, in the past decades, virtually all of the codecs have been based on transform coding. Yeah, in the past few decades, to the best of my knowledge, virtually all of the codecs for images and for videos have been based on transform coding. So there's a, let's say, DCT or DWT, discrete cosine transform or discrete wavelet transform. There's a transform that takes the pixels into the frequency domain, and we quantize and code the frequency coefficients instead of-

Thomas: So do these, sorry to interrupt you, do these AI algorithms work in sort of similar ways or are they a generalization of this kind of architecture? Do they have things like quantization and approximation in them?

Zhihao: Yeah, that's a very good point. I would say it works in a similar way, but it's more generalized, if we think about the essence of compression. How compression works is that we find frequent patterns and infrequent patterns, and we assign fewer bits to frequent patterns and more bits to infrequent patterns. That is how entropy coding works, and it is also the foundation of how compression works.
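As a quick illustration of that principle (a sketch for the reader, not from the paper): the ideal code length for a symbol with probability p is -log2(p) bits, so frequent symbols get short codes and rare ones get long codes.

```python
import math
from collections import Counter

def ideal_code_lengths(symbols):
    # Frequent symbols deserve short codes, infrequent ones long codes:
    # the ideal length for a symbol with probability p is -log2(p) bits.
    counts = Counter(symbols)
    total = sum(counts.values())
    return {s: -math.log2(c / total) for s, c in counts.items()}

# 'a' appears 7/8 of the time, so ~0.19 bits suffice; the rare 'b' needs 3 bits.
print(ideal_code_lengths("aaaaaaab"))
```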

[00:14:01 Machine learning and AI techniques in the coding pipeline]

Maggie: Yeah, so I think one of the challenges with traditional coding is that there are a lot of parameters to tune. It is almost like an art that you have to master, and it becomes very difficult for someone to even try to get into the area, because they will say, hey, here is the latest, greatest standard, go use it. But how? There's a lot of training involved; there is a lot of engineering design in it, and it is very dependent on the different applications. That's why you have many different configurations, different kinds of modes, introduced in a traditional codec: because it is tailored to achieve the best RD for the type of application, or is even content-oriented.

So I think this is one place where AI techniques could help; of course, this is just one aspect of it. If we can learn something about the data that needs to be compressed, whether it's image, video or something else, and now we have the luxury of large-scale datasets, that is something machine learning can leverage. From the large-scale dataset, you can perhaps alleviate some of the burden of having to manually find the best solution for your particular application or your particular content.

And if you think about it, this is not really a brand-new idea, because, for example, even looking at work such as quality metrics: there's VMAF, and the reason it works is that it also knows when PSNR works better, when SSIM works better, when VIF works better. So, again, it is leveraging having seen a lot of different data, and then knowing in what scenario you would use what type of parameters. I think this is one of the advantages of introducing some of these machine learning and AI techniques into the coding pipeline. So that's just my 2 cents.

Thomas: Would you say in these types of applications, these machine-to-machine applications, there's maybe a different kind of balance? I mean, traditionally video codecs have been designed around a sort of one-to-many architecture. Video conferencing is one-to-one, but in VOD you have a single encoder that can be very expensive because it's serving many thousands of decoders. So the balance is many decoders, a single encoder. But are these applications more the other way round, where there might be many encoders and a single decoder? Something like surveillance or security, those kinds of applications.

Maggie: Yeah, exactly, and that raises a very interesting point, Thomas. Because if you think about it, it really depends on the kind of application and the type of hardware you're working with. We're talking about some machines that have huge GPU capabilities, so expensive encoding is fine, as long as it gives you the accuracy or the rate that you desire. But then you also have very, very small devices that may only have enough capability for a couple of neural network layers.

So what do we do in this situation? We can't build a lot of complexity into the encoder side beyond what the device can support. So there is, I think, a lot of dynamics and adaptability that the system needs to be built with. That's one of the reasons why, if you look at the standards work, like MPEG and JPEG, they have been quite active in forming these ad hoc groups: Video Coding for Machines, for example, in MPEG, and there's an AI ad hoc group in JPEG too. 'Cause essentially this becomes quite challenging to think about from a standardization point of view.

So yeah, there's a lot of interesting work that needs to happen, not just for those of us in academia to think about how we best solve this problem, how we design algorithms that can serve these different purposes, but also from the more engineering or industry point of view: if we have this great algorithm, how do we deploy it in our systems, in our devices? So although it's very exciting, it is also a very complex and very difficult problem.

Zoe: Yeah, I think that's why we're talking here, 'cause there's actually a lot of skepticism from the industry about whether the use of machine learning or AI can ever benefit the codec world, image codecs or video codecs.

But talking with you both previously, we learned some things; for example, Zhihao talked a lot about transforms. I heard it mentioned once that at one time someone was trying to propose a new transform algorithm and submit a paper to academia, and some reviewers insisted that the transform should be put inside a real codec framework and benchmarked to show the advantages of the algorithm. But indeed, with an image or video codec, if you really want to go down the path of how standardization or industry evaluates things, it can be hard, because it's complicated.


So if you design some new algorithm, Zhihao, it seems you are obligated to put that module within the whole framework, and that takes time. So what do you think: how can academia really advance algorithms without being hindered by what seem to be industrial criteria? Because we need pioneering work from academia that can guide us, give us insights, and eventually benefit, for example, the codec industry. But on the other side, the industry is skeptical, so they want to impose a lot of evaluation policies, hoping academic papers will follow certain benchmarking. So what do you think?

Zhihao: Yes, so first of all, that's a very good point, and I do agree. The learning curve for codecs is very long. Once I was looking into the VVC codec, and the documentation was more than 50 pages and very dense. It took me several days to learn the documents and run it for the first time. So, yeah, it does have a long learning curve. And regarding the-

Maggie: Well, that also doesn't include making changes. Sorry for the interruption, I just wanna make a point. Zoe mentioned the story I told her. It's not actually our group's work; it was someone I know who had also been working in this area for a long time, but is no longer as active, because they wanted to propose something new, and this was still within the traditional framework: let's say, perhaps, a new block transform. But the problem is that anytime you wanna make a change to the transform, that is the core. That means everything else downstream will have to be changed, and how do you do that? How do you do signaling, how do you do entropy coding?

So all of that is not gonna fit in the existing block-based transform paradigm in any modern codec. I think it's because everything is so interrelated: a small change in one place, something you would imagine is a small change, could mean many, many changes in other places. And in academia, as we publish papers, we get reviewers, and a reviewer has a different background. If they don't come from the coding community, they may not understand. They could say, well, yes, this looks good, mathematically you proved it, it's better than existing ones, but I can't see bitrate savings. I don't see the RD curves, I don't see your BD-PSNR. But in order to show that, you have to put it in an existing coding framework.

So I think that's where, from the academic point of view, it's hindering, and that's why people who might have been working on this for a while can end up feeling hopeless. That's one aspect, and I just wanted to point out that the challenge is not that we in academia only want to pursue papers and don't care about translation. I would say it's more: how do we deliver what we think is great into an actual product or deployment? Or even before that, how do we get people to agree on what is right or what is a potential pathway? So I think it's important; we need understanding from both sides. So go ahead, Zhihao.

Thomas: Yeah, one comment. Having participated in standardization, I would also say that even within standardization it's very difficult to make changes, because you are in a local optimum and there are lots and lots of moving parts. So it's certainly very hard if you don't have deep immersion in the codec standard you're developing or using. You can find that something that should work doesn't, because many things automatically adjust inside an encoder. So it can be super hard.

I think one of the things that's really attractive and interesting about some of this AI stuff is that, in a sense, it can be simpler. You have a lot of layers, and then you have automatic training, and it's the training that takes care of all the very complicated hand tuning that you've talked about. So maybe we would get a fundamentally different architecture for future standards, one that is much simpler and yet more general, with these kinds of layered solutions.


Maggie: Yeah, and I think one thing, a little bit of a deviation from this, but I think also important, is that we want to push at many of these high-profile conferences for attractions such as workshops or special sessions, where we bring people from both academia and industry together to talk about these emerging areas. Both people from industry, who think about how to do this from the hardware point of view, and people from standards bodies, who ask how we standardize this. So there need to be opportunities and venues where people can come together to talk about these things. That's, I think, another way to bring people together.

The challenge here is that sometimes it's difficult. People in industry who work on products go to certain conferences and certain meetings, but we academic people go to academic conferences. So there is not a lot of opportunity to talk and chat about this. I think we also need a better environment to make sure these emerging topics can be discussed, so we can see the different challenges early on and people can work on them together.

Zoe: Yeah, so you basically take the requests from the industry, become aware of what the real issues are down there, and provide solutions from academia. And that's very much needed. But back to our original topic, because we've talked about quite some challenges overall: the Best Paper that you just won at WACV. This conference, in my impression, is one of the top computer vision conferences.

And I think they have two tracks, algorithms and applications, with one Best Paper for each track. But the majority of submitted papers were in the algorithms track, and you won the Best Paper of the algorithm track. In one sense this is a computer vision conference, yet the best algorithm paper is actually focused on codecs. So we do see that, at least in this trend, AI applied to codecs draws majority attention.

So I'm just curious, back to the paper itself: what kind of algorithm, in general, did you propose, and what results did you present that show promise for the AI-plus-codec topic?


[00:29:21 AI bridges the gap between many fields]

Zhihao: Yeah, so, first of all, thank you for introducing our paper. In that paper, we first identified the relationship between compression, or coding, and a generative model. A generative model is a kind of AI model that learns the distribution of data. It is fundamentally related to coding and compression, because both of them aim to find patterns in the data. So I hope this makes sense. But I just want to highlight that the introduction of AI brings a very, sorry, let me do it again.

So yeah. I just want to highlight that AI bridges the gap between many fields, including coding, computer vision and machine learning. I think maybe in the future we will see a single method for both coding, or compression, and image generation. For example, I guess we all know the diffusion models, Stable Diffusion, the AI art models. Sorry, Professor Zhu might...

Zoe: No worries. 'Cause there's a lot of curiosity from the industry and standards community.

Zhihao: Yeah, because I think there's a lot of technical stuff and I don't know, like, what's-

Maggie: So, okay, maybe one way to think about this is: for this particular paper, what we're proposing is to think about coding from the point of view of a variational autoencoder, and that's one type of generative probabilistic model. As Zhihao says, it can learn the statistical distributions from the data; in our case, that would be images. And we actually did it for images. We have not quite done it for video yet, because, as you all know, video is much more complicated, since there's also a lot of redundancy in the time domain. So that's something, yes, we're working on.

The original paper at this conference is more focused on images. If you think about it, all it tries to do is ask: how do I transform data from the pixel domain to some kind of latent feature space? If you think of it that way, it's really quite similar to transform coding in traditional coding. And what we are able to show is that, using these variational autoencoders, we can learn the rate-distortion function of the data, which is quite fundamental for lossy compression.

So I think this is where it brings new insight to the community. As Zoe mentioned, this is a computer vision conference, so the audience is familiar with variational autoencoders. We're not proposing a new neural network architecture or anything; VAEs have been shown to be really great for generating things, or transforming one type of thing into something else. These things have been shown. But I guess what's new is that people hadn't thought about their relationship to transform coding, for example. Because we work on compression, we see that connection. So I think that's where this paper got noticed, and people appreciated the new insight we brought to the community.

Thomas: So what interested me was the connection you made there between the older, kind of fixed architecture that compression has and this more general model. If I'm understanding it rightly, these kinds of generative autoencoders work by creating a set of what you call latent variables, which you can use to then generate an image at the other end. So it's some kind of hidden space in the middle, and that's what you want to approximate, that's what you want to do dimensionality reduction on, and that's what you want to transmit. And then this is what a decoder can use to generate your image or your video. So you have a matched pair of encoder and decoder with these coefficients or variables in the middle, and that does sound very like conventional compression.

Maggie: Yeah, you're right. And the other contribution we have, if you think about it, is that in the variational autoencoder we still have to do entropy coding. Otherwise, this latent feature that we get from the VAE is continuous, so entropy coding won't work. That's another thing Zhihao worked on: modifying the configuration, particularly the posterior and the prior distribution parts, so that it can support entropy coding and make this end-to-end lossy compression possible, building on the variational autoencoder. So I think this is where people feel there's something quite interesting that brings new insights. And Zhihao, feel free to elaborate more.
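For readers who want the flavor of that step, here is a minimal sketch in the style of common learned-compression work, not necessarily the paper's exact formulation (which modifies the posterior and prior): rounding is non-differentiable, so training typically substitutes additive uniform noise, and the rate is estimated from the probability mass the prior puts on each quantized value.

```python
import torch

def quantize_latent(z, training):
    # Training: additive uniform noise in [-0.5, 0.5) as a differentiable
    # stand-in for rounding; inference: actual rounding before entropy coding.
    if training:
        return z + torch.empty_like(z).uniform_(-0.5, 0.5)
    return torch.round(z)

def rate_in_bits(z_hat, prior):
    # Probability mass the prior assigns to each integer bin of the latent;
    # -log2 of that mass is the ideal entropy-coded length.
    p = prior.cdf(z_hat + 0.5) - prior.cdf(z_hat - 0.5)
    return (-torch.log2(p.clamp_min(1e-9))).sum()

prior = torch.distributions.Normal(0.0, 1.0)  # placeholder prior over latents
z = torch.randn(16)                           # placeholder latent vector
z_hat = quantize_latent(z, training=False)
print(rate_in_bits(z_hat, prior))             # estimated bits for this latent
```

The training loss is then rate plus lambda times distortion, the same Lagrangian trade-off that traditional encoders tune by hand.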

Zhihao: Yeah. So basically that's exactly what we do. One thing I want to highlight is that our method, developed from the perspective of generative variational autoencoders, is very similar to transform coding. As we said, it's a generalization of transform coding. One way to think of it is that, for example, in JPEG we have a DCT and an inverse DCT. And actually there are many types of DCTs. When designing a codec, we want to choose one type of transform over others. But we may ask: why are we choosing this one over the others?

If we recall, the DCT, like other linear transforms, is just a matrix transform, and that matrix contains a bunch of parameters. So what we can do is ask: why not just learn the parameters with machine learning techniques? And if we make this transform very complex, it becomes deep learning; it becomes a kind of deep autoencoder. By doing this, we simplify the manual work, because we don't need to try different transform configurations; we just set an objective and let the entire system learn the parameters itself, and hopefully it learns a very decent transform. So this is one way to think of autoencoders versus traditional transform coding.
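A small sketch of the point being made: the DCT really is just a fixed matrix, and an orthogonal one, so its transpose inverts it. A learned codec would swap this hand-designed matrix for trainable parameters and stack nonlinear layers on top; the example below only shows the fixed-transform side.

```python
import numpy as np

N = 8
# Build the 8-point DCT-II basis: a fixed, hand-designed orthogonal matrix.
k, n = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
dct = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
dct[0] /= np.sqrt(2.0)

block = np.random.rand(N)   # a 1-D strip of pixel values
coeffs = dct @ block        # forward transform to the frequency domain
recon = dct.T @ coeffs      # orthogonal, so the transpose is the inverse
assert np.allclose(recon, block)
print(coeffs)               # for smooth content, energy concentrates in low frequencies
```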

Zoe: Before that, I think we should note that the "autoencoder" we're talking about here is a computer vision concept; it's different from the encoder concept in the codec field. Go ahead, Thomas.

Thomas: I was just gonna ask, you're talking about learning these parameters. Is this something that happens online as part of the encoding process, or is this a sort of pre-trained situation? Could this system adapt in real time to the data that it's getting, or does it need to be trained in advance?

[00:38:38 Pre-trained method]

Zhihao: Yeah, that's very insightful. Typically what we do is the pre-trained method. We have a model and a huge dataset; we train the model on the dataset, then fix it and use it from then on. This is what we typically do, but there are indeed papers and novel techniques for doing it online. What we could do is, once we have a model and we are seeing new data, we can design an algorithm such that if the model finds the new data follows a specific pattern, for example, it finds the incoming data is all screen content or all face images, then it can update its parameters to adapt to that content. And, of course, this is done on the encoder side, and the decoder side needs to do the same thing, so that the encoder and decoder match each other. So yes, I guess this is one of the promising features.

Thomas: I remember some time ago there was an MPEG project called "Reconfigurable Video Coding," where the idea was that you sent a description of your algorithm as part of the bitstream and could update it, and you could produce a rate-distortion model that includes the entropy of your decoder algorithm as well as of the data itself. Having this kind of general framework for video processing, where you are just sending weights of different pre-trained layers, does open a lot of interesting possibilities: sending updated weights to decoders as part of bitstreams, sometimes adapting, sometimes not.

Zhihao: Yes. I would not say it's a more general method than in AI, but I would say it's a very similar method. In AI-based coding, there are indeed methods that transmit the weights of the model in the bitstream. For example, what we can do is overfit a model on one image and transmit the weights to the decoder side. Using those weights, the decoder can perform very well on this single image. Some people tried that and got decent results. I would not say state-of-the-art, but decent results. And I would say it's a very interesting and promising future direction, in my view.
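A rough sketch of that per-image overfitting idea, with a hypothetical `model` interface (assumed here to return a reconstruction and a rate estimate; this is not the interface of any specific paper):

```python
import torch

def overfit_on_image(model, image, steps=100, lr=1e-4, lam=0.01):
    # Fine-tune a pretrained compression model on a single image.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        recon, rate_bits = model(image)  # hypothetical: reconstruction + rate estimate
        distortion = torch.mean((recon - image) ** 2)
        loss = rate_bits + lam * distortion  # rate-distortion Lagrangian; lam is arbitrary
        opt.zero_grad()
        loss.backward()
        opt.step()
    # The updated weights (or their delta from the shared pretrained baseline)
    # must themselves be compressed and sent alongside the latent bitstream,
    # so the weight overhead counts against the rate.
    return model.state_dict()
```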

Zoe: Can I ask, since we say this is promising: what is the current performance compared to traditional codecs? Because you did image coding in this paper and won the best algorithm paper, I want to see, and I think a lot of the audience also wants to see, what kind of results this AI codec algorithm gets. What's the compression? Because with codecs, for images, we all want to achieve the best possible quality at the smallest file size or bit rate, and at the same time there's also the computational cost of the encoding process. So what's the current performance, based on your paper, compared to state-of-the-art traditional image codec results?

[00:42:40 What is the result using AI codec algorithm]

Zhihao: Yeah, so in terms of rate and distortion, our method is slightly better than VTM, the reference software for VVC. It was a 4% BD-rate saving, so it's only marginally better. But in terms of efficiency, running speed, computational complexity, learning-based methods are all very complex. We require a GPU to execute our program. If we run on CPU, decoding a single, let's say, 512 x 768 image takes around half a second, which is intolerable in many applications.

Zoe: Encoding or decoding?

Zhihao: Just decoding. But encoding's slower, yeah.

Zoe: Okay. So basically you mentioned 4%, meaning that to achieve the same quality, there's a 4% bit rate saving; you are smaller by 4% compared to, for example, state-of-the-art VVC, H.266 results. But the complexity is high.

Zhihao: Yeah, I would say it's very high. It's, I would say, intractable in practical applications. That's a huge problem for the practical use of learning-based codecs right now, and we are... Yeah, go ahead, Professor.
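For context, the 4% figure is a Bjøntegaard delta rate (BD-rate): the average bitrate difference at equal quality between two codecs' rate-distortion curves. Here is a minimal sketch of the standard computation, a cubic fit of log-rate versus PSNR integrated over the overlapping quality range (the RD points below are made up for illustration):

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    # Fit cubic polynomials of log-rate as a function of PSNR for each codec.
    pa = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    pt = np.polyfit(psnr_test, np.log(rate_test), 3)
    # Integrate both fits over the overlapping quality range.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    ia = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    it = np.polyval(np.polyint(pt), hi) - np.polyval(np.polyint(pt), lo)
    # Average log-rate gap, converted to a percentage; negative = bitrate savings.
    return (np.exp((it - ia) / (hi - lo)) - 1) * 100

anchor = ([100, 200, 400, 800], [30.0, 33.0, 36.0, 39.0])  # kbps, dB (made up)
test   = ([ 95, 190, 380, 760], [30.0, 33.0, 36.0, 39.0])  # 5% fewer bits, same PSNR
print(f"BD-rate: {bd_rate(*anchor, *test):.1f}%")           # about -5%
```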

Maggie: Yeah, so one other thing I wanted to point out is that in this process we made an interesting discovery: in essence, the decoding is progressive. At a very low bit rate, we can already roughly see what the content of the frame, of the image, is. So when you look at the trade-off, it's important to know, and everybody knows, that neural-network-based compression is expensive because of the training and all of that. But I think that is not something that can necessarily be improved purely through algorithms and software, because it would also need hardware support and so on.

So that is something I have confidence can be addressed down the road, just maybe not right now. But what's interesting for now is this progressive decoding. Essentially, depending on the application scenario, we could tailor this framework for different application scenarios. We touched a little on this concept of coding for machines; if that is our target application, then perhaps we don't need to support a very high bit rate, because we're not that interested in the pixel-reconstruction aspect.

If we can get the features, the content from the images or the video, that is needed for some downstream vision task, then we're fine. So in that sense it could support these different scenarios; it depends on what we use it for. So there is also that aspect of generalization, and how it can be applicable to the different application scenarios. That aspect, I think, is also important to consider for this work.

Thomas: In comparing these kinds of algorithms with traditional compression algorithms, you are kind of fighting on your enemy's ground, as it were, because you're being assessed on pixel-matched image metrics like PSNR or SSIM, which require perfect alignment. Would it be possible to get big gains or complexity reductions if you only had to produce something that looked very similar, but wasn't necessarily aligned in terms of the exact locations of pixels and so on? Is there a degree of freedom that maybe you're not exploiting right now?

[00:47:35 AI generative models]

Zhihao: Yeah, there is one small area of research in learned image coding called generative image compression. The objective is not to reconstruct the pixels exactly, but to achieve good perceptual quality. One very simple example of this is to use AI generative models. For example, if we have an image of grass, we could just save it as text consisting of a prompt, say, "an image of grass." On the decoder side, we just use an AI generative model to generate an arbitrary grass image. If we look at that image, the perceptual quality is very high, but it's just not the original image. So in some applications, this may be good.
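Taken to its extreme, that idea looks like the following sketch, where the entire "bitstream" is a text prompt and a text-to-image model is the decoder. This uses the Hugging Face diffusers library; the model choice and the framing as a codec are illustrative assumptions, not the speaker's method.

```python
from diffusers import StableDiffusionPipeline  # assumes the diffusers API

def encode(description: str) -> bytes:
    # Encoder: the whole bitstream is a few dozen bytes of text.
    return description.encode("utf-8")

def decode(bitstream: bytes):
    prompt = bitstream.decode("utf-8")
    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    # The decoder synthesizes *an* image matching the description,
    # not *the* original image: high perceptual quality, zero pixel fidelity.
    return pipe(prompt).images[0]

bits = encode("a photo of grass")
print(f"compressed size: {len(bits)} bytes")
```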

Maggie: Yeah, and then I'm sure Zoe now remembers we actually worked on something like that.

Zoe: Yes. So basically this is combining compression and synthesis. As Zhihao mentioned, you extract certain information, and through learning you acquire some knowledge of the detailed content, then use synthesis on the decoder, the player side, to synthesize those textures. Actually, I think this is a very great topic, because of what you found in your paper: Maggie and Zhihao just mentioned that there's progressive decoding.

So I actually see at least three benefits from the progressive decoding we just discussed. One is that some of these video or image coding applications are not just for humans but for machines: we compress a big amount of data and finally we just want to communicate it, share it, and have it processed by machines. Machines may not need a perfect reconstruction, as long as the information that can be extracted is good enough for the machine, like what Thomas mentioned.

And secondly, sometimes we don't have enough bit rate. Sometimes we have very low bandwidth, the network is not good, but we can still get some information through, even for human eyes. And then, as we just mentioned, because of progressive decoding it can potentially be combined with synthesis work that reconstructs or rebuilds images on the decoder side, so we can at least present useful information, even though it may not be completely faithful to the original source.


So that means there needs to be some new metric for evaluating how well that works. Any new trends in academia right now? 'Cause Thomas just mentioned that your work is forced to be compared against existing work using PSNR and SSIM, which basically compare the original data and the reconstructed data pixel by pixel to evaluate the result. Any other work along this line?


Maggie: I think right now this is still in its infancy. We know a lot of great people working on quality metrics, but I guess this topic hasn't quite gotten enough traction yet to have people think about and work on it. But you're right. That's one of the reasons why I'm advocating, in academia, my familiar territory, to bring people together, to say, hey, here is something really cool: we have a lot of new ideas, we've done some groundwork, and we've seen some promising things that could happen.

Now, how can people who understand standardization, and people who understand quality metrics, help us think about these things? Because this needs to be addressed. Like you said, we're still using the same old metrics for evaluation, and that may not be fair, 'cause we are playing in other people's playground, in their sandbox. So we need something new. I think this will all be happening very soon. We've seen quite a lot of other groups around the world interested in these problems, and hopefully, with a little more critical mass, we can move this forward a little bit faster.

Zoe: Well, thanks for sharing. We have talked quite a bit now, and at times about some quite deep technical algorithms, so I'll try to wrap up here, even though we still have a lot to discuss; I really enjoyed it. For the first time, I'd like to have each of you share a closing thought on AI or machine learning for codecs. I'll go to Maggie first, because as a professor you have many topics to explore, and I know from your website and the paper bio that you have touched a lot of areas. On machine learning for codecs, what's your thought? Just one point is good enough.

Maggie: Yeah, so, to me, there was a great paper by a professor that spoke to the question at the time, and that was not really too long ago, I think maybe less than 10 years ago, when people thought compression was dead. He and a couple of others wrote a paper saying that's not the case. At that time, of course, machine learning hadn't become as popular or a trending topic yet. But it was sort of a dark place for academia, because there seemed to be no work left for academia to do. Yes, there are small problems, there are certain applications we can work on, but from the point of view of academic contributions to the community, it had become very difficult. And this also relates to how we find students interested in working on these problems, and how we find funding agencies and sponsors to sponsor the work.

But now, with this growing trend of AI and machine learning, I see hope. I have some industry experience, and since people knew I had a compression background, they would always ask: what's different? What can be new and groundbreaking? And I didn't have an answer at the time. But now, as we start to pick up the work in this space again, I see hope. I do a lot of things, but this is something very close to my heart, just because of where my training comes from. So I do hope that we can attract more people; we need more people interested in working on this problem, and we need people with different domain expertise, so that we can answer all these hard questions we talked about today. So that's my 2 cents.

Zoe: Wow, yeah. We're all looking forward to the future, basically. And back to Thomas, 'cause you have actually been working in this area, especially in standardization, from HEVC to AV1; you have been heavily involved in standardization, but you have also been involved in codec products for all these many years. So what do you think about AI?

Thomas: I think it's really exciting, and I think we really touched on one of the key roadblocks at the moment: metrics to assess the true potential of these algorithms. I think we don't have a really good metric for subjective quality. There's been lots of good work, say with VMAF and so on, on trying to assess things better, but those metrics are still kind of pixel-matched. And for a long time, people have been wanting to introduce synthesis algorithms into compression.

So I think that's one route to a kind of hybrid ML/non-ML codec. The other thing that is really exciting is these very general autoencoder architectures, which can introduce, if not computational simplicity, at least conceptual simplicity into this whole space, now that our codec standards have become very, very complex and hard to optimize. So I see these as the two major features that will progress in the next few years.


Zoe: Well, thank you. We've talked about the complexity of codecs, and also about needing a metric. Speaking of metrics, VMAF, it seems, is now widely used; it was originally proposed by the Netflix team, and it actually also leverages machine learning. So when we try to evaluate machine learning applied to codecs and we need a metric, we may again rely on machine learning to develop an objective metric: one that reflects subjective human-eye judgment, or, for coding for machines, a judgment that may not align with human eyes but aligns with the machine side. So there's a lot of machine learning involved, as I see it.

Last, but of course not least, Zhihao, I'd like to have your opinion, because you are a third-year student in the PhD program. I have to say that what you've done is very impressive: you were a Best Paper finalist as first author, and then this year at WACV you and your co-authors won the Best Algorithm Paper, again with you as first author. So beyond the direction of AI codecs, I want to ask a little about your career thoughts: academia, industry, what do you think? Of course, related to the work you have been doing, or will continue to do, on AI plus codecs. What do you foresee for your future? Maybe not too far out, say three to five years?

Zhihao: Yeah, so for me personally, I am more interested in an academic position, because I enjoy doing research, I enjoy exploring new things. As for industry, I also like engineering; I also like tuning parameters and making things work. That's very important; it is what makes the world work. So both of them are very interesting, but ever since I was a child, I wanted to be a teacher. I personally prefer academia more.

Zoe: Yeah, you really enjoy it. You want to be a teacher, and you also want to do research. And what do you think of the research direction of AI plus codecs?

[01:00:43 Codecs of AI]

Zhihao: So there are many interesting and promising directions. I would say, in general, there are three for codecs plus AI. As we said, there are codecs with AI: we can use AI techniques to improve codecs. There are codecs for AI: we can design coding algorithms specifically for AI instead of humans. And there is another one, namely codecs of AI. Right now, AI models are very big, and there is an area of research on coding the AI models themselves.

So we are not coding images, we are not coding videos; we are coding the AI models, the weights of the AI models themselves. That's coding of AI. I think all three of these are very interesting and promising. They might not be useful in the current state of the world, but I believe, maybe in five years, maybe 10, maybe 20, eventually they will all play a role in the industry. Yeah.
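As a tiny illustration of that third direction, here is a sketch, under the simplest possible assumptions, of compressing one weight tensor by uniform scalar quantization and estimating the entropy-coded size; real model-compression work goes much further (pruning, vector quantization, learned entropy models).

```python
import numpy as np

def compress_weights(w, bits=8):
    # Uniform scalar quantization of a weight tensor: a minimal "codec of AI".
    lo, hi = float(w.min()), float(w.max())
    levels = 2 ** bits - 1
    q = np.round((w - lo) / (hi - lo) * levels).astype(np.uint16)
    # Estimate the entropy-coded size from the empirical symbol distribution.
    p = np.bincount(q.ravel(), minlength=levels + 1) / q.size
    p = p[p > 0]
    est_bits = float(-(p * np.log2(p)).sum() * q.size)
    return q, (lo, hi), est_bits

w = np.random.randn(1000, 1000).astype(np.float32)  # stand-in for one weight matrix
q, scale, est_bits = compress_weights(w)
print(f"fp32: {w.nbytes} bytes, entropy-coded 8-bit estimate: {est_bits / 8:.0f} bytes")
```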

Zoe: Well, thank you, and thanks, everyone, for getting to this point. I think it's been a very exciting and insightful episode. Just to echo what Zhihao mentioned, we talked about three things: codec for AI, codec with AI, and codec of AI. And of course there's a lot of work ahead, a lot of hope, and much to look forward to. So thanks, everybody, each joining from a different corner of the world, and thanks to our audience.

Maggie: Yeah, thank you, Zoe, for inviting and hosting us, and then thank you, Thomas, for the great discussion.

Thomas: Thank you; really, really interesting, thanks.
