The VideoVerse

TVV EP 21 - Live Captions In All Languages is Hard! Wait, is it really?

August 28, 2023 Visionular Season 1 Episode 21

Live Captions are Hard!  Wait, are they?  Browse on over to the latest episode of The VideoVerse, where we discuss accessibility via live captions and dubbing with Chris Zhang, Senior Solutions Architect at AWS, and Gio Galvez, VP of Business Development at SyncWords.  We discuss a bit of history, then move on to the current state of things, followed by where accessibility is headed.

Watch on YouTube

Welcome to "The VideoVerse."

Zoe: Ready. Hi, everyone. This is Zoe, and welcome, everybody, to our new episode. For this episode, I have my co-host David. Hi, David.

David: Hi.

Zoe: Hi. He's with me again because he was in our previous episode. And for this episode, we actually have two guests joining us, Chris and Gio. So I will let both of them introduce themselves. First, let's go with Chris.

Chris: Hi, everyone, this is Chris. Thank you, Zoe and David, for inviting me here. I really appreciate the invitation. My name's Chris, Chris Zhang. I currently work at AWS as a solutions architect specializing in media services. I got into the video industry early, in 2007. Back then, we didn't have OTT streaming. We didn't even have a phone that could play video properly. In 2007 Apple introduced the first iPhone, and in 2009 they released the Apple HLS OTT streaming specification. That's where I got involved early on, with a startup company that was processing and packaging video to deliver it as OTT streaming.

But those were early days, and there were a lot of challenges in processing that video. In those days you could also see the industry converging: it had just started using H.264, later HEVC, and now AV1. So a lot has happened over the past years, and I'm very happy and interested to be here. Thank you for inviting me, and let's talk about today's topic.

Zoe: Yeah, yeah. So we heard you. We'll definitely talk about something related to what is being offered, and what could potentially be offered, by AWS Elemental. All right, so Gio, we'd like to hear an introduction from you.

Gio: Sure. My name is Giovanni Galvez. I am the Vice President of Business Development for SyncWords in New York. We are a company that provides video accessibility using artificial intelligence, utilizing AWS services for live streaming. About myself, I am a 20-year veteran of the broadcast industry, and my specialty is adding accessibility to videos, both prerecorded and live streamed. In that time, what we did is eliminate the need to go tape to tape. If you remember those days, that's when I first started. To add captioning to a video, you'd have to have a tape and another tape deck, and you'd send it to the dub house. With my team, we eliminated that need by integrating with video editing systems like Adobe, Avid, and Final Cut to make a very simple digital workflow. And that's the standard now. For live streaming, that is a work in progress, and we're hoping to achieve the same standard and make it fully scalable.

Zoe: All right, thank you. So today, based on the backgrounds of our guests Chris and Gio, we are going to talk a little bit about captioning. From talking with Gio previously, I realized there's a lot here that, as a user, I just take for granted. For example, live streaming is supported across all kinds of use cases, and there's a lot of captioning. And I love captioning, because it provides a lot of possibilities, even when I'm away or doing something else and miss some words.

But today, we're going to talk about two major things. One is captions that involve multiple languages. Right now we talk about globalization, so you never know which users will really be interested, and I believe captioning needs to cover the needs of multiple countries and multiple languages. And secondly, as you can hear from Gio's background, we're talking about live streaming. So I really want to learn how this multi-language captioning is offered and how it can be very challenging, especially when serving live-stream use cases. Maybe Chris, you can address that, and then we can go deeper along this theme and talk about the topic.


David: It might be good as well, before we begin, to give everybody a little bit of background, 'cause this all kinda started with broadcast standards and has now flowed into digital. So, for the listeners who may be a little new to this, and I myself am not an expert in captions or subtitles, it'd be good to do that little bit of background and how we've progressed over the years.

Zoe: Yeah, that'd be good, yeah. Thank you.

[00:06:01 Evolution of Captions: A Historical Overview and Progression]

Chris: I think I can give a little bit of background, though Gio is more the expert here. Captioning is a standard from the old days in the broadcast world, when you watched TV. The standard was EIA-608, and later, when the analog-to-digital transition happened, the standard changed to EIA-708, right? So when you watch your TV, you can choose the closed captions: CC1, CC2, all the way up to CC4. There are four channels in there. So how do you get the captions into your TV in the first place? The most popular solution today is actually using a caption encoder. You have a stenographer waiting on your live event. They're listening to your event and using a special device to put the captions in there. So Gio, do you want to add something?

Gio: Yeah, yeah. What I would add is that that is correct. The standard is human captioning with a court reporter for broadcast television, and they have to connect remotely to a closed caption encoder. They used to have to connect via an analog phone line, believe it or not, and sometimes still do. But now they can connect to it via IP. And not only does that have to happen, but oftentimes there are two or three captioners who trade off every half hour. It's gotta be seamless. So it's quite a process.

One other thing I wanna mention is that broadcast captioning was designed to work this way, whereas anything that has to do with live streaming or digital video of any kind started as more of a kludge, to be honest, and evolved into a new method, a new workflow. And it's still evolving. As a result, we see a lot of kinda canned workflows for live video and live streaming having a lot of limitations. And that's what a lot of people will find, because the broadcasters actually have all this worked out. They have the contracts with the captioners. They just put in an order. It's all systematic. They've got the gear in the trucks and the gear in the stations. And there's no way the average person has all of this in order to make it all work.

Zoe: So basically that's the traditional way. You mentioned that it can already be a hassle, but in the traditional way there's an established arrangement. Now, with new requirements, for example live and multiple languages, how does that bring challenges to this kind of traditional offering?

Chris: I can speak a little bit about that. Gio, you can complement that or just chime in. I think the challenge we're trying to talk about is how we enable multi-language captions. When we talk about captions, we might use the word subtitles interchangeably today, 'cause for captions in America there are FCC regulations. You have to have very high accuracy, otherwise it doesn't qualify as a caption. So we might say subtitle, but we're using the words interchangeably today.

Before we move on to multi-language captions for live events, let's see how it's done today in the traditional way, right? I mean, people are used to it. So if you have set up a live event and you need an English caption, for example, large enterprises and most customers today have two choices: either AI, machine-learning-based caption services, or a human captionist, where you have people directly involved with your live event. And you have to book your event in advance and schedule the captionists to be tapped into your live event.

And there are some special devices involved. Typically those are something like iCap or a caption encoder, to be able to send your video and audio, especially the audio, to your captionist. They listen to your event. And the encoder, we call it a caption encoder, normally delays your video by two and a half or maybe three seconds. The delay is to accommodate the captionist typing in the captions, which come back to the caption encoder. So in the caption encoder you now have audio and video delayed a little bit, it syncs that up with your caption text, and it puts them together as a combined transport stream.

Then the next step goes to the general video processing, through a transcoder. If you are doing OTT delivery, you transcode into multiple bitrates, put it into your origin, and deliver the whole thing. So that's typically the process today. But it's very challenging to enable multi-language captions. The reason is that the protocol used today is 708 or 608, and that protocol only supports Latin-based languages. I think the standard specifies seven languages: French, German, Italian, Dutch, English, Portuguese, and Spanish. So you can see that it doesn't allow you to do Japanese or Korean or Chinese, those Unicode two-byte type of languages.

Zoe: So my question... Sorry to interrupt. I'm just wondering, because you mentioned there are seven Latin-based languages currently covered by that standard. What about the languages that do not fall into this category? Do they follow some other standards?

Gio: Yeah, I can speak to that, Zoe. There are standards in other countries for broadcast. Broadcast standards. For example, in South Korea, there's a special 708 that they use, right? For their televisions and cable boxes. But those standards are really focused on cable and over-the-air broadcasts, not so much live streaming. And as a result, let's say I want to do a live stream: it's not the kind of thing that you can just plug and play into your live stream to support these languages.

So to your question, there are protocols out there, but those are also very limited. And I'll tell you another thing. For example, I was in Manila. They have a requirement now to do live captioning. Same thing in Norway. And there was no court reporting software available to even type it. So even if they hired people, there's just no way to use a regular keyboard to type that fast. So really, speech-to-text is the only option for them to do it live.

Zoe: Oh, okay. So they were really doing speech-to-text transcription?

Gio: That's right. And interestingly enough, Zoe, you may not be aware of this, and I think the general public isn't either. Even in the United States, there's such a high demand for human captioning that when you book a human captioner, oftentimes they're using something called a voice writer. It's not always people that type. It's literally people who have trained with a speech-to-text system, and they're just repeating what they're hearing. So there's almost always speech-to-text already in there, even when you hire a human.

Zoe: Wow, okay. I was not even aware. Go ahead, David.

David: That's an interesting point. We've been talking a lot about broadcast-to-digital captions. I'm curious what the current state-of-the-art workflow is for digital-only broadcasts and captions, 'cause we're kinda getting into that realm now. So, can either of you speak to that?

[00:15:50 Current State-of-the-Art Workflow for Digital-Only Processes]

Chris: Yeah, that's a good transition point. What we have seen is that there are a lot of native cloud-based live event workflows, and customers are also moving from on-premises to the cloud. So one of the challenges they're gonna raise is, "Hey, if my live event originates from the cloud and I have to do captions today, I have to bring the stream back to on-premises, run it through the encoding process, and send it back to the cloud." That's one option, right?

Or there are certain service providers that offer something similar to iCap caption encoder services in the cloud, as a separate service. But you still need to go there, get your captions merged, and then go through the transcoding pipeline. So the challenge here is that your video processing pipelines are still following the same path. You use a camera, get your video and audio, take a detour to get your captions in there, and then you do the transcoding and encoding. But they're still using the same protocol, the same 608 and 708 standard. And that standard itself does not support multiple languages. So that's why you cannot just follow that path.

David: Okay, so you need something to kinda bridge that gap from 608, 708 to sidecar files as an example, right?

Chris: Absolutely. For example, yes.

David: Yeah, okay.

Gio: I'd like to add a little bit to that. Just for the folks watching at home, let's take a simple workflow, right? Let's say I'm streaming to YouTube. A lot of people do that, I heard. And let's say you want live YouTube captioning. I'm not talking about uploading your MP4. I'm talking about streaming using OBS or vMix or whatever. You've got two speed bumps along the way if you want that live caption.

The first speed bump is if you wanna do it the way that Chris explained, with a cloud-based caption encoder. You have to send the RTMP there, and they do something to your video, 'cause the caption data has to be in the video, in the codec; they possibly transcode and add delay. Then you have to send it to YouTube. And YouTube can only receive one language for that workflow. And that language has to be one of the Latin-character languages that Chris spoke about.

So the two speed bumps are: no one wants their video transcoding to be done by a captioning system. That's a bad thing. If I'm streaming, I'm spending all my money on a nice encoder, video camera, all of this. You don't want that to be transcoded. You don't want it to be delayed. You also wanna do it in any language. And you wanna translate it as well. You want to leverage all these great generative AI services that everyone's talking about now. Everyone's like, "Hey, I have it on my phone, why can't I do it with YouTube on my live stream?"

It's because that workflow has been built to mimic the broadcast television workflow. That's the problem. And not just a broadcast television workflow, but the American broadcast television workflow, which is based on English, Spanish, and whatever other broadcast languages there are. So if you're gonna do that, the only option, if you're doing say Chinese, is to burn it in live, do this picture-in-picture thing. And a lot of people actually do that. But it's not really closed captioning. And as you know, when you burn it in, you can't turn it off. You can only have one language. I suppose you can kinda stack 'em, but there's limited real estate there. You don't want that either.

So what then? What do you do? I believe the answer is that there needs to be a way where the mechanism doesn't touch your video. It passes through in its original quality, and the captions can be done in any language. And if I wanted, I could translate it into 20 languages, 30 languages if I want. And it should happen immediately. It should happen with AI. And not only that, there should also be a way to have multiple audio channels. What if I want real-time synthesized voice translation? This is really the challenge we're faced with as a workflow for live streaming, because that's the future: using generative AI to have full global accessibility for your live streams.

Zoe: Right, you're basically saying that even if the captioning provides just a couple of languages, say two, and the transcription for those two, you want to make sure there's not much delay. And on the other side, I think you also bring up the other direction: new audio channels. That seems possible, and at least right now we see it happening, because of the current hottest topic in tech, which is generative AI. So all of this is a possibility, right? Potential services that users can enjoy. But then, back to the current standard, there's definitely a limitation there.

Gio: Yes, that's correct. There's a limitation. And we were approached, I'd say three years ago at a trade show, by a major broadcaster, a news agency. They were already captioning in English, because they all have to, it's the law, FCC. So they already had captioning in English, but they wanted to grow their audience. They wanted viewers in Southeast Asia, in Europe, in Latin America. So they approached us and said, "Hey, we have a workflow using AWS MediaLive and we want that translated into 10 different languages live, in real time." And so we spent over a year thinking about it, looking at different options. No one had done such a thing. It was a unique thing three years ago. And we decided on HLS as the standard that would support this quite easily. HLS players are compatible with mobile devices and web devices. There are tons of players. And so that was the beginning. And when you ask is it possible: yes. Not only is it possible, it's been a reality for two years now with this broadcaster.

Chris: Yeah, that's where I think I started working with Gio as well on this live captioning venture. Basically, the technology is already there, like the WebVTT sidecar in the Apple HLS specification to support multiple languages. And we have all the transcoding and video processing pipelines already everywhere. You've got on-premises encoders as contribution encoders or maybe transcoders, and in the cloud there are multiple services available.

What's missing? We recognize that this is an industry-wide problem based on our customer feedback, but in order to facilitate a proper solution, we need to answer some questions. The first thing, when I started working with Gio, was that we agreed the sidecar WebVTT file is the right, how do I say it, the right format we can deliver for this particular solution. Then the rest is figuring out the possible integration points, and a proper way to enable this across the industry.

The first thing is that we need to figure out the best possible integration point. Like Gio mentioned, you don't want a caption encoder touching your video; people want to keep their video pipeline intact, because a lot of people are doing OTT video delivery in the cloud already. How do we leverage their existing video pipelines and turn on the caption capability without making significant changes to their current pipeline? People want to preserve their video quality and their current end-to-end delivery structure; they want that to stay intact.

And also, we don't want something proprietary. We are trying to solve an issue for the industry, so that most other people can benefit. There are so many vendors out there; how can we enable this as a standard? So what we are working on, even though it's not a standard yet, follows a phased approach. What we were thinking is that we leverage WebVTT as the sidecar. Essentially, when you are delivering your live event in the OTT format, we add the captions, and additional audio tracks like audio dubbing, into the master manifest. So you will have a new master manifest for your live event, which all the players natively support.

So you see the benefit, right? When you are able to generate a new master manifest for your live event, you don't have to touch any of your players. Your players natively support this already. And when you enable multiple captions, the customer can just select each one and be done with it. So that's the key thing we were considering.
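
To make that concrete, here is a minimal sketch, not the actual SyncWords or AWS implementation, of what "writing a new master manifest" can look like: fetch the existing HLS multivariant playlist from the origin, append #EXT-X-MEDIA SUBTITLES renditions that point at per-language WebVTT sidecar playlists, and tie the video variants to that subtitle group. The URLs, group name, and language set are hypothetical.

import urllib.request

# Hypothetical locations; a real origin and caption service would differ.
UPSTREAM_MASTER = "https://origin.example.com/live/master.m3u8"
CAPTION_BASE    = "https://captions.example.com/live/subs"
LANGS = {"en": "English", "es": "Spanish", "ja": "Japanese", "ko": "Korean"}

def augmented_master() -> str:
    """Fetch the original multivariant playlist and append subtitle renditions."""
    original = urllib.request.urlopen(UPSTREAM_MASTER).read().decode("utf-8")
    lines = original.splitlines()

    # One #EXT-X-MEDIA entry per language, each pointing at a WebVTT media playlist.
    subtitle_media = [
        f'#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",LANGUAGE="{code}",'
        f'NAME="{name}",AUTOSELECT=YES,URI="{CAPTION_BASE}/{code}/playlist.m3u8"'
        for code, name in LANGS.items()
    ]

    out = []
    for line in lines:
        # Tie every video variant to the subtitle group so players list the languages.
        if line.startswith("#EXT-X-STREAM-INF") and "SUBTITLES=" not in line:
            line += ',SUBTITLES="subs"'
        out.append(line)

    # Insert the media entries right after the #EXTM3U header line.
    return "\n".join([out[0], *subtitle_media, *out[1:]]) + "\n"

if __name__ == "__main__":
    print(augmented_master())

Because the output is still a standard HLS manifest, any player that already speaks HLS exposes the new language menu with no client-side changes, which is exactly the benefit Chris is describing.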

David: I think that's a fascinating approach of kind of proxying it, almost like a man in the middle: pulling everything apart, doing some insertions, and having people consume a downstream master and child playlist. The two things that come to mind immediately are synchronization and delay, latency. What kind of effect does this have? With a livestream, I know it's annoying right now when I'm watching a game here and my neighbors are watching it over satellite or cable or over the air. There's gonna be a 5 or 10 second, for some people a 30 second, delay, unless you're Fox running last year's Super Bowl, which did an amazing job from a latency perspective. Talk about how that's affected by these additional kind of inserts, quote/unquote if you will, going on.

[00:27:55 How Are Synchronization and Latency Affected?]

Gio: Yeah, I think I can speak to that. The good news is HLS already has inherent latency for the most part. We see that. And that allows us to leverage it to synchronize the captioning, which is really important. People don't like to watch the captions and wait five, six seconds to read the text. So the latency is there. To synchronize, we can leverage additional latency, especially if you're doing multiple languages and dubs. And right now, the next stage is actually to try to minimize that as much as possible.

One thing that we're exploring is AWS's MediaPackage version 2, which now supports low-latency HLS, which can really minimize some of this video latency that you talked about, David, when you're watching streaming versus live television. And it's about where we ingest the audio. Do we just pull it from the HLS? That's possible. We could also pull it from a low-latency stream. Maybe they'll send us a WebRTC stream via Zoom or something. That's possible as well.

So there are a lot of options to minimize that latency, and those are being explored right now. But overall, what we're seeing is a huge demand for accessibility in general. And this is so new that people are asking, "Well, what are my options? Do I try to figure out this really difficult workflow, or do I just automate everything?" Which really keeps costs down and makes sure I'm hitting some benchmark where the latency isn't too obvious.

Zoe: May I ask: you mentioned low-delay HLS. If you're using low-delay HLS, how are all of these multiple languages processed, the multi-language captioning? I'm just wondering, because processing multiple languages requires tolerating some processing delay, so how is that compatible with low-delay HLS?

Gio: Yeah, well, basically it's just like when you hire an interpreter, you're absolutely right. They first have to listen to the sentence. Once they understand the sentence, they translate it. Machine translation works very similarly: the sentence has to be formed, and then at the end of the sentence the translation gets triggered. So there's a way to optimize that so that you get synchronization and everything hits at the same time. But in order to do that, you have to be able to read the timestamps from the incoming video.

Zoe: Exactly, yes, right. Because that will be leveraged to synchronize all of these parallel channels. You have the captioning, you have the existing video, and finally they have to be synchronized based on the timestamps.

Gio: That's right. That's right. So there definitely need to be a few seconds for that process to get everything in sync. But as I mentioned, the latency of the stream itself, if it's a normal HLS stream, can be minimized to compensate for the additional latency that synchronization may cause. Having said that, that's kind of an advanced feature, because there's also a way to do it the exact same way broadcasters do it, which is you get that three-to-five-second delay on the captions, which is normal and people are used to it. But to optimize it, there's a way, if we can leverage some upstream processes like low-latency HLS, to get the overall end-to-end delay down from something like 30 seconds to sub 20 seconds.
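
As a rough illustration of the timestamp-based synchronization Gio describes, here is a minimal sketch, under assumed inputs, that turns finalized translated sentences (with start and end times on the stream's media timeline) into a WebVTT sidecar segment whose cues line up with the corresponding video segment. The cue-relative timing anchored by X-TIMESTAMP-MAP is one common arrangement, and the function names, six-second segment duration, and PTS anchoring are hypothetical; real packagers differ in the details.

from dataclasses import dataclass

@dataclass
class Cue:
    start: float  # seconds on the stream's media timeline
    end: float
    text: str     # translated sentence, emitted once the source sentence is final

def _ts(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

def webvtt_segment(cues: list[Cue], seg_index: int, seg_duration: float = 6.0) -> str:
    """Build the WebVTT sidecar segment covering [seg_index*d, (seg_index+1)*d).

    X-TIMESTAMP-MAP anchors the cue times to the MPEG-TS 90 kHz clock so players
    can align the captions with the matching video segment.
    """
    seg_start = seg_index * seg_duration
    seg_end = seg_start + seg_duration
    lines = [
        "WEBVTT",
        f"X-TIMESTAMP-MAP=MPEGTS:{int(seg_start * 90000)},LOCAL:00:00:00.000",
        "",
    ]
    for cue in cues:
        if cue.end > seg_start and cue.start < seg_end:  # cue overlaps this segment
            lines += [f"{_ts(cue.start - seg_start)} --> {_ts(cue.end - seg_start)}", cue.text, ""]
    return "\n".join(lines)

# Example: two translated cues landing in segment 100 of a live stream.
print(webvtt_segment([Cue(600.5, 603.0, "Hola a todos."), Cue(603.2, 605.8, "Bienvenidos.")], 100))

The key point is the one Gio and Zoe make: the translation can only be cut into cues once the source sentence is final, so the sidecar naturally runs a sentence behind, and the shared timestamps are what let the player line it back up with the video.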

Zoe: Sub 20 seconds?

Gio: Or less.

Chris: Yeah, the final end-to-end delay, I think, will be optimized as the industry adopts a common standard for how we deal with live captions. The delay on the captions actually comprises a few major parts. The first is transcription, converting speech to text into a sentence; that's your first delay. The second delay is the translation part. The other part of the delay is the end-to-end system integration for OTT streaming. On that part, there's a lot of room for improvement today, because we don't have a standard, we don't have a native integration yet. But our goal is to listen to the customers. Customers come to us and say, "We need captions." But if you give them a complex solution that needs multiple steps to enable for a live event, they're most likely not gonna use it.

So one of the goals is to enable a workflow or interface that makes enabling captions for the customer essentially a single click. Just imagine you are providing a live event workflow as a cloud provider. Normally the customer today hooks up their video ingest, configures some parameters, and sends it to the origin, and the CDN delivers the live event. It's a couple of configuration options. We were thinking, wouldn't it be nice for the customer to have a UI integration with a button that says, "I want to enable live captions," and then we prompt them: "Okay, here are some of the options you can choose, here's the price," and maybe we state the delay, maybe not.

But the idea is that the customer can select the caption service, enable the caption languages they want, and be done with it. They don't have to worry about how to set it up. They don't have to worry about how the integration is gonna work. So that's the final goal from the consumer and customer perspective. But in order to enable that, our first thought was that the best place for a deeper integration is actually at the transcoder level. The reason is that when you look at OTT transcoding today, it is the transcoder that transcodes the video into multiple bitrates, segments it, and actually writes the master manifest.

And the sidecar VTT is part of the manifest. You have to write it at the same time you're writing the video and audio segments and manifests. So that's where we were thinking is the ideal location to do it, but you can also do it at the origin, if you have a packager type of thing. We have MediaPackage; that's also a possible integration point. But overall, we think the transcoder is the best location to ingest the caption stream, merge the captions in multiple languages, and at the same time write your video and audio along with the captions to produce a live-caption-enabled live event.
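For context, here is a minimal, hypothetical sketch of the other piece the transcoder or packager would keep writing alongside the video and audio renditions: the per-language WebVTT media playlist whose URI the master manifest's SUBTITLES entry points to. Segment naming and durations are illustrative, not any vendor's actual layout.

def subtitle_media_playlist(lang: str, first_seq: int, count: int, target_dur: int = 6) -> str:
    """Emit a live HLS media playlist listing the most recent WebVTT segments
    for one language; it must advance in lockstep with the video playlists."""
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        f"#EXT-X-TARGETDURATION:{target_dur}",
        f"#EXT-X-MEDIA-SEQUENCE:{first_seq}",
    ]
    for seq in range(first_seq, first_seq + count):
        lines.append(f"#EXTINF:{target_dur:.3f},")
        lines.append(f"{lang}/seg_{seq:06d}.vtt")  # hypothetical segment naming
    return "\n".join(lines) + "\n"

# Example: the Spanish subtitle playlist covering segments 98-102 of a live event.
print(subtitle_media_playlist("es", first_seq=98, count=5))

Because this playlist has to roll forward in step with the video and audio playlists, the component that already writes those, the transcoder or packager, is a natural place to write it, which is the point Chris is making.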

Zoe: Right, so basically back to the idea you just mentioned, where something is being implemented during transcoding. We already know that for video, if we want to do HLS streaming, people always talk about ABR, right? Multiple resolutions, multi-bitrate transcoding. And they also do the HLS segmentation. So at that stage, what we're talking about here is providing the captioning stream, which can be processed within that module and finally bundled together with the video and audio and made ready. So we've analyzed what is actually being done today.

So I just wonder, as a user for example, at what stage of development is all of this? Is it already ready, or... I think we're talking about a kind of service, right? Caption-as-a-service. And the end goal, as Chris mentioned, is to provide just one button for users to get this feature enabled, so they can suddenly enjoy this great feature of multi-language captioning. I just wonder, what are the current stages, and what is actually being standardized? So that our audience listening to this can have a clear idea about what is going on and what they can expect.


[00:38:44 What Are the Current Stages and What Is Being Standardized]

Gio: Yeah, that's a good question. Let me try to begin, and perhaps Chris, you can give more detail. One of the things we wanted to do is provide a solution for a standard workflow, using a standard platform that does HLS streaming. So we primarily looked at the MediaLive, MediaPackage, CloudFront workflow that's available there. And the reason is that if you look at all the different OTT solutions out in the jungle, out in the wild, everyone does things slightly differently. And that's just how it goes. Not all HLS manifests are equal, I guess you could say.

And so from our perspective, we wanted to make sure that a standard workflow, a workflow that's popular and in production in many streaming setups, is supported. We also wanted to make it simple for people to use. So on our side, we've built a simple user interface where people can just plug in their manifest, select the languages, hit go, and that's it. It's very simple, and it has to be scalable. Originally, this was a bespoke solution two years ago that we did for a broadcaster. Now we've built that into a very simple, very scalable platform that people can just leverage.

So that was critical, because it needs to be scalable. Imagine, there are millions of live streams happening every day. Anyone that's doing HLS should be able to get this working. Beyond that, there has to be an API available for all of this, so that there can be even simpler integrations for the user, where you could build it in when you're setting up an encoder, for example, and say, "Hey, I'm gonna stream, and oh yes, I do want captions and I want these languages." So that's also something that is in production now, but there's still a lot of work to be done to optimize all of these things.

Yes, it's the kind of thing that's quite new, but the goal is that anyone could use it. If they're setting up a live stream with MediaLive, they should just be able to plug that right into the AI translation and speech-to-text engines.

David: I did have a question coming back to Chris. I wanna make sure I understand this, and it might be off on a tangent, so I apologize if it is. But I think what you were proposing, if I'm correct, is something akin to the transcoder becoming a little bit like the captioning engine or caption encoders that are out there today. Am I right in understanding that you were proposing a change, like a transcoder specification, to accept the additional languages, the additional subtitles or data files, before it writes anything out? Is that what you were proposing?

Chris: Yeah, I can speak to that. What we were ultimately targeting doesn't exist today; that's the end goal. So we have a phased approach. The first phase, what we built with SyncWords, doesn't touch the OTT pipeline. SyncWords has an API backend and a UI integration, so it's a fully automated process. The customer can just create their live-stream OTT pipeline, give the URL for their event, and captions will be enabled by leveraging the SyncWords caption backend engine.

Ideally, when we come to the final solution, we want to separate out the caption services, because there are so many ASR engines out there. We want to make sure the customer has a choice of caption services. And then you have a transcoding pipeline. It doesn't matter whether it's MediaLive or some other transcoder; we don't want to restrict the customer's video processing pipeline. So what we want to talk about is the bridging. How do you bridge your transcoders with the caption services? How do you marry those together without dictating who to use? That's what we are talking about.

It's an open, API-based interface where all the caption services can implement the API. Let's say they implement the API saying, "This is my caption service. I provide English, Chinese, maybe Korean, Japanese," all the languages available based on their capabilities, plus audio dubbing and multiple audio tracks, right? So that registration API, or call it the capability API, is going to be the starting point. Which means the transcoder, when polling the API, will know what kind of services you offer. And in the response, you can say what the service's entry point is. So let's say I call the API from SyncWords, they give me an entry point saying, "This is the RTMP endpoint, you need to send your audio to me, and this is what I require."

So that bridges the transcoding process with the caption process for audio ingest. And then, based on that feedback, when the transcoder starts a live event, because we're focusing on live events, not 24/7 channels. A 24/7 channel you only set up once, but for live events you're constantly creating and stopping events. So when you're creating an event, you use the API layer to talk to the captioning service provider and say, "I'm going to create an event. I'm going to send you my audio and video, and you give me back the information on where I can get all the multi-language captions from you."

That's how we bridge it, so that the encoder or transcoder knows where to pull the caption information from. And by going through this pipeline, even if the caption service fails, your video and audio pipeline still works. You just need to build in the mechanism that says, if you don't see the captions coming in, ignore them and keep going, right? So I think this kind of open interface will enable the workflow, let customers select different captioning services, and enable a redundant and reliable workflow without a lot of hassle or heavy lifting to create their own solutions.
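
To make the shape of such an interface concrete, here is a minimal, hypothetical sketch of the capability and event-creation calls Chris describes. None of these endpoints, field names, or URLs are a published standard or an existing SyncWords or AWS API; they only illustrate the flow: the transcoder polls capabilities, creates an event, receives an audio entry point plus the location of the caption output, and treats caption failure as non-fatal.

import json
import urllib.request

CAPTION_SERVICE = "https://caption-provider.example.com/v1"  # hypothetical base URL

def _get(path: str) -> dict:
    with urllib.request.urlopen(CAPTION_SERVICE + path) as resp:
        return json.load(resp)

def _post(path: str, body: dict) -> dict:
    req = urllib.request.Request(
        CAPTION_SERVICE + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def start_captioned_event(wanted_langs: list[str]):
    """Transcoder-side flow: discover capabilities, then register a live event."""
    try:
        caps = _get("/capabilities")          # e.g. {"languages": [...], "dubbing": true}
        langs = [l for l in wanted_langs if l in caps.get("languages", [])]
        event = _post("/events", {"languages": langs, "audio": "rtmp-push"})
        # Expected (hypothetical) response fields:
        #   audio_ingest : RTMP endpoint the transcoder pushes program audio to
        #   captions_url : where the per-language WebVTT playlists will appear
        return {"audio_ingest": event["audio_ingest"], "captions_url": event["captions_url"]}
    except Exception as err:
        # A caption-service failure must never take down the video/audio pipeline.
        print(f"caption service unavailable, continuing without captions: {err}")
        return None

if __name__ == "__main__":
    print(start_captioned_event(["en", "es", "ja"]))

The error path mirrors the redundancy point above: if the caption service does not respond, the event simply runs without captions rather than stalling the stream.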

David: Got it, thank you. That's much more clear.

Zoe: Right, so basically I see that right now the APIs are being defined and tried out. The fundamental idea is that you want the captioning services, as a provider of a service, to stay somewhat independent of the existing infrastructure, and then find a good way for the two services to come together with scalability and affordability. And at the same time, offer this service in the easiest way, with the flexibility for users to easily turn it on or off and select the languages.

At the same time, by bridging these two services with that flexibility, you can even give users the freedom to choose which caption service they want. But there are also delay issues to optimize, because flexibility sometimes comes at the cost of longer delay. So, in my understanding, this is trying to build simplicity, flexibility, and scalability, and at the same time the whole system, the interface being defined, needs to provide optimized service, especially for live events.


Chris: Yeah, you're absolutely right. One of the fundamental reasons is that we want to enable separation of duties. Transcoding does the transcoding job, whatever it does best, and the caption services do captions best, without having to mingle them together. The API layer is more like the enabler or facilitator of the workflow.

Zoe: Got it, yeah. So this is already being explored, and from it we can see what is being defined and what is being standardized. What I feel is that the interface is being standardized, or at least explored, and then others can just follow. But underneath, there are a lot of issues we're talking about. We want this to have flexibility, and users will have freedom, but there are also a lot of potential languages involved, and it has to be optimized, because we're talking about targets like sub-20-second delays. So: provide more, but still match the existing offering, or even improve on it. That's the ideal.

Gio: Yes, I think optimization is where we're headed with this, for sure. Also, simple and quick to set up, right? If people can set up their streams with just a single click and not have to worry about it, that's really the goal. One thing I do wanna comment on: what Chris said is true, there are kinda two categories. I mean, there are a lot of categories, but you can put them into two, just to be general. There are 24/7 live streams, I'm talking news and sports channels. And then there are streaming events, where for example you have people streaming throughout the day: short 30-minute, one-hour, two-hour streams. And they're both very different workflows.

On the 24/7 side, there's a challenge, because it is live right now and they can't stop it. Right? That's just not gonna work. And setting up a separate stream is also a non-starter for various reasons. So there has to be a solution that can fit right on top of something that's live right now, that's going on all day, every day. And that's also another challenge that we're meeting with this: to say to the people who are broadcasting 24/7, you don't have to change what you do. You can stay with what you do, and this is just a way to augment your delivery and add accessibility to it. And in a lot of cases, that's music to their ears, because you can imagine how difficult it would be for them to change a workflow that's already in production.

Zoe: Right, got it. For me, this is a really new area, because we've always talked about video and audio, and captioning seems like an auxiliary channel alongside them. But as we've discussed, what we're really talking about here is an enabler: enabling multiple languages, so that anyone can enjoy different languages without a lot of rework, with a flexible mechanism underneath that makes this possible.

And then we also talked about how, because of AI, things like audio synthesis are now quite possible, so further services can be provided to address a much larger market opportunity. To me this is exciting, and I can see that even though we're talking about one solution here, this solution prototype basically targets standardized APIs that can serve as a reference for addressing all the similar needs in this field.


David: Yeah, and on that point, have you guys thought about creating a draft API specification as a standard? Is there any progress on that front?

Gio: Go ahead. Okay. Hey David, yes, yes. There are things like Postman collections, and we're actively exploring that, because we want everyone who wants something like this to be able to do it. And we wanna make it simple for them, with less head scratching and more just onboarding for this type of thing. So at least from my side, that's really important. And for you guys out there in the audience, if you're a sci-fi fan, I know I am, I loved Star Trek growing up. This is essentially like the Universal Translator, where you had Klingons speaking and you could hear them in English. That is, I think, what we wanna build in some fashion for live streaming. And we feel that we're very close to being able to leverage these amazing engines that do speech-to-text and translation in a way that can be done in real time.

Chris: Yeah, I'll add to that. It's a really hard problem for the industry. It seems simple, just a translation and transcription service, but when you really want to enable that in a workflow at production scale, scalably, it's not that easy. There are too many parts that can fall down. We know the goal we are trying to achieve, but in reality it's going to be a step-by-step, phase-by-phase approach. Actually, I'm very happy and excited to be working with Gio and SyncWords. We do have a solution today; even though it's not the ultimate goal, it's working very well. Gio and his team spent a lot of effort fine-tuning this, lowering the latencies, and making sure it works. A lot of work has been done there. I'm very, very excited.

Zoe: All right. It has been almost an hour, so I'm going to wrap up this episode. Thanks for bringing a topic that we haven't touched on before. We usually talk about video, three dimensions; images, two dimensions; audio, one-dimensional signals; but now we're talking about captioning. It seems like it's just text down there, but it brings a lot of possibilities. At least for me, I got a lot out of this episode, and I'm really grateful to Chris and Gio for coming to talk about this topic. I think we peeled back a lot and looked at what is really inside this field, and we covered quite a few possibilities to standardize this, make it flexible, and define potential standardized APIs to address this space. So, thank you so much. That's the end of this episode. Thanks to our audience for listening, and I'll talk to you next time.
