You see closed captioning everywhere nowadays. You probably even use it at home on your TV, or on your phone when watching videos in a crowded place.
Closed captioning is not only handy, but it is also a necessity for the Deaf and hard of hearing.
Closed captioning is also required by law.
In this session, Mark Shapiro speaks with Darlene Parker, who has extensive closed captioning experience. She is an advisor to the FCC's Disability Advisory Committee. She retired as Director of Partnership Development from the National Captioning Institute (a nonprofit captioning services company). Darlene was also the sixth person in the country to do real-time captioning for broadcast.
Lily Bond, Senior Vice President of Marketing at 3Play Media, also joins Mark to dive deeper into services offered by closed captioning companies.
Lily and 3Play Media also provide a breakout session titled "Quick Start to Captioning."
Closed Captioning Services Discussion
Transcript for Closed Captioning Services Discussion
Opening
Lori Litz
Hi everyone! Welcome to today's Accessibility.com event, Closed Captioning Services. If we haven't met before, my name is Lori Litz. I am the Director of Conferences here at Accessibility.com. We're so excited to have you here today. We have a great lineup ahead for you. First up, we'll have our President, Mark Shapiro, speaking with Darlene Parker, who is currently on the FCC’s Disability Advisory Committee.
She retired from the National Captioning Institute as a Director of Partnership Development. And... and this is really amazing. She was the sixth person in the country to do real-time closed captioning. She's just a wealth of knowledge about closed captioning. I've had some really interesting conversations with her and I am so excited for you to meet her, if you haven't before, and hear what she has to say about closed captioning.
From there, we'll have our friend Lily Bond, who's the Senior Vice President of Marketing at 3Play Media. She'll dive into 3Play a little bit more to give you a more in-depth look at a closed captioning services company. We'll have some questions and answers. Please type your questions into the Q&A. If we don't get to them today, we will follow up with you.
Today's event is recorded and will be available this evening for your viewing pleasure. You'll get an email from me at some point this evening with instructions on how to access the recording and transcript. And without further ado, here we are with Mark and Darlene.
Presentation
Mark Shapiro
I'd like to thank everyone for joining us today. Today we're going to be talking about closed captioning. I'm Mark Shapiro, President of Accessibility.com, and I'm here today with Darlene Parker. Darlene currently serves on the FCC's Disability Advisory Committee and was a Director at the National Captioning Institute, where she worked for 40 years. She was the sixth person in the country to provide real-time closed captioning for TV and coauthored several books on real-time captioning.
Darlene, welcome. I appreciate you coming on board and helping us understand about closed captioning.
Darlene Parker
Thank you for having me, Mark.
Mark Shapiro
To start with, what is closed captioning?
Darlene Parker
It's the text that is displayed on a screen of the audio portion of the program. And the reason we call it closed captioning is because when it first came about in the broadcast world, the captions were open, meaning they could not be turned off. They were big white letters across the screen. And I will talk about that more a little later.
And to view closed captions in the early days a set-top decoder box had to be connected to your TV set. And starting in 1993, it was mandated that every TV sold or manufactured in the U.S. must contain a decoder chip which made the decoder boxes unnecessary.
Mark Shapiro
Are there different types of closed captioning?
Darlene Parker
Yes. Until a few years ago, live captioning was always done by a person, either on a stenotype machine or by a voice writer repeating what they heard. And prior to becoming a stenocaptioner, that person was required to graduate from court reporting school. The voice writers may have had prior training or were more likely trained by the captioning company that hired them.
The screening process to become a voice writer is rigorous. One must have great recall, be able to punctuate on the fly, and stay calm under pressure. Just because somebody can talk doesn't mean they can be a voice writer. Eventually captioning expanded to online content, onsite conferences, college courses, and meetings, with stenocaptioners that were often called CART writers. CART stood for, and still stands for, Computer-Aided Real-Time Translation, and they provide captions for Deaf and hard-of-hearing individuals.
I'd like to pause for a second here and say that the correct term to use is always Deaf and hard-of-hearing and not hearing impaired.
Mark Shapiro
Great point. When do you see closed captioning actually being used?
Darlene Parker
It's used to make any content accessible to anyone in any setting. Broadcast, online content, business meetings, college courses, conferences, etc.
Mark Shapiro
So for Zoom meetings would that be a place you'd see people using it?
Darlene Parker
Yes, absolutely.
Mark Shapiro
Are there any specific laws requiring closed captioning?
Darlene Parker
Well, my understanding is that the ADA, the Americans with Disabilities Act, says captioning must be provided in places of public accommodation, which are public or private businesses that are used by the public at large; private clubs and religious organizations are exempt. And aside from the law, providing captioning is a good business decision and it's the right thing to do.
Section 508 requires all federal government agencies' electronic content to be accessible. And lastly, the FCC oversees broadcast television.
Mark Shapiro
Okay. Darlene, what should we look for when we're considering a vendor?
Darlene Parker
Well, you want to look for somebody, obviously, that has experience in the particular type of captioning that you want to do. And you can certainly ask them what their experience has been.
Mark Shapiro
How are these vendors providing their closed captioning services?
Darlene Parker
For live events online, the client provides the captioning company with a link to their website and a good-quality audio source. Captions are sent to a closed caption encoder, an IP address, or a standalone widget. And for broadcast captioning, it's delivered based on each client's workflow requirements. And even though many online meeting platforms have their own built-in automated captioning, they can also allow human captioners to provide captions by being assigned as a third party, so their captions would override any captions that the meeting platform might provide automatically.
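To make that delivery step concrete, here is a minimal, purely illustrative sketch of a live caption feed pushing text to an encoder. Nothing here comes from the session: the host, port, and newline-delimited line format are invented for illustration, since real encoders and meeting platforms each define their own interfaces.

```python
import socket
import time

ENCODER_HOST = "encoder.example.com"  # hypothetical encoder address
ENCODER_PORT = 9000                   # hypothetical port

def send_captions(lines):
    # Open one TCP connection and push each caption line as it is written.
    with socket.create_connection((ENCODER_HOST, ENCODER_PORT)) as conn:
        for line in lines:
            conn.sendall((line + "\r\n").encode("utf-8"))
            time.sleep(2)  # live captions trail the audio by a few seconds

send_captions([">> GOOD EVENING, AND WELCOME.",
               ">> THANK YOU FOR HAVING ME."])
```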
There are also many captioning companies and tech companies who provide ASR, Automated Speech Recognition, captions to the various online meeting platforms.
Mark Shapiro
When I'm hiring these vendors, what's a reasonable accuracy rate that I should be expecting?
Darlene Parker
Well, experienced vendors will have a higher accuracy rate. But the accuracy will depend on the quality of the audio and the difficulty of the content. And across the industry, considering all different types of live captioning, accuracy could range from the low nineties to the high nineties, so from about 92% up to 99%. That's live. Post-production captioning is more accurate and more verbatim than live captioning, and it really should not have any errors at all.
Mark Shapiro
If I'm running a live event and I'm at 92% accuracy, does that cause major problems or is it more of an inconvenience?
Darlene Parker
I think it can cause problems. I think it can impede comprehension. Yes.
Mark Shapiro
And do vendors actually guarantee a specific number? Like they would actually say, we will be at 95% or we'll be at 90%.
Darlene Parker
Many vendors do guarantee a number. Or a range.
Mark Shapiro
What are you seeing as kind of the traditional turnaround time, or what should my expectation be when I'm hiring a vendor, in terms of turnaround time for post-production closed captioning?
Darlene Parker
Well, it depends on the volume. Standard delivery is 3 to 5 business days, but it can be turned around the same day or the next day if necessary. The client just needs to make their needs known as soon as possible to the captioning company.
Mark Shapiro
Budget is obviously a question that's going to come up in everyone's mind. What's a reasonable budget for closed captioning? How would it even be charged? Is it... is it by hour? By word? How do most of these companies look at billing?
Darlene Parker
Yeah, it depends on if it's live or if it is post-production. If it's live, it's going to be charged by the program, if it's a 30-minute program, for instance, or by the hour if it's an hour-long program or a several-hour-long program. For post-production, it can be charged by the minute if it's a short piece, or it may be charged by the hour.
And the rates are dependent on many variables. You know, like I said, is it live captioning with humans, or is it with ASR, automated speech recognition? If it's ASR, that can cost 50% less than a human captioner. Post-production captioning is also referred to as prerecorded or offline captioning, and that costs more than real-time captioning because it actually takes longer to accomplish than real-time captioning, which just lasts as long as the event lasts. And also the accuracy standards for post-production are higher, because they have the time to make sure that everything is correct.
And when setting a rate, captioning companies may take into consideration, certainly, volume. Is your event just a one-off event? Will it occur daily, weekly, monthly, once a year? Those are all factors that need to be considered. And are you looking for multiple languages, perhaps? That would increase the rate as well. And if it's post-production captioning and you need a quick turnaround, then that will be a higher rate.
Mark Shapiro
What's a reasonable lead time for securing services?
Darlene Parker
Again, the word volume comes up. It depends on volume. Is the job several events over a period of time? Is it one event lasting three hours? For shorter events, a week's notice is fine, and even less than that if you don't find out until later that you need to provide captioning. I would like at least a week's notice, especially if there's a lot of volume. For larger events, such as multiple concurrent live sessions during a conference, as much notice as possible is preferable; a month would be great if possible. The reason is that the captioning company is going to have to provide more than one captioner for each hour of that job. And certainly I know the client doesn't always know a month in advance what's happening, but as much notice as possible is always the best way to go.
Mark Shapiro
Should I assume when I get my transcript back that it will be accessible? When I get it back from the vendors?
Darlene Parker
Well, when using a captioning company, you shouldn't expect a transcript to be delivered automatically. Certainly, if you need one, it's best to make that request in advance. There's no problem providing that. And if you forgot, just make the request as soon after the event as possible.
Mark Shapiro
Darlene, I'd like to thank you for helping us understand closed captioning. Appreciate it.
Darlene Parker
You're very welcome. It was a pleasure to be here.
Mark Shapiro
We're now going to do a deep dive into a company that offers closed captioning services. We have with us today Lily Bond. She is the Senior Vice President of Marketing at 3Play Media. Lily, I want to thank you for allowing us to do this deep dive into your company.
Lily Bond
Thank you. Really excited to be here.
Mark Shapiro
To start with, can you share with us the background of your company?
Lily Bond
Sure. So 3Play Media was founded in 2008 out of MIT. It was founded by four people who were trying to solve the problem of how to make recorded captioning faster and cheaper for businesses. And they ended up developing a proprietary editing technology that combined best-in-class ASR and speech recognition with human editing, well before anyone else in the industry was doing that, to produce extremely high-quality output at a fraction of the traditional cost of captioning.
So, very early to the space of ASR and captioning, but really with the angle of how we can use humans along with ASR to produce extremely high-quality, enterprise-level outputs.
Mark Shapiro
Cool. So how does this translate into different products and services that you guys offer?
Lily Bond
Yeah. So our first product was recorded closed captioning, and today we offer a suite of other video accessibility services, all focused on this combination of technology and humans. So we provide closed captioning, transcription, live captioning, subtitling and translation, and audio description for Blind viewers.
Mark Shapiro
How much advance notice do... would you expect from companies? What's reasonable? Is that a week? A month? A day?
Lily Bond
Yeah. We don't need any notice at all. You can just upload any time and we'll be able to process captioning for you immediately. Any individual customer can submit 40 hours of content to us in a single day, and we'll be able to get it back to you within the guaranteed turnaround time. For live captioning, ideally we ask for 24 hours' notice, but we have a scaled marketplace of live captioners, and typically they'll... they'll assign themselves to an event within 5 minutes of it going online.
Mark Shapiro
What would you say sets you apart from... A lot of people would consider using independent contractors or even your competitors. What would you say sets 3Play apart?
Lily Bond
Yeah, I would say that we're an enterprise solution. So if you are a business that is looking for a scaled approach to video accessibility, or that centralized approach that I talked about, we're going to be able to give you that level of service. We're going to be able to give you the SLAs on turnaround, all of the guarantees on making sure that your files are back and captioned, and all of the quality SLAs in place that you care about for your brand risk.
We have a full-service platform where you can order any of these services in one place. So it sets you up for what you need immediately, but also what you might need in the future. Really, what sets us apart is this partnership approach for enterprise organizations that are looking for a full-service, centralized solution.
Mark Shapiro
That's great. Lily, thanks for allowing us to do the deep dive. Appreciate that. That's really helpful for our audience, especially anyone who's new to this, to understand what companies offer.
Lily Bond
Happy to be here.
Mark Shapiro
And if you can stick around, we're now going to do our Q&A session. Got a bunch of questions. First question: are there best practices a company should follow when putting closed captioning on their videos? Darlene, why don't you take this one?
Darlene Parker
Yes. Even when providing captioning for a nonbroadcast client, the FCC's 2014 order enumerating best practices for prerecorded programs will most likely be followed by captioning companies. In addition, experienced companies will also have their own rules and guidelines. A quick synopsis of the FCC's best practices for offline, also known as prerecorded, programming is that captions will be accurate, synchronous, complete, and as appropriately placed as possible.
Those captions should be verbatim. Captions should be error free and punctuated correctly so as to facilitate comprehension. Captions should be displayed with enough time for them to be read, and they should not obscure visual content. They should include speaker IDs and a complete representation of the audio, and that means ambient noises and other nonspeech information.
Mark Shapiro
What are the legal requirements for closed captioning? Lily, can you take this?
Lily Bond
Sure. So there are several laws in the U.S. that apply to closed captioning. The very first was the Rehabilitation Act of 1973. There's a section in there, Section 508, and the Act is basically the first broad nondiscrimination law for people with disabilities. It applies to government and federally funded programs and requires equal access to video content through closed captioning.
The Americans with Disabilities Act of 1990 applies to public entities and also places of public accommodation, and requires closed captioning for video. This is the law with the most litigation, because it was written in 1990, before technology and the web were as prolific as they are today. But in almost all cases, the litigation points to closed captioning and digital-first experiences being covered under the ADA.
And this would apply to any organization that provides a public accommodation. So lots of higher ed, lots of businesses, anyone that is providing business online. And then in the media space, the FCC has very strict quality standards for closed captioning and very strict requirements. And then the CVAA is kind of an extension from the FCC to apply to streaming media that previously appeared on television.
So that's when you're thinking about your Netflix or your Disney Plus; those are all covered under the CVAA. These laws all require captioning for different industries. And then the overarching standards, I would say, are covered under the Web Content Accessibility Guidelines, or WCAG, which say exactly what an organization must do to be compliant with accessibility standards.
So the ADA might say you have to be accessible, and WCAG is going to tell you what you have to do to be accessible. All of this covers closed captioning.
Mark Shapiro
Next question. What are open captions? Darlene, why don’t you take this one?
Darlene Parker
Sure. Open captions are always displayed, like subtitles. For closed captioning, however, you have the ability to turn the captions on and off. I like to provide just a brief history of how we got here. In 1972, The French Chef with Julia Child was the first open-captioned program: big white letters across the screen that could not be turned off.
In 1973, ABC World News Tonight with Frank Reynolds was the first regularly scheduled news program to be captioned. They recorded the 6:30 feed, got it onto a big eight-inch floppy disk, and then at 11:00 they punched out on a computer keyboard, line by line, what Frank Reynolds had said at 6:30. And that's all there was for the news until 1982, when it became real time.
So eventually it was felt that, you know, we need to have closed captions. We don't want them always open; we want to be able to turn them on and off. And the National Bureau of Standards had been working on this, on closed captioning, and in 1972 they demoed closed captioning of the popular TV show at the time, The Mod Squad, on a specially equipped TV set.
And this was at Gallaudet University. And it was eventually decided that a nonprofit organization was needed to provide the closed captioning. So in 1979, the federal government and several networks established NCI as a nonprofit. The first prerecorded captions were on March 16, 1980, for The Wonderful World of Disney. And then NCI provided the first live real-time captioning in 1982 for ABC's World News Tonight.
Mark Shapiro
What's the difference in quality? When you say quality, isn't it just, you know, typing out what somebody's saying?
Lily Bond
So yes, but the level of accuracy that people are typing that out with varies. When you're looking at the dollar-a-minute range, you should expect a lower accuracy rate, because there are fewer checks and balances in place to ensure that the quality is over 99% accuracy. In the $2 to $3 per minute range, you're getting extremely high quality, over 99% accuracy.
And then when you're in the $5-a-minute-plus range, there are some additional standards required by streaming media that incur extra costs. It's not about word-for-word accuracy at that point; it's about including elements like being able to format for different commercial breaks, and including music notes and very specific lyrics, based on differing standards by streaming platform.
Those are just much more extensive requirements that are less about the accuracy of the transcript and more about the specific standards of the streaming platform.
Mark Shapiro
Okay. What should I expect as an accuracy level, let's say from a company like 3Play?
Lily Bond
At 3Play Media, we guarantee over 99% accuracy on every file. We measure at 99.6%, and we have an SLA in place to guarantee that quality on every file.
Mark Shapiro
Okay. This is a really good question. Why shouldn't we just use automatic speech recognition? Zoom has it built right in. Is there anything wrong with just using that? Is there a benefit to actually paying a firm to go through for a live event?
Darlene Parker
Again, this depends. In my capacity at NCI, I was part of the team that conducted analyses of ASR versus live human captioning. The decision to use ASR or a human captioner will depend on the content, the difficulty of the content, and whether good-quality audio can be provided. To obtain the best possible results with ASR, it is imperative to have good-quality audio. ASR technology has come a long way in the last few years.
ASR with good-quality audio and generic content is extremely accurate. If the vocabulary is challenging and can be provided in advance, the ASR system can be programmed with that specific vocabulary to help increase the accuracy. ASR's strength is verbatimness; it is more verbatim than humans. The downside is that ASR can misinterpret words. Human captioners can use their critical thinking skills and take their best guess at what was said if they're not sure.
Another downside to ASR at this time, and I say at this time because it's always evolving, is that it does not capture song lyrics well. The music obscures the lyrics. If there is a good deal of singing in an event, captions may be absent or spotty for that portion. With most other programming, though, ASR can be on par with humans.
ASR is more accurate in captioning prerecorded videos than live programming, because it can take much more time to interpret what was said. With live captioning, time is of the essence, because you have to ensure that captions appear on the screen as soon as possible.
Mark Shapiro
Great. That's all the time we have for the question and answer. If anybody has any other questions, just send them in and we'll get them answered.
Closing
Lori Litz
Thank you, Lily, Darlene, and Mark for such an engaging conversation on closed captioning services. Today's event was recorded and will be available later this evening. I'll send you an email with instructions on how to access it and the transcript. So if you had to step away or got here late, you'll be able to reconsume the content that's available. Today is not over yet, though. Out in the lobby, or if you're already viewing this from the lobby, up next 3Play Media's own Lily Bond is back for "Quick Start to Captioning," so you can learn a little bit more about what's involved with working with a closed captioning services company and how to get your content closed captioned. You can also head out to the Expo Hall and visit with 3Play Media.
And coming up next for Accessibility.com on March 12th is PDF Remediation Services. So that's an interesting one. If you've ever tried to make your PDF accessible or need to make your PDFs accessible, that's a great event to come back for on March 12th. As always, thank you all for attending. We're so happy to have you here and we'll see you next month.
Quick Start to Captioning
Lily Bond
I'm gonna go ahead and get started. Thank you, everyone, for joining us today. The session is titled Quick Start to Captioning, presented by 3Play Media, and my name is Lily Bond. I'm the SVP of Marketing at 3Play, and I've been there for about 10 years. Just a quick self-description: I'm a white woman in my thirties, with short brown hair, wearing a grey sweater. Before we get started with today's presentation, I have a few housekeeping items to cover. This presentation is being live captioned, and you can view those captions by clicking the CC icon in your control panel.
It's also being recorded and will be shared by the Accessibility.com team after the event.
Please feel free to ask questions throughout the presentation using the Q&A window or the chat, and we'll save questions for the end of the presentation. With that, I will share my screen, and we can get started.
Perfect. As I said, I'm Lily. I'm the head of marketing at 3Play Media. I run our marketing team and strategy, as well as a lot of our thought leadership, and I'm really passionate about the space of accessibility and present very often on the topic. For a quick agenda: I'm gonna go over a closed captioning overview and a live captioning overview, and cover the benefits and compliance requirements for captioning. Then, we produce a report every year called the State of ASR, or Automatic Speech Recognition, and we'll go through some of our findings on the latest state-of-the-art technology when it comes to speech recognition. And then we'll save some time at the end for Q&A.
So first of all, what are closed captions? I'm gonna define a few of the basics here so everyone's on the same page. Closed captions are time-synchronized text that syncs with the spoken audio track in a video, usually noted with the capital CC icon on a video player.
And this is an accommodation for Deaf and hard-of-hearing viewers. It was mandated by the FCC in the 1980s and is also covered under other legislation, like Section 508 of the Rehabilitation Act, the Americans with Disabilities Act, and the CVAA. Captions should include not just the spoken word, but also important non-speech elements like speaker identifications and relevant sound effects. The key there is relevant. For example, if you hear the sound of keys jangling in someone's pocket as they're walking down the street, you would not want to add a caption for keys jangling, because it's not relevant. But if it's a horror movie and there are keys jangling behind a door, that's a critical plot development that would need to be included in the caption track.
Just to get everyone on the same page with some terminology.
I want to cover captions versus subtitles versus transcripts. So captions, at least outside of the UK, refer to the source-language text track that is time-synchronized with the video. Captions really assume that the viewer can't hear the audio, so they include all of those critical non-speech elements, speaker identifications, etc.,
whereas subtitles assume that the viewer can't understand the language. So subtitles refer to a translation of the audio into another language, and these often do not include those non-speech elements, because the sound effects and other non-speech audio can be consumed by the viewer without a translation. But the actual spoken word must be translated.
The only caveat here is that in the UK, they often use the term subtitles to refer to both captions and subtitles as we would call them in the US.
And then transcripts are similar to captions, but they're not time-synchronized. They contain the text of the audio, but without time codes. A transcript would be a relevant accommodation for something like a podcast, where there's no video element, but closed captions are required if you do have both video and audio.
So there are multiple ways to create captions. There's a do-it-yourself method: you would basically transcribe the audio, the speaker identifications, and the non-speech elements, and go through and add time codes to every line, so that you have what we would call caption frames, text paired with time codes.
You can also start from automatic speech recognition, or ASR, which will give you a rough draft of both the text and the time codes, and then you can edit that to bring it up to a higher quality. Or you can use a captioning vendor. There are lots of different vendors out there; 3Play is just one of them. But the key here is you want to look for a guarantee of over 99% accuracy, or you're not getting the quality that you deserve starting from that speech recognition process.
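To picture what those caption frames actually look like, here is a minimal Python sketch of the do-it-yourself route: each frame is a span of text tied to start and end time codes, written out in the common SRT sidecar format. The frames and file name are invented for illustration.

```python
# Each caption frame pairs a start/end time (in seconds) with its text.
def to_srt_time(seconds: float) -> str:
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

frames = [
    (0.0, 2.5, ">> Welcome to today's session."),
    (2.5, 5.0, "[APPLAUSE]"),  # a non-speech element, as discussed earlier
]

# Write the frames as a numbered SRT sidecar file.
with open("captions.srt", "w", encoding="utf-8") as f:
    for i, (start, end, text) in enumerate(frames, 1):
        f.write(f"{i}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}\n\n")
```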
So I want to talk a little bit about why caption quality matters. Accurate captions are required under the law.
It has been noted in multiple lawsuits that automatic captions are not good enough for closed captioning as an accommodation. And there are a lot of things that go into quality.
First of all, you want to see that 99-plus accuracy rate. That's a word error rate that we're talking about there. And even at 99% accuracy, you're gonna see 15 errors in 1,500 words. So you can imagine, as that accuracy rate goes down, you're gonna see a lot of errors that really compound over time.
There are also some standards, particularly under the FCC, around what makes quality captioning. This includes the accuracy that we just talked about, in terms of word-for-word accuracy rate. It includes placement: the captions should not obstruct other critical visual information, and they should move to the top of the screen if there's something in the lower third.
It also covers completeness: you want your captions to cover everything from the very beginning of the file to the very end of the file. You don't want them to cut off early or to leave out part of the audio track.
And then synchronicity: the captions should be as in sync as possible with the spoken word. There are some other best practices around quality, particularly if you are creating captions yourself: within a single caption frame, no more than 32 characters per line and no more than one to three lines per frame, captions should last a minimum of one second on screen, and they should use a sans-serif font.
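Those do-it-yourself rules are easy to check mechanically. Here is a small sketch that validates a frame against the thresholds just listed (32 characters per line, one to three lines, at least one second on screen), reusing the (start, end, text) frame shape from the earlier example; everything else is illustrative.

```python
def check_frame(start: float, end: float, text: str) -> list[str]:
    # Return a list of rule violations for one caption frame.
    problems = []
    lines = text.split("\n")
    if not 1 <= len(lines) <= 3:
        problems.append(f"{len(lines)} lines (expected 1-3)")
    for line in lines:
        if len(line) > 32:
            problems.append(f"line too long ({len(line)} chars): {line!r}")
    if end - start < 1.0:
        problems.append(f"on screen only {end - start:.2f}s (minimum 1s)")
    return problems

# This frame breaks both the line-length and the duration rules.
print(check_frame(0.0, 0.8, "This caption line is definitely longer than 32 characters."))
```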
So there are two types of captioning styles: verbatim and clean read. For something like a scripted television show, you would want to use verbatim, because every single utterance is included in the script, the "uhs" and the restarts. For clean read, you might think of something like a lecture, where a professor might start a sentence incorrectly, correct themselves, and move on.
You'd likely want to use just the corrected part of the transcript, and consider that clean read.
There are also lots of ways to publish captions. The most common is as a sidecar caption file. That means you're using a video player like YouTube or Vimeo, you upload your video file, and alongside it you upload a caption file, and those two files associate so that you can toggle the captions on or off.
Encoded captions and open captions are both burned into the video. Encoded captions are really intended for offline video, something like a kiosk, where you would still be able to toggle the captions on and off, but they're actually part of the video file,
and then open captions are burned into the video and cannot be turned off. And then there are lots of workflows and integrations that really streamline this process so that you don't even have to worry about how you're publishing captions. Your captioning vendor and your video platform will work together to make sure that you are getting exactly the caption file format you need, associated with exactly the right video type you need. 3Play has over 30 different integrations with popular video platforms, and it makes it very seamless to just select your video for captioning in your video platform and then have the captions posted directly back to your video when they're complete.
Okay, moving on to live captions.
So live captions are also called real-time captions. These are captions that are delivered in real time for live events, live broadcasts, and meetings.
These can be delivered either by automatic captioning or by a human captioner using either voice writing or stenography, and you should expect some slight latency with live captioning, because of the time that it takes for the captioner or the speech recognition to access the audio in real time and deliver the captions back in as close to real time as possible.
So if you remember the FCC quality standards, those apply, according to the FCC, to both live and recorded captioning, and synchronicity was one of the requirements. They have more leniency for live captioning in their understanding of how those standards should be deployed, because it is virtually impossible to get perfect synchronicity in a live environment.

Some terms to know, these are terms and acronyms that we use in the live captioning space: LPC refers to live professional captioning, which is captioning by a human captioning professional, either via stenography or via voice writing. CART is Communication Access Real-time Translation; this is often what you refer to in higher education, when you have a person providing captioning in person in a classroom. And ASR is automated speech recognition, which we've referred to a few times; this is available for both recorded and live captioning.

And caption quality matters for live as well. As I said, there's some leniency on quality for live because of the environment, whereas in recorded there's no excuse for imperfection, because someone has time to review a file multiple times and make sure that it's virtually flawless. In the live experience, you should expect a range of accuracy depending on whether you're using ASR or humans. For ASR only, you're gonna end up in the 80% range, depending on what the audio quality is like. So, for example, if you are speaking into a microphone and it's a single speaker, you might have higher accuracy with ASR, whereas if you are trying to use ASR to caption a basketball game, with tons of sound around, lots of speakers, lots of crowd noise, that will lead to lower accuracy.

For humans, you want to target the 95% to 99% accuracy range; 98% is the target under the FCC. And you want to target an average latency of three to five seconds. That, as I said, is just a necessary part of the live environment, but you don't want to end up in a situation where you have, for example, 30 seconds of latency between when the word is spoken at the event and when the captions appear. I do want to note, when we talk about accuracy and caption quality for live captioning, there are multiple ways of measuring this. When I mentioned the accuracy rates of 95% to 99%, or the 80% range for ASR, I was using word error rate, and that is, as I mentioned in the recorded section, a measure of word-for-word accuracy.
There is another measure in live captioning that you might hear, which is NER. This is more of a meaning-for-meaning quality score. I wouldn't call it an accuracy rate, because it is really a score of how critical the errors are, versus an actual rate of the word errors. With NER, it scores based on whether an error is critical to the meaning of the speech, but it allows for things like summarization and subjectivity in determining what counts as a critical error.

So you just want to be really, really careful when you're hearing vendors talk about accuracy rates. If they're promising accuracy for live captioning, you want to make sure that you understand what they're talking about. Particularly if it's ASR, they're probably referring to NER, which can go along with a much lower word error rate; you might have a 99% NER score but, you know, an 87% word-for-word accuracy. So these are just really important things to understand and question your vendors on.
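For reference, the NER model is usually written up along the following lines. This formula and the severity weights are the ones commonly published for the NER model, not something quoted in this session.

```latex
% NER score: N = number of words, E = edition errors, R = recognition
% errors, with each error weighted by severity (commonly 0.25 minor,
% 0.5 standard, 1.0 serious) before being subtracted.
\[
  \mathrm{NER} \;=\; \frac{N - E - R}{N} \times 100\%
\]
```

Because errors are weighted by how much they damage the meaning, a transcript can score far higher on NER than on word-for-word accuracy, which is exactly the gap described above.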
Some best practices for live captioning: I already mentioned good-quality audio and little to no background noise. You also want a strong network connection, so that your captioner stays connected and in real time with you. And then single speakers, or non-interrupting speakers with clear speech and pronunciation, are some other best practices here.

Moving on to the benefits of captioning: there are many benefits of closed captioning; I could talk about it all day. Obviously, accessibility is the first and most critical benefit. There are 48 million Americans with hearing loss, and closed captioning is a critical accommodation for these people.
Beyond that, closed captions help with SEO, or search engine optimization. If you think about Google reading an article on a page, it can surface all of the keywords in that article in a search. But Google can't watch a video unless there's a transcript associated with it. So by adding captions and transcripts to video, you're really allowing search engines to crawl the entire contents of your video, and that really helps brands with greater traffic and visibility.
Particularly on platforms like YouTube or social media, these will help your video be found for a much broader range of keywords.
Captions also help improve brand recall, verbal memory, and behavioral intent; this was a study by Verizon. And there have also been some interesting studies by Oregon State and the University of South Florida St. Petersburg showing that captions help with comprehension, focus, and engagement for students, and certainly for broader audiences. So 98% of students find captions helpful, regardless of whether they are Deaf or hard of hearing,
65% of students use captions to help them focus, and 92% of users watching videos on mobile do so with no sound. So if you don't have captions on those videos, they're going to be inaccessible to a much broader audience.
I also mentioned there are several accessibility laws that impact the need to caption. The first is the Rehabilitation Act of 1973. There are two sections here, Section 504 and Section 508, which apply to Federal and federally funded programs, requiring broad equal access,
which means equal access for Deaf and hard-of-hearing viewers anytime there's video involved. The Americans with Disabilities Act was written in 1990, and Title II and Title III apply to
public entities and places of public accommodation.
And there has been a lot of litigation under Title III, places of public accommodation, in terms of whether the web should be included as a place of public accommodation. 1990 was well before the Internet was as prolific as it is today. Virtually all litigation comes out with a consent decree or a decision that the ADA does apply to the web, and therefore closed captioning is required for video
for any organization that's providing a public accommodation. This covers certainly higher education, but also businesses, medical spaces, health care, financial services, etc. So it's extremely far-reaching and has been broadly applied to the web, and it requires closed captioning for video.
The CVAA is very specific to the media space, and it requires closed captioning on any streaming media that previously appeared on television with captions. So, for example, if you're a Bachelor fan and you're watching it on Hulu, it appeared on ABC with captions on broadcast, and so it has to be captioned on Hulu.
And then the FCC has very strict caption requirements for all network television, certainly for recorded and live broadcast,
and they have caption quality standards, as I mentioned earlier, that apply not only to broadcast but also to the CVAA requirements.
There are also the Web Content Accessibility Guidelines, or WCAG. Most of the accessibility laws reference WCAG as the achievement criteria that organizations should shoot for in order to be accessible. So while the law may say you must provide equal access, it will then point to WCAG to say exactly what that looks like.
For video, under Level A, which is the easiest to achieve, WCAG requires transcripts for audio-only content, captions for video with audio, and text alternatives or audio description. Under Level AA, it requires captions for both prerecorded and live content, and audio description for prerecorded video. And under Level AAA, it requires extended audio description, low or no background audio, the ability to navigate by keyboard alone, and sign language for video.
I mentioned at the beginning that we do some research on speech recognition every year. I want to cover this a little bit, because everyone's very, very interested in when, and if, speech recognition will be sufficient for closed captioning. So we have a vested interest in how the speech recognition space is evolving and how close it's getting to being good enough for captioning. Spoiler alert: it is definitely not good enough for captioning today, although there have been advancements, and I'll talk through that a little bit.
So this report is one that we've put out every year for the last five years or so. We review the top 8 to 10 speech recognition engines and how they perform for the task of captioning and transcription specifically. And we test this for word error rate, which I've mentioned a few times, and for formatted error rate.
Formatted error rate incorporates the word-for-word requirements of the WER score, but it also includes things like punctuation, speaker identification, and non-speech elements, those things that are critical to the perceived accuracy of a caption file and to some of the requirements of captions that I talked about at the very beginning.
And our goal is really to make sure that we're staying on top of what's changing in the industry, that, as I said, we're using the best of the best ASR, and that we understand what is still required to reach that 99-plus accuracy that's required for an accommodation and for legal requirements.
So just to dig into this a little bit more: when we say 99% accurate captions, that's the word error rate. It means that one in every 100 words is incorrect.
You would see a lower percentage accuracy if you looked at that with formatted error rate, because it would also identify errors like grammar, punctuation, and missing speaker labels, and all of that will degrade the accuracy.
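As a concrete illustration of the word error rate being described here, this is a minimal sketch of the standard edit-distance definition: substitutions, insertions, and deletions against a reference transcript, divided by the reference word count. It is not 3Play's measurement pipeline, just the textbook formula.

```python
def wer(reference: str, hypothesis: str) -> float:
    # Levenshtein distance over words, via dynamic programming.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of eight: WER 0.125, i.e. 87.5% word accuracy.
print(wer("captions should be accurate complete and in sync",
          "captions should be accurate complete and in sink"))
```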
There are a lot of common causes of ASR errors. Certainly, all of the things that I talked about for live captioning are true for recorded captioning as well: the more you degrade your audio quality, the more compounding inaccuracies you'll have in your ASR track. So multiple speakers, overlapping speech, background noise, poor audio quality, false starts, acoustic errors, and function words are all really common sources of ASR errors. And then with formatting errors, you're gonna see a lot of challenges with ASR in terms of the limitations it still has around adding speaker labels, adding non-speech tags, punctuation, grammar, numbers, etc. And then, of course, anything that is not in a dictionary yet: there's lots of proprietary terminology, proper nouns, and names that may or may not be in an ASR dictionary, and that will make it difficult to end up with that higher accuracy.
So in this past year's report, we looked at 10 different engines.
This is a measure of word error rate and formatted error rate by vendor. That top-line item, SMX plus 3Play, is Speechmatics, which is one of the top vendors; in order, Assembly AI and Speechmatics performed the best.
The SMX-plus-3Play line is some post-processing that we do on top of the ASR with our own modeling, to improve the output by around 10%, and that same modeling could be applied to any of these vendors. We've been doing captioning since 2008, so we have millions and millions and millions of corrected words because of our process, and we model all of that on top of any speech recognition that we use to produce captioning.
So even at the very, very best word error rate, you're still looking at under 93% accuracy with speech recognition. All the way down at the bottom of the list, you have IBM and Google with a 25% to 28% error rate, which gets you to 72% to 75% accuracy.
There are lots of different reasons for this, depending on the purpose of the speech recognition. Something like Speechmatics is intended to be used for captioning and transcription. Something like Google's may be intended for a personal assistant, so it's training its model on being the best personal assistant that it can be. The things you do with a personal assistant are different: it can ask for clarification, and you're usually a single speaker talking right into your phone, for example, saying, "Hey Google, what's the weather?" (Oops, that just set off my Google Pixel.) Those are very, very different ways to train a model, and you'll likely see something like Google perform worse for ASR transcription, because it's being trained for a different purpose.
The formatted error rate is coming in about 10 percentage points lower across the board. And this, as I said, is the experienced accuracy. As humans, when we're reading, we experience the poor grammar and the poor punctuation, and our ability to read effectively is dramatically impacted by poor formatting. So if you consider that 17% formatted error rate, we're really looking at more like 82% accuracy when you're including formatting, and that is well below what anyone would expect in terms of an accuracy rate that is acceptable for captioning. So we still have a long way to go in terms of speech recognition alone being enough, and the longest way to go is with formatting errors.
I did want to break this out by industry, because I think it's representative of where there are easier and harder spaces.
I've included multiple industries here, and this is an average of the top four vendors. The average word error rate and formatted error rate for e-learning performs the best. E-learning is heavily produced, high-quality audio with a very specific dictionary, so it makes sense that that content performs really well.
Even still, you're only at about 96%, a little under 96%, accuracy by word error rate, and a little under 87% by formatted error rate.
Where you see really, really challenging content for word error rate and formatted error rate is in things like cinematic content, which is extremely complex material, and news, where there's a ton of background noise and a lot of overlapping speech.
Those are the instances where speech recognition performs the worst.
So some of our key findings here: there are some new entrants emerging. Two years ago, Whisper and Assembly AI, which are in the top four now, didn't exist. So they've emerged, and they've emerged really strong; there are some exciting companies out there.
The results are heavily dependent on the audio quality and the content difficulty, so the source material really matters. And the use case matters: as I mentioned, engines are ultimately being trained for different use cases, and you want to make sure, if you're looking at them for captioning, that you're looking at engines that are being trained for captioning and not for another purpose, like personal assistants.
We also saw evidence of hallucinations. So in Microsoft's Whisper product... sorry, OpenAI's Whisper product, there were hallucinations where, at a sudden topic change, for example a news segment where there was a weather report that then transitioned to a lifestyle report, the speech recognition assumed that the original topic would continue and hallucinated content even after the topic had changed. So it started transcribing basically fake weather reports to continue the topic that it expected, versus transitioning with the topic change as it happened in the audio track. Hallucinations are extremely dangerous for accessibility. Think about the transition from a weather report to an emergency report: if the model keeps trying to hallucinate a weather report when instead there's an emergency alert about, say, a fire in your area, you would completely miss that because of the potential for hallucination there.
So errors add up really quickly. I mentioned formatted error rates being in the 80% to 85% range; at 85% accuracy, one in seven words is incorrect. That's really, really awful for accessibility.
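The "one in seven" figure is just arithmetic on the error rate, and easy to sanity-check: at a given word accuracy, roughly one word in every 1 / (1 - accuracy) is wrong.

```python
# At 99% accuracy an error lands about every 100 words; at 85%, every ~7.
for acc in (0.99, 0.93, 0.85):
    print(f"{acc:.0%} accurate -> about 1 error every {1 / (1 - acc):.0f} words")
```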
And it also performs worse on brand names and proprietary terminology, as I mentioned. This is a screenshot from a video that came out at the beginning of the pandemic that used auto captions, and it reads "I'm Jeff Blues, president" instead of "JetBlue's president." Because that wasn't in the dictionary, it's more likely to create an error there, and that creates a ton of brand risk and reputation risk.
Before I finish: we at 3Play love producing educational content around captioning and video accessibility. We have weekly blogs and tons of free content and webinars, as well as our annual access event coming up, with two days' worth of sessions from thought leaders and industry experts. So we hope to see you at some of those; you're welcome to sign up on our website.
And I know there are some questions coming in, so I will stop sharing my screen and we can go to questions.
Great.
So there's a question here: if closed captions are provided for a live event like a webinar, and the captions are of a high quality, should a standalone transcript also be provided?
That's a great question. I think that a lot of people enjoy accessing content in different ways. All that is required is for the closed captioning to be available, but many people will also provide a link to a live transcript so that the content can be accessed in multiple ways based on people's preferences. And it's always best practice to share a transcript of the webinar after the event is concluded.
So I would say, for the absolute best experience, provide as much accessibility as you can, including the transcript and the captions, and absolutely the transcript after the event.
Next question: we conduct breakout rooms for our live webinars. Do we need to provide captioners for each breakout room, or can we reconvene from breakouts and synthesize or restate what happened in each breakout room, so it is captured by the captioner?
This is a tough one. If there are people who are Deaf or hard of hearing who require accommodation in your breakout rooms, you should have a captioner in each breakout room. Otherwise, they will not be able to participate in that session, regardless of whether it's being synthesized afterwards.
So my recommendation is always to provide the access in those situations.
Next question: what is the accuracy threshold for live closed captions to be within the law?
We touched on this some; I'll just go back to it. The FCC in the US requires word error rate, or word-for-word accuracy, as close to equal as possible, and they tend to target 98% with that metric. That would mean that 2 out of every 100 words are wrong. This is something you want to be careful with when asking vendors whether they use word error rate or NER to measure that, because an NER score can correspond to a word-for-word accuracy lower than what's acceptable. Next question:
Can you provide insights into the comparative effectiveness of automated closed captions versus professionally generated closed captions? Have there been any recent white papers published, or research conducted, to provide a quantitative analysis of the precision and general quality of these two methods?
Wow, this must have come in before I talked about our State of ASR. We can provide a link to our full study in the chat, and we'll also have this year's report coming out in the next month or so.
That is the most comprehensive view of where automatic closed captions stand relative to professionally generated captions.
You know, at 3Play we have a measured accuracy rate of 99%, and that's what we use as the source of truth for a lot of this research, because we have, as I said, millions and millions of corrected ASR words.
We have a lot of confidence in that accuracy rate.
A couple of questions related to audio description: could you address how you manage audio descriptions,
and should you provide audio description?
Sure. So audio description is also required under all of the legislation that I covered in this presentation. It is a WCAG Level AA requirement, and WCAG Level AA is referenced in most litigation and most legislation in the US.
So audio description should be provided for any prerecorded video that you are publishing publicly or providing to people who are Blind or low vision and require that access.
Audio description is a secondary audio track that narrates the relevant visual information, the critical visual information, in a video, and in many cases it can be turned on or off by the viewer. You should be including audio description anytime there is critical visual information.
If a professor is drawing on the board and they are not describing exactly what they are drawing, then you're gonna need audio description for that.
And I think, last question here: do you have recommendations for organizations to effectively balance the need for accurate captions with considerations such as budget constraints and tight deadlines?
Absolutely. So my best advice is to prioritize your content.
Think about first what is publicly available and what has the most views and the most reach, and start there.
Then go through your backlog in order. You want to protect where you're most liable and also where you have the most brand risk.
You also want to prioritize where the most value in terms of accessibility is. I also always recommend thinking about who in your organization needs this: not just your team, but across your entire organization. Where are the needs for captioning, and how can you work together to get a better understanding of the volume your organization needs? Because the more you can centralize and work together to make an assessment of the volume of captioning you're going to need,
the more your vendor's rates are gonna come down. Most vendors have volume pre-purchase options or volume estimate options, where if you're doing one hour of video versus a much larger volume, you're gonna have very different rates. And usually you need to work together across your organization to get those benefits.
Great. I think that's all that we have time for. Thank you, everyone, for joining us today and for asking great questions. I'd also like to thank our 3Play captioner for making this session accessible.
And I hope you enjoyed our presentation today. We were really thrilled to be able to work with Accessibility.com on this. If you have any questions about today's presentation, please don't hesitate to reach out.
Thanks again, everyone. And I hope you have a great rest of your day.