How It’s Done: Automated Open Captioning

The Americans with Disabilities Act (ADA), first passed in 1990, was amended in 2008 by the ADA Amendments Act (ADAAA) to broaden protections across employment practices and agencies, state and local governments, labor unions, and private entities of public accommodation. The broadened definition of disability introduced a greater need for captioning services in every aspect of the public sphere.

The production and transmission of live captioning have long been challenged by high costs, limited availability, variable latency, and inconsistent accuracy. These challenges are often tied to reliance on live stenographers. While these highly trained professionals do excellent work, their specialized skill sets command high hourly rates. Availability is also a challenge, particularly when stenographers are needed on short notice.

Perfection is impossible at the speed of live captioning, and the transition to more automated, software-defined captioning workflows has introduced its own challenges. While automatic speech recognition removes the cost and staffing concerns of manual captioning, the performance of servers and processors has constrained both accuracy and latency. These issues are magnified for facilities that must offer captioning to remain in ADA compliance and to better serve the needs of diverse audiences:

· An increased need for live, legible captioning for lectures, seminars, presentations, and meetings
· An exponential increase in the volume of video to be captioned, including libraries and archives
· The need to produce recorded versions for distribution over digital and mobile platforms
· Serving audiences who speak another primary language, particularly in education and houses of worship


The Rise of Open Captioning

Closed captioning is traditionally associated with broadcast TV, with government mandates existing worldwide to ensure that deaf and hearing-impaired viewers can fully understand and enjoy on-air programming. Closed captions, simply put, are encoded within the video stream and decoded by the TV, set-top box or other viewing/receiving device.

Open captioning is more common in commercial AV. Open captions are overlaid on the video of an event such as a meeting or seminar; in these scenarios, the captioning crawl appears at the bottom of a display screen alongside presentation content. In live lecture and presentation environments, open captions serve not only the hearing-impaired, but also audiences for whom the content is presented in a secondary language.

The speed and accuracy of speech-to-text conversion continues to improve with advances in deep neural networks. The statistical algorithms behind these advances, coupled with larger multilingual databases to mine, more effectively interpret, and accurately spell out, the speech coming through the microphone. That audio from the microphone is processed by a standalone system consisting of a specialized speech-to-text engine, which instantaneously outputs the speech as text on a display. This is typically the ideal open captioning architecture for individual classrooms, conference/meeting rooms, and auditoriums.
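As a rough illustration of this standalone, in-room architecture, the sketch below captures microphone audio and converts it to caption text locally. The article does not name a specific engine; the open-source Vosk recognizer and the sounddevice library are used here only as stand-ins, and the model path, sample rate, and block size are assumptions.

```python
# Minimal sketch of an in-room, standalone speech-to-text loop.
# Vosk and sounddevice are stand-ins for an unnamed captioning engine;
# the model path, sample rate, and block size are assumptions.
import json
import queue

import sounddevice as sd
from vosk import Model, KaldiRecognizer

SAMPLE_RATE = 16000
audio_queue = queue.Queue()


def on_audio(indata, frames, time_info, status):
    """Push raw microphone samples onto a queue for the recognizer."""
    audio_queue.put(bytes(indata))


model = Model("model")                      # path to a downloaded Vosk model (assumption)
recognizer = KaldiRecognizer(model, SAMPLE_RATE)

# Open the room microphone and convert speech to caption text as it arrives.
with sd.RawInputStream(samplerate=SAMPLE_RATE, blocksize=8000,
                       dtype="int16", channels=1, callback=on_audio):
    while True:
        chunk = audio_queue.get()
        if recognizer.AcceptWaveform(chunk):
            caption = json.loads(recognizer.Result()).get("text", "")
            if caption:
                print(caption)              # in practice, render as an overlay on the room display
```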

An alternative workflow can more efficiently deliver open captions to potentially hundreds of rooms on a campus. In this case, a dedicated device in each room receives the audio and sends it to a private or public cloud server, which instantaneously returns the speech as caption data to the device in each space. That device then composites the captions with the video on the room's display. This approach offers a highly scalable architecture, with a dedicated device serving each space. Meanwhile, faster and more powerful processing within the captioning engines has reduced latency to near real time.
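A minimal sketch of the per-room device side of that cloud workflow might look like the following: stream audio chunks up to a cloud captioning service and receive caption text back for the local display. The endpoint URL, message framing, and the capture_audio_chunks() helper are hypothetical; real services define their own streaming protocols.

```python
# Sketch of the per-room device side of a cloud captioning workflow.
# The endpoint URL, message framing, and capture_audio_chunks() helper
# are hypothetical; real services define their own streaming protocols.
import asyncio

import websockets

CLOUD_ENDPOINT = "wss://captioning.example.com/rooms/lecture-hall-101"  # hypothetical


async def capture_audio_chunks():
    """Hypothetical placeholder: yield raw PCM chunks from the room's audio feed."""
    while True:
        yield b"\x00" * 3200        # stand-in for 100 ms of 16 kHz, 16-bit mono audio
        await asyncio.sleep(0.1)


async def stream_captions():
    async with websockets.connect(CLOUD_ENDPOINT) as ws:
        async for chunk in capture_audio_chunks():
            await ws.send(chunk)            # upstream: room audio to the cloud recognizer
            caption = await ws.recv()       # downstream: caption text for this room
            print(caption)                  # in practice, composite over the room's video feed


asyncio.run(stream_captions())
```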

New Efficiencies

As speech-to-text conversion has grown faster and more reliable, feature sets have also expanded. One recent innovation is multi-speaker identification, which isolates separate microphone feeds to reduce confusion from cross-talk.
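The idea can be illustrated with a short sketch: keep each microphone feed on its own recognition path so captions can be labeled by speaker rather than blended together. The speaker map and the transcribe_channel() helper below are hypothetical stand-ins, not a specific product's API.

```python
# Rough illustration of per-microphone caption labeling to reduce cross-talk.
# The speaker map and transcribe_channel() helper are hypothetical; a real
# system would run one recognizer instance per isolated microphone feed.
SPEAKERS = {0: "Moderator", 1: "Panelist A", 2: "Panelist B"}  # assumed mic-to-name mapping


def transcribe_channel(channel: int, audio_chunk: bytes) -> str:
    """Hypothetical stand-in for a per-channel speech-to-text call."""
    return "..."  # caption text recognized from this microphone only


def caption_frame(frames: dict[int, bytes]) -> list[str]:
    """Turn one frame of multi-microphone audio into labeled caption lines."""
    lines = []
    for channel, chunk in frames.items():
        text = transcribe_channel(channel, chunk)
        if text:
            lines.append(f"{SPEAKERS.get(channel, f'Mic {channel}')}: {text}")
    return lines
```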

Improvements in captioning technology have also kept pace with emerging needs, including networks tasked with captioning large libraries of pre-recorded content. As more systems move to software-defined platforms, the captioning workflow for pre-recorded and/or long-form content has been greatly simplified.

For corporate and higher education facilities that caption existing content for later presentation across various platforms, media operators can essentially drag and drop video files into a file-based workflow that extracts the audio track for text conversion. These files can then be delivered in various lengths and formats for digital signage, online learning and other platforms. Furthermore, the integration of custom dictionaries into the captioning software is growing in significance, ensuring that the unique terminology of each business is correctly recognized and presented. For example, the captioning dictionary can incorporate banking terminology for financial institutions and religious terminology for houses of worship.
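A simplified sketch of that file-based pass, under stated assumptions, is shown below: a dropped video file has its audio track extracted with ffmpeg, the audio is transcribed, and a domain dictionary is applied so industry-specific terms come out spelled correctly. The transcribe_audio() helper, the watch folder, and the dictionary entries are hypothetical placeholders for the captioning engine and its term list.

```python
# Sketch of a file-based captioning pass: extract audio, transcribe, and
# apply a domain dictionary. transcribe_audio(), the watch folder, and the
# dictionary entries are hypothetical stand-ins for a real captioning engine.
import subprocess
from pathlib import Path

# Assumed domain dictionary: common misrecognitions mapped to preferred terms.
DOMAIN_DICTIONARY = {
    "a m l": "AML",              # anti-money-laundering, for a banking deployment
    "you cursed": "Eucharist",   # for a house-of-worship deployment
}


def extract_audio(video: Path) -> Path:
    """Pull a mono 16 kHz WAV track out of a dropped video file with ffmpeg."""
    wav = video.with_suffix(".wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(video), "-vn", "-ac", "1", "-ar", "16000", str(wav)],
        check=True,
    )
    return wav


def transcribe_audio(wav: Path) -> str:
    """Hypothetical stand-in for the speech-to-text engine's file mode."""
    return "raw transcript text"


def apply_dictionary(text: str) -> str:
    """Replace misrecognized phrases with the organization's preferred terms."""
    for wrong, right in DOMAIN_DICTIONARY.items():
        text = text.replace(wrong, right)
    return text


for dropped in Path("watch_folder").glob("*.mp4"):   # assumed drag-and-drop folder
    captions = apply_dictionary(transcribe_audio(extract_audio(dropped)))
    print(captions)
```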

Regardless of how many bells and whistles are needed, the overarching business benefit of updating the captioning workflow remains clear: cost reduction opportunities multiply as these systems move to software-defined platforms. In some cases, these systems are also offered as SaaS platforms. The options continue to diversify as the technology evolves.