Can captions be generated automatically using speech recognition?

The audio content of multimedia presentations is inaccessible to people who are unable to hear. If there is content presented audibly, the accessibility solution is captioning that provides a synchronized text alternative to the audio track. For additional general information about captioning, see How do I make multimedia accessible?

Many organizations produce large quantities of video for distance learning, outreach, marketing, and other functions, and a growing number of institutions are turning to multimedia to enhance their Web-based curricula. The cost of captioning all this content has many institutions concerned and exploring their options. Many outsource captioning on an as-needed basis, but they must take care to ensure they receive the accessible media in a timely fashion; prompt turnaround often costs extra. Other institutions are developing the expertise to provide captioning in-house.

Researchers continue to explore options for automating portions of the captioning process. Some organizations are using products or services that utilize some degree of automated captioning.

The best-case scenario would be fully automated captioning using speech recognition technology. Unfortunately, current technology is not accurate enough to fully support this approach. However, research and development toward this goal has been fueled by a rapidly growing market for video search and archival systems. In order to archive and index digital multimedia so that users can search its content, at least a portion of that content needs to be text-based. The first company to utilize speech recognition in this market is Virage®, whose VideoLogger™ application uses speech recognition to capture text from a video, which it then uses to build a structured searchable index. However, because of the accuracy limitations of speech recognition, this tool cannot yet be used to generate entire caption tracks; it is used instead to extract sets of keywords, including only those words that the software can interpret with a high level of confidence.
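The keyword-extraction idea described above can be illustrated with a minimal sketch. It assumes a recognizer that reports a confidence score for each word; the sample data, the `build_keyword_index` function, and the 0.85 threshold are hypothetical and are not taken from VideoLogger or any other product.

```python
from collections import defaultdict

# Hypothetical recognizer output: (word, start time in seconds, confidence 0..1).
RECOGNIZED = [
    ("welcome", 0.4, 0.95),
    ("to", 0.9, 0.55),
    ("photosynthesis", 1.2, 0.91),
    ("lecture", 2.1, 0.42),
    ("chlorophyll", 5.0, 0.88),
    ("photosynthesis", 9.3, 0.97),
]


def build_keyword_index(words, min_confidence=0.85):
    """Map each confidently recognized word to the times it was spoken.

    Words below min_confidence are dropped, so the result is a sparse
    keyword index rather than a complete transcript.
    """
    index = defaultdict(list)
    for word, start, confidence in words:
        if confidence >= min_confidence:
            index[word.lower()].append(start)
    return dict(index)


keyword_index = build_keyword_index(RECOGNIZED)
```

Here `keyword_index["photosynthesis"]` lists both times the word was recognized, while low-confidence words such as "lecture" are left out entirely. That gap is why an index like this can support search but cannot serve as a caption track.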

A number of other companies have entered the market and are currently at various stages in developing semi-automated solutions. Perhaps most notable among these is Nuance® Dragon AudioMining™, from the maker of popular speech recognition product Dragon NaturallySpeaking®.

The greatest limitation of speech recognition technology is that it is accurate only under optimal conditions: the speaker has devoted time to training the software to recognize his or her speech patterns, the audio quality and the acoustics of the recording environment are excellent, and distracting background noise is minimal. Few multimedia presentations meet these criteria. While fully automated captioning may not currently be possible, speech recognition can still play a significant role both in producing a transcript and in creating captions from an existing transcript.

The first step in captioning multimedia is creating a transcript of the audio content. Speech recognition technology has become a widely used tool for transcriptionists. In a process called shadow speaking, the transcriptionist (who has trained the speech recognition software to understand his or her speech) simply speaks along with the audio, repeating what the speaker is saying. Products like the CPC-500 Voice Captioner™ support the shadow speaking process for real-time captioning of live events. Transcriptionists who are creating transcripts to be converted into captions will typically use an off-the-shelf speech recognition product such as Dragon NaturallySpeaking.

If a transcript already exists, products or services like CaptionSync™ by Automatic Sync Technologies can effectively use speech recognition to create captions from the existing transcript. This is possible, whereas fully automated captioning is not, because the speech recognition engine only needs to identify when a known word or phrase was spoken, which is a much easier task than identifying what was spoken. CaptionSync is provided as a web-based service, where customers upload a video file and transcript, and within minutes receive a caption file via email.
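Once each transcript word has a known start and end time, turning the transcript into captions is largely a matter of grouping and formatting. The sketch below assumes the alignment step has already been done by some speech recognition engine; the `ALIGNED` data, the function names, and the 24-character line limit are hypothetical, and the output uses the common SubRip (SRT) layout only as an illustration. It does not describe how CaptionSync works internally.

```python
# Hypothetical aligner output: (transcript word, start seconds, end seconds).
ALIGNED = [
    ("Plants", 0.0, 0.5),
    ("convert", 0.5, 1.0),
    ("sunlight", 1.0, 1.6),
    ("into", 1.6, 1.9),
    ("chemical", 1.9, 2.5),
    ("energy.", 2.5, 3.1),
]


def format_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def build_cues(aligned_words, max_chars=32):
    """Group timed words into cues of at most max_chars characters.

    Each cue is (start, end, text), spanning from its first word's start
    to its last word's end.
    """
    groups, current = [], []
    for word, start, end in aligned_words:
        candidate = " ".join([w for w, _, _ in current] + [word])
        if current and len(candidate) > max_chars:
            groups.append(current)
            current = []
        current.append((word, start, end))
    if current:
        groups.append(current)
    return [(g[0][1], g[-1][2], " ".join(w for w, _, _ in g)) for g in groups]


def to_srt(cues):
    """Render cues as SRT text: index, time range, caption, blank line."""
    blocks = []
    for number, (start, end, text) in enumerate(cues, start=1):
        time_range = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{number}\n{time_range}\n{text}\n")
    return "\n".join(blocks)


cues = build_cues(ALIGNED, max_chars=24)
srt_text = to_srt(cues)
```

With the sample data, the six words are split into two cues, each timed from the first to the last word it contains. A production tool would also need to handle line breaks at natural phrase boundaries, minimum display durations, and words the aligner could not place.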