
Automatic Speech Recognition and You

The Intern's Guide to Everything STT - Part 1
Corinne Hebestreit January 05, 2019

When I first started out at a new internship at Voci -- which, as you probably know by now, creates speech transcription software -- I was immediately introduced to terms I had never heard before. I needed to know things like “API” and “AI”, and the difference between a “speech browser” and a “speech engine”. If you have heard these words bouncing around and also don’t know what they mean, you have come to the right place!

Let’s start with the term right in the title: Automatic Speech Recognition, or ASR. ASR takes natural speech and converts it into text. This is also called Speech-to-Text, or STT. An ASR system is also called a speech engine. Whatever name you use, an ASR system can convert speech to text either from archived audio recordings, or while conversations are actually going on -- the latter is called real-time transcription.

There are two basic kinds of ASR system: phoneme-based and language-based or full-text. Both start with an acoustic model. An acoustic model teaches the speech engine the relationships between audio signals and phonemes. Now, a phoneme is the smallest spoken unit that distinguishes one word from another in a language. But not every sound that is made during speech is a phoneme. An acoustic model is what the speech engine needs to tell the difference between random or meaningless sounds, and phonemes. Phoneme-based ASR systems stop here -- all they can do is recognize phonemes. Full-text systems have two more parts.
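To make the phoneme idea concrete, here is a toy illustration (not a real acoustic model) using ARPAbet notation, a common way of writing American English phonemes. The word entries are illustrative, not taken from any particular engine:

```python
# Toy illustration of phonemes: each word is a sequence of phonemes,
# written here in ARPAbet notation. This is NOT an acoustic model --
# just a picture of what "smallest unit that distinguishes words" means.
phonemes = {
    "cat": ["K", "AE", "T"],
    "bat": ["B", "AE", "T"],
}

# "cat" and "bat" differ by exactly one phoneme -- that single sound
# is what distinguishes the two words.
diff = [(a, b) for a, b in zip(phonemes["cat"], phonemes["bat"]) if a != b]
print(diff)  # [('K', 'B')]
```

An acoustic model's job is the hard part that this sketch skips: deciding, from raw audio, which of these phonemes was actually spoken.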

First, a full-text speech engine needs a language model. A language model helps determine which words are most likely to follow a sequence of already recognized words. Without this, the speech engine would be trying to recognize each word individually. A language model helps the speech engine use context to figure out the word it’s hearing.
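The simplest version of this idea is a bigram model: count, in some training text, how often each word follows the previous one. Here is a minimal sketch with an invented toy corpus -- real language models are trained on vastly more text, but the principle is the same:

```python
from collections import Counter

# Toy bigram language model: estimate how likely each word is to
# follow the previous one, from a tiny made-up training corpus.
corpus = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def next_word_probability(prev, word):
    """P(word | prev), estimated from bigram counts."""
    return bigrams[(prev, word)] / unigrams[prev]

# After hearing "the", the model prefers "cat" over "mat":
print(next_word_probability("the", "cat"))  # 2/3
print(next_word_probability("the", "mat"))  # 1/3
```

When the audio is ambiguous, the engine can use these probabilities to break the tie in favor of the word that makes sense in context.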

Second, a full-text speech engine needs a dictionary that contains the utterances that the computer is able to process when spoken by humans. Utterances are unbroken streams of speech that are a basic unit of recognizing and transcribing speech. These are words or phrases, and more than just phonemes! Full-text ASR systems now recognize utterances with ever-increasing accuracy, and the more utterances that are in the dictionary, the better they are at transcribing.
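As a rough picture of how a dictionary is used, here is a toy sketch in the spirit of the CMU Pronouncing Dictionary: each entry maps a phoneme sequence to a word, and the engine matches recognized phonemes against those entries. The entries and the greedy matching strategy are invented for illustration; real engines are far more sophisticated:

```python
# Toy pronunciation dictionary: phoneme sequences -> words.
# Entries are illustrative only.
dictionary = {
    ("D", "EY", "T"): "date",
    ("AH", "V"): "of",
    ("B", "ER", "TH"): "birth",
}

def transcribe(phoneme_stream):
    """Greedily match recognized phonemes against dictionary entries."""
    words, i = [], 0
    while i < len(phoneme_stream):
        # Try the longest possible match first.
        for length in range(len(phoneme_stream) - i, 0, -1):
            chunk = tuple(phoneme_stream[i:i + length])
            if chunk in dictionary:
                words.append(dictionary[chunk])
                i += length
                break
        else:
            i += 1  # skip phonemes that match no entry
    return " ".join(words)

print(transcribe(["D", "EY", "T", "AH", "V", "B", "ER", "TH"]))
# date of birth
```

Notice that a phoneme sequence missing from the dictionary simply can't be transcribed -- which is why a bigger dictionary means better transcription.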

You sometimes hear the acronym LVCSR being used when talking about dictionaries and speech to text. LVCSR stands for Large Vocabulary Continuous Speech Recognition. LVCSR compares the audio data to a very large set of words, in order to recognize the utterances. LVCSR is thus able to deliver a transcription with recognizable words and sentences.

Another similar acronym is NLP, which stands for Natural Language Processing. Natural language processing is a field of computer science that relates to how computers interact with natural languages. Or, putting it another way, how computers can process and analyze speech.

Improving transcription accuracy is essential to gaining insight from transcriptions that you collect. Many different factors can negatively affect the accuracy of a transcription, such as a call where one speaker is in a noisy environment, or where the phone connection is poor. Tuning is the name given to the process of training or adding to your speech engine in order to produce more accurate transcripts. There are lots of different ways to do this. Hinting enables ASR systems to automatically avoid errors by providing hints as to what is likely to be said. For example, the system can be given a set of product names. Substitution is used to automatically replace words which are transcribed consistently but incorrectly, such as “data birth” instead of “date of birth”. Custom language modelling trains the software to learn a whole range of new terms and pronunciations, and is based on a set of actual audio recordings from the business.
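Of those techniques, substitution is the easiest to picture. A minimal sketch, assuming a simple find-and-replace pass over the finished transcript (the substitution table here is invented for illustration, not Voci's actual mechanism):

```python
# Toy substitution pass: replace phrases that the engine consistently
# gets wrong with the intended text. Table entries are made up.
substitutions = {
    "data birth": "date of birth",
}

def apply_substitutions(transcript):
    """Apply every known correction to a transcript, in order."""
    for wrong, right in substitutions.items():
        transcript = transcript.replace(wrong, right)
    return transcript

print(apply_substitutions("please confirm your data birth"))
# please confirm your date of birth
```

Because the engine's mistakes are consistent, a small table of corrections like this can fix a surprising number of errors across thousands of calls.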

One more term that gets mentioned when it comes to recording speech: single audio channel. This means that, when a phone call is recorded for transcription, a call agent and customer are recorded on the same audio channel. Multi-channel is the opposite: agent and customer are both recorded on separate channels, and then combined into an audio file. (These are more commonly known as mono audio and stereo audio, respectively.) It is more difficult to transcribe audio where speakers must be identified first, as is the case when mono audio contains multiple speakers.
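The reason stereo is easier to transcribe comes down to how the samples are stored. In a stereo recording the two channels' samples are interleaved -- left, right, left, right -- so pulling the speakers apart is trivial. Here is a tiny sketch with made-up sample values:

```python
# Stereo audio stores the two channels' samples interleaved:
# agent, customer, agent, customer, ...
# (Sample values here are invented for illustration.)
interleaved = [10, 90, 11, 91, 12, 92]

# De-interleaving with simple slicing recovers one speaker per channel.
agent = interleaved[0::2]
customer = interleaved[1::2]

print(agent)     # [10, 11, 12]
print(customer)  # [90, 91, 92]
```

With mono audio there is no such shortcut: both voices are mixed into one stream of samples, and the engine has to figure out who is speaking before it can attribute the words.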

That’s probably enough explanation for now! If you’re interested in learning more about how this all works, click here for Part 2!

This blog is part of a series!



Corinne Hebestreit

Corinne's free time consists of boating on Pittsburgh's rivers in the summer months, and keeping her nose in a book no matter what time of year it is. She also uses her love of coffee as a valid excuse to avoid getting enough sleep.
