When it comes to building software to do things that humans do easily and naturally, we run into challenges. Image recognition, self-driving vehicles and natural language processing (NLP) are three prominent examples that technology companies are currently working on, with varying degrees of success. The challenges arise because these tasks involve many features and complexities that we normally don’t even notice. We just do them.
Take language, which is our focus here at Voci. We humans just speak, write, read and listen, and it all works (most of the time). We can understand each other, discuss issues, solve problems, and we can do it all pretty much on instinct. That is a tall order if you really think about it. Conversations between humans are, for the most part, free-form without strict rules, and involve many choices that could make a difference in someone’s life.
When we try to bring technology, especially artificial intelligence (AI), into the picture, we can’t rely on instinct any more. By any current measure, today’s software has no native instincts, and so we have to be much more conscious of how language is interpreted and acted upon. Deep learning and other advances in AI are fantastic technology, but they don’t work unless you have good training data to start with, and a lot of it. And when we are trying to use deep learning to go from speech to text, we have to account for as many of the features of our initial training data as possible, such as syntax and semantics. Otherwise, the transcription just won’t be right.
One important feature of language, and of conversations in particular, is context. Using a language successfully and correctly requires understanding context. When we listen to someone speak or read what someone has written, we are drawing on other information all the time, particularly information from the context of the situation.
For example, we use social context and cues to figure out what words are (probably) coming next. This is how we’re able to interrupt someone while they are speaking, even if the other person hasn’t actually finished yet. We anticipate what words are coming based on our understanding of how other people think, what they want, and why they are engaging with us — and all of this comes from our understanding of the social contexts in which words are used, including the specific context of the conversation we’re having right now.
This is part of what makes automatic speech recognition (ASR) challenging. Manufacturers of home devices, like Alexa or Google Home, get around it by limiting the speech their software has to deal with: just simple, direct instructions. If software systems are intended to handle broader and more complex circumstances — such as phone call-based conversations, call center quality assurance, or customer experience management — then they need to be correspondingly more sophisticated.
ASR software that’s built for call centers, for example, can’t just recognize words one-by-one and push them out into a transcript. This would inevitably lead to mistranscriptions. “Their”, “there” and “they’re” are all used in completely different contexts, and humans who are fluent in English know which is which. But they all sound exactly the same. Without contextual information, an ASR engine has no way to choose among them and will often pick the wrong one.
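To make the homophone problem concrete, here is a minimal sketch of how neighboring words can break the tie. It is a toy illustration, not how any production ASR engine (ours included) actually works: the six training sentences are invented, and the names `train_bigrams` and `pick_homophone` are hypothetical. The idea is simply to count which word pairs appear in training text and prefer the spelling whose neighbors have been seen before.

```python
from collections import Counter

# Invented toy training text; a real language model needs far more data.
CORPUS = [
    "they're going to call back later",
    "their account was closed yesterday",
    "the invoice is over there on the desk",
    "i think they're waiting on hold",
    "we updated their billing address",
    "please leave it there for now",
]

# Words that sound identical, so audio alone cannot separate them.
HOMOPHONES = {"their", "there", "they're"}


def train_bigrams(sentences):
    """Count adjacent word pairs, with sentence start and end markers."""
    counts = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        counts.update(zip(words, words[1:]))
    return counts


def pick_homophone(prev_word, next_word, counts):
    """Pick the spelling whose left and right neighbors were seen most often."""
    def score(candidate):
        return counts[(prev_word, candidate)] + counts[(candidate, next_word)]
    # sorted() makes tie-breaking deterministic (alphabetical order).
    return max(sorted(HOMOPHONES), key=score)


COUNTS = train_bigrams(CORPUS)
print(pick_homophone("updated", "billing", COUNTS))  # their
print(pick_homophone("think", "waiting", COUNTS))    # they're
print(pick_homophone("it", "for", COUNTS))           # there
```

With no context at all, every candidate scores zero and the choice is arbitrary, which is exactly the situation a word-by-word recognizer is in.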
Similar problems emerge for slang terms, which are often regional or otherwise highly localized. Examples include “yinz”, which you can hear in the Pittsburgh area, and the Boston-area use of “wicked” to mean “excellent”. If the ASR engine is being used for a high volume of Pittsburgh- or Boston-area callers, it needs to be able to recognize and correctly transcribe these terms. If it’s being used in a different context, then these words probably aren’t important.
The big implication for ASR relates to the language models that the engine uses when transcribing speech. Language models have to be built to handle a particular language and dialect, such as US English. (For more details on how language models are built, Catherine’s blog is an excellent start.) Beyond that, language models also have to be built for a particular domain, such as call centers or voicemail, so that context can be introduced for better transcription accuracy.
The importance of context means that there is no one language model that’s going to work for everything. To obtain meaningful and accurate speech-to-text transcriptions, language models have to be built for the specific context in which they are going to be used. It does little good to use training data from recorded conversations at a McDonald’s drive-through to build a language model for an ASR engine that will be used for a telehealth service or an emergency services call center.
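One rough way to see the domain mismatch is to check how much of an utterance a model has never encountered. The sketch below is a toy under stated assumptions: both mini-corpora and the sample utterance are invented, `vocabulary` and `oov_rate` are hypothetical helper names, and a real evaluation would use held-out recordings and a measure such as word error rate. It only computes the share of out-of-vocabulary words for each domain.

```python
# Invented mini-corpora standing in for two very different domains.
DRIVE_THROUGH = [
    "i would like a large fries and a cola",
    "can i get two cheeseburgers please",
    "no ice in the drink please",
]
TELEHEALTH = [
    "i have had a fever and a sore throat since monday",
    "the doctor can refill your prescription today",
    "i need to talk to my doctor",
]


def vocabulary(sentences):
    """Collect the set of distinct words seen in the training sentences."""
    return {word for sentence in sentences for word in sentence.split()}


def oov_rate(utterance, vocab):
    """Fraction of words in the utterance that are missing from the vocabulary."""
    words = utterance.split()
    return sum(word not in vocab for word in words) / len(words)


utterance = "i need to refill my prescription"
drive_rate = oov_rate(utterance, vocabulary(DRIVE_THROUGH))
health_rate = oov_rate(utterance, vocabulary(TELEHEALTH))
print(f"drive-through OOV rate: {drive_rate:.2f}")
print(f"telehealth OOV rate: {health_rate:.2f}")
```

In this toy setup the drive-through vocabulary covers only one of the six words, while the telehealth vocabulary covers all of them. A model cannot use context it has never seen, so the first would be guessing at nearly the entire sentence.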