Analyzing Language to Train ASR Systems: Combining Linguistics with AI Technology
One of the biggest challenges that automated transcription faced, at least until the current machine learning/AI revolution, was the sheer complexity of language. Computing had to reach a point where it could deal with the evolving and diverging ways that we use language.
Even though software has caught up and automatic speech recognition (ASR) exists, language is still complex enough that human intervention remains an essential part of the transcription process.
My area of the business manages transcriptions, which includes working with transcription companies, to create the data used to build language and acoustic models and to train Voci’s ASR system.
Think of it like this. When you upload a photo to Facebook, it identifies (not always perfectly!) people in the photo, and tags them for you. The software which does that has to be given lots of examples of photos before it can function. And those photos all need to be labelled — first to identify which parts of the photos contain people, and second to identify which names go with which people.
Voci’s ASR system works in the same sort of way. Once the underlying acoustic and language models are built, and the system is trained, it can identify the words people are using, as well as other features, such as sentiment and emotion. But in order to build the models and train the system, you need labelled data.
That’s where my team and I come in. Our job is to ensure that the transcription data given to the teams that build the models and train the system is as accurate as possible. (We also work on specialized projects for our in-house research team, to support the development of new products and features. Sorry, I can’t talk about those in this blog!)
A linguistics education or linguistics training is incredibly helpful for doing this effectively. Anyone who speaks a language can get some kind of transcription of a conversation (in that language) down onto paper. There is a difference, though, between the kind of transcript that a speaker of the language produces and the kind of transcript needed for machine learning.
For example, we have to develop, and then follow and enforce, guidelines on when a transcript should use colloquial forms rather than formal ones. If you actually say the sentence, “Are you going to give me a sandwich?”, it might come out, depending on your accent and where you learned English, something like “Are ya gonna gimme a sanwich?”
Which is the accurate transcription? The first sentence is formally correct — it follows standard rules of English grammar and spelling. The second is also correct, in that it accurately reflects exactly which sounds were said. You have to make a decision, because you can’t have both in a transcript.
This is a simple example, but it’s the kind of thing we have to deal with to keep our data consistent, so that Voci’s ASR transcripts are consistent and accurate too. We need to be patient, analytical, and focused.
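To make the sandwich example concrete, here is a minimal Python sketch of how a guideline like this could be checked and applied automatically. It assumes the guideline says "always use the formal spelling". The mapping table and the function names are made up for illustration and are not Voci's actual guidelines or tooling.

```python
import re

# Hypothetical guideline table: colloquial spelling -> formal spelling.
# These entries are illustrative only, taken from the sandwich example.
COLLOQUIAL_TO_FORMAL = {
    "ya": "you",
    "gonna": "going to",
    "gimme": "give me",
    "sanwich": "sandwich",
}

# A "word" here is a run of letters and apostrophes; punctuation is left alone.
_WORD = re.compile(r"[A-Za-z']+")


def find_colloquial(transcript: str) -> list[str]:
    """Return the words in the transcript that the guideline says to rewrite."""
    return [w for w in _WORD.findall(transcript)
            if w.lower() in COLLOQUIAL_TO_FORMAL]


def normalize(transcript: str) -> str:
    """Rewrite colloquial forms to the formal forms chosen by the guideline."""
    def replace(match: re.Match) -> str:
        word = match.group(0)
        formal = COLLOQUIAL_TO_FORMAL.get(word.lower())
        if formal is None:
            return word  # not covered by the guideline, keep as transcribed
        if word[0].isupper():
            # Keep a leading capital, e.g. at the start of a sentence.
            formal = formal[0].upper() + formal[1:]
        return formal

    return _WORD.sub(replace, transcript)


if __name__ == "__main__":
    spoken = "Are ya gonna gimme a sanwich?"
    print(find_colloquial(spoken))
    print(normalize(spoken))
```

A lookup table like this only covers forms someone has already thought of. In practice a person still has to decide what goes into the table, which is where the linguistics training comes in.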
Sometimes, fortunately, we can have some fun with the unusual ways people use language. Let me finish with one of my favorite examples. We had an agency transcript come to us with the words “sharky darn”, and none of us could figure out what that was supposed to be. Eventually, we realized this was a mistranscription of the phrase “shucky darn”, which is used in parts of the Midwestern states. It’s not a phrase I use, but I’ll probably never forget it!