When most people hear about automatic speech recognition (ASR), they immediately think of home devices like Alexa or Google Home. These sit on a table in the corner or in the kitchen, and if you speak loudly and clearly enough, you can order them to perform all kinds of tasks, such as playing music, turning on the lights, or ordering a pizza.
We’ve all seen the commercials, after all. Say the command phrase. Give simple, direct instructions. Don’t hesitate or pause for too long. Speak clearly and loudly. And then the magic happens.
So it makes sense that this is where the mind goes first. These devices are becoming ubiquitous, which makes them the kind of automatic speech recognition most of us encounter in day-to-day life.
They’re also easy to use, so most experiences we have with them are pretty pleasant. (Unless you have an accent that the devices haven’t been trained to recognize!) This is partly due to recent advances in speech recognition algorithms, which make it possible to build small devices that can nonetheless recognize a wide range of spoken words and phrases.
It’s also due to some carefully designed marketing. We’ve been taught how to use our home devices through commercials, through product placement in television shows and movies, and explicitly through manuals and how-to videos. So we find using them as easy as flipping a light switch.
All that being said, there’s a lot about the way humans use language every day that home devices can’t handle (yet). For example, we naturally and effortlessly hold conversations that depend on prior knowledge or shared experience to be meaningful. We probably have dozens of conversations, if not hundreds, in a single day. But even the simplest of them, the kind you might have with your co-workers when you arrive at work in the morning, is currently beyond the capability of today’s intelligent home devices.
There are a couple of reasons for this. First, home devices are designed to handle voice-activated, one-way communication only: that is, from us to the device. Second, what we can communicate to these devices is limited to requesting an action, such as “turn on the lights” or “tell me a joke.” If you aren’t telling them to do something, they either won’t respond or will give a generic error response.
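To make that limitation concrete, here is a minimal Python sketch of a command-only handler. It is an illustration, not how any real device is implemented: the wake phrase, the command table, the action names, and the fallback message are all hypothetical, and the input is assumed to be text that a speech recognizer has already produced. The mapping of "no wake phrase" to silence and "unknown request" to a generic error is likewise an assumption made for the example.

```python
import re

# Hypothetical wake phrase; real devices use their own.
WAKE_PHRASE = "hey device"

# Hypothetical table mapping exact request phrases to action names.
COMMANDS = {
    "turn on the lights": "lights_on",
    "tell me a joke": "tell_joke",
    "play music": "play_music",
}

# Hypothetical generic error response.
FALLBACK = "Sorry, I didn't understand that."


def handle_utterance(text):
    """Return an action name, a generic error, or None (no response)."""
    # Lowercase, drop punctuation, and collapse repeated whitespace.
    normalized = re.sub(r"[^a-z\s]", "", text.lower())
    normalized = re.sub(r"\s+", " ", normalized).strip()

    # Without the wake phrase, the device stays silent.
    if not normalized.startswith(WAKE_PHRASE):
        return None

    # Everything after the wake phrase must match a known command exactly.
    request = normalized[len(WAKE_PHRASE):].strip()
    return COMMANDS.get(request, FALLBACK)
```

A request such as "Hey device, turn on the lights" maps to an action, while "Hey device, how was your weekend?" falls through to the generic error, and the same question without the wake phrase gets no response at all. Nothing in this design keeps track of earlier turns or outside context, which is exactly what ordinary conversation depends on.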
Overall, more complex speech recognition applications require more advanced software: software that can apply historical experience, situational context, and critical thinking the way humans do when engaging in conversation. With that improved capability, many more functions become available, including transcribing calls in light of the broader context around the conversation, supporting call center QA over time, and providing the kind of customer insights that otherwise only experienced call center agents can offer after years on the job.