Navigating the Market: Getting Started with Automatic Speech Recognition

Kevin Stay, August 12, 2019

If you’ve looked into Automatic Speech Recognition (ASR) or speech to text (STT), you know that the industry is constantly changing. Regardless of where your business is — investigating options, considering a replacement, or totally new to the field — you need a good plan in place in order to navigate the market.

In this post, I’m going to look at some questions and ideas that will help if you’re just starting out with STT: you have some idea of what STT is, and you’re interested in a solution. Your choice of vendor can have a meaningful impact on the type of solution you can offer. Unlike some software, speech to text is not just another commodity.

First things first: you should have a clear set of criteria that you’re going to use to evaluate potential vendors. There’s such a diverse range of options available in the market that you can quickly be sidetracked looking at technology that’s cool and interesting, but may not suit the needs of your business.

If you’re looking at STT to enable speech analytics capabilities, then you need to investigate how easily the STT solution can serve that need. Is there development time (and cost) involved in order to get the solution’s transcripts to work with your preferred speech analytics solution? Is there only a limited range of analytics vendors that the STT solution works with, or is the text provided in an open format?
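To give a feel for the integration work involved, here is a minimal sketch of the glue code you might write when the transcript arrives in an open format. The JSON layout (an `utterances` list with `speaker`, `start`, `end`, and `text` fields) is a made-up example, not any particular vendor’s schema, and the CSV target simply stands in for whatever your analytics tool imports.

```python
import csv
import io
import json

# Hypothetical transcript payload; real vendors will use their own field names.
SAMPLE_TRANSCRIPT = json.dumps({
    "utterances": [
        {"speaker": "agent", "start": 0.0, "end": 2.4, "text": "Thanks for calling."},
        {"speaker": "caller", "start": 2.9, "end": 5.1, "text": "I have a billing question."},
    ]
})


def transcript_to_csv(transcript_json: str) -> str:
    """Flatten a JSON transcript into CSV rows, one row per utterance."""
    data = json.loads(transcript_json)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["speaker", "start", "end", "text"])
    for utterance in data["utterances"]:
        writer.writerow([
            utterance["speaker"],
            utterance["start"],
            utterance["end"],
            utterance["text"],
        ])
    return buffer.getvalue()


if __name__ == "__main__":
    print(transcript_to_csv(SAMPLE_TRANSCRIPT))
```

If the conversion is this small, the development cost is close to nothing. If the transcript is locked in a proprietary format, expect the equivalent step to take real engineering time.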

Alternatively, maybe you need to look at what your competitors are doing in the STT or speech analytics space. If you need to catch up, then you need a clear sense of what their capabilities are, and which solutions will give you a path to staying competitive.

Once you have defined your needs generally, it’s time to get specific. STT vendors put forward a lot of different features in order to differentiate their products. Not all of these are going to be useful to you, so you will need to decide which ones matter.

One simple consideration to take into account: how many hours of calls will you need to have transcribed annually? This can have significant pricing implications, as some vendors will dramatically raise the price if you go over your hours. At the same time, you don’t want to be buying hours that you aren’t going to use. 
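The trade-off between overage charges and unused hours is easy to check with a little arithmetic. The sketch below assumes a simple pricing model (a flat annual price that includes a block of hours, plus a per-hour overage rate); the plan names and all of the numbers are invented for illustration and do not reflect any vendor’s actual pricing.

```python
def annual_cost(hours_used, hours_included, base_price, overage_rate):
    """Flat base price plus a per-hour charge for any hours beyond the plan."""
    overage_hours = max(0.0, hours_used - hours_included)
    return base_price + overage_hours * overage_rate


def cheapest_plan(hours_used, plans):
    """Return the plan with the lowest total cost for the given usage."""
    return min(
        plans,
        key=lambda plan: annual_cost(
            hours_used,
            plan["hours_included"],
            plan["base_price"],
            plan["overage_rate"],
        ),
    )


# Hypothetical plans; substitute the quotes you actually receive.
PLANS = [
    {"name": "small", "hours_included": 10_000, "base_price": 20_000.0, "overage_rate": 4.0},
    {"name": "large", "hours_included": 50_000, "base_price": 60_000.0, "overage_rate": 2.0},
]

if __name__ == "__main__":
    for hours in (8_000, 15_000, 25_000):
        plan = cheapest_plan(hours, PLANS)
        cost = annual_cost(hours, plan["hours_included"], plan["base_price"], plan["overage_rate"])
        print(f"{hours} hours: {plan['name']} plan, ${cost:,.0f}")
```

With these made-up numbers, the small plan is still cheaper at 15,000 hours even after overage, but by 25,000 hours the large plan wins. Running your own volume estimate through each vendor’s quote makes that crossover point visible before you sign.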

You also need to think about transcription speed, and how that works with speech or voice analytics. Some software can analyze many calls at once and return an output in a matter of seconds or minutes, while others require far more processing time.

Processing fast also means processing efficiently. An STT system that maximizes throughput efficiency is going to require far less hardware or cloud infrastructure. Even if you're not managing your own hardware or cloud, you will wind up paying those costs in the form of hourly transcription rates.
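A back-of-the-envelope calculation shows why throughput matters so much. The sketch below assumes you can describe a system by a single speedup figure (hours of audio one server transcribes per hour of wall-clock time) and that the work spreads evenly across servers; both are simplifications, and the figures are illustrative rather than measured.

```python
import math


def servers_needed(audio_hours_per_day, speedup, window_hours=24.0):
    """Estimate how many servers are needed to transcribe a day's audio.

    speedup: hours of audio one server transcribes per wall-clock hour
             (an assumed figure; measure it for the system you evaluate).
    window_hours: how many hours per day the servers are allowed to run.
    """
    if speedup <= 0 or window_hours <= 0:
        raise ValueError("speedup and window_hours must be positive")
    capacity_per_server = speedup * window_hours
    return math.ceil(audio_hours_per_day / capacity_per_server)


if __name__ == "__main__":
    daily_audio = 5_000  # hypothetical hours of call audio per day
    for speedup in (10, 100):
        count = servers_needed(daily_audio, speedup)
        print(f"speedup {speedup}x: {count} servers")
```

For 5,000 hours of audio per day, a system running at 10x needs 21 servers under these assumptions, while one running at 100x needs 3. That difference ends up in your bill whether you own the hardware or not.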

What speed you need depends on what kind of business you have. If you need to respond quickly to issues as they develop, focus on post-call solutions that work efficiently and can adapt to real-time when your business is ready for it. (A real-time solution requires extensive hardware infrastructure to access audio streams.) On the other hand, if your main concern is longer-term trends, such as how customer sentiment is changing month to month, then post-call probably makes more sense.

One final feature to consider is the architecture of the solution. Do you need a flexible architecture, which allows you to run STT either in the cloud or on-premises? Are you able to deploy in your own cloud infrastructure? Do you have compliance requirements which make a cloud-based solution unworkable? (On this point, let me note that there are vendors, including some big, household names, that require that your STT solution run in the cloud.) What are the cost implications of going on-premises as opposed to running in the cloud?

While these may seem like points that only IT or the CFO would care about, it’s important to settle them early, before you get too far into buying a solution that won’t run the way you need it to.

Finally, you need to get very specific and very fine-grained on how you’re going to use speech analytics or STT. If you’re trying to get an STT solution in order to enable speech analytics, then what do you want speech analytics for? Do you have a particular vertical that you need to focus on? Are you interested in sentiment and emotion analysis? Are you looking for whether agents are using (or avoiding) certain key phrases? Do you want to improve agent coaching and training?

Whatever your use case or use cases, you should know what they are, and look for vendors with expertise and experience in serving those needs.

Kevin Stay

Kevin Stay is an Account Executive at Voci Technologies. He previously worked as a pollster and campaign consultant on campaigns ranging from city-wide ballot initiatives to races for governor. In his free time, Kevin hosts pub trivia and regularly performs with an improv comedy troupe.
