What is Automatic Speech Recognition (ASR) and How Does It Work?

Jack Maden October 21, 2022

ASR is a rapidly evolving field with great customer service potential. Discover how it works.

Over the past decade, advances in machine learning have enabled businesses to improve their interactions with customers.

One of the most exciting ways organizations are doing this is by making the most of automatic speech recognition (ASR).

In this article, we’ll provide a basic overview of what it is and how it works. 

What is Automatic Speech Recognition (ASR)?

Put simply, automatic speech recognition (ASR) is the use of software and hardware to convert human speech into written text. It draws on machine learning and combines several fields of research, including computer science, engineering, and linguistics.

ASR has many other names, like speech-to-text (STT), computer speech recognition, or even just speech recognition.

Understandably, this also means that there is some uncertainty about what ASR is and isn’t. 

Some organizations confuse ASR with natural language processing (NLP). Although the terms often go hand in hand, they are separate processes. NLP is about extracting meaning from text data to drive other actions. ASR makes this possible in the first place by converting speech data into text.

ASR is also quite different from voice recognition. ASR converts what is being said into text, whereas voice recognition only seeks to identify who is speaking.

A brief history of ASR

Automatic speech recognition has existed since the 1950s, but it only started becoming more popular within the last decade or so. 

The first recorded instance of this technology was a system named “Audrey”, built by three researchers from Bell Labs (a well-known scientific development company). At the time, the system could only transcribe single-digit numbers. 

The next breakthrough came around the late 1980s, when researchers began using artificial neural networks to power their ASR engines, which helped them differentiate between sounds more effectively. However, these systems were not yet suitable for enterprise use cases, as they couldn’t transcribe whole sentences very accurately.

Fortunately, ASR is now more suitable for commercial use than ever. Rising expectations for accurate ASR have pushed vendors to improve the speed, accuracy, and flexibility of their speech-to-text solutions. 

For this reason, ASR continues to be a rapidly growing and evolving field. 

What types of organizations use ASR?

Automatic speech recognition can be used by any company. After all, easy access to transcriptions can help businesses make better-informed decisions, regardless of industry. 

Currently, ASR is most commonly used by customer-facing organizations, especially those with contact centers. By using a smart real-time engine, they can optimize their operations, improve their agents’ performance, and deliver better customer service—ultimately helping them to earn more revenue. 

Still, as ASR accuracy has improved, more and more industries are finding their own uses for this technology. Healthcare organizations, for example, now use ASR to transcribe phone appointments. Other industries adopting it include finance, insurance, and business process outsourcing (BPO).

How does ASR work?

ASR engines work by receiving audio input, processing it, and producing an output. 

The computer takes in the waveform of your speech. Then it breaks that up into words, which it does by looking at the micro pauses you take in between words as you talk.
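As a rough illustration of the pause-based splitting described above, here is a minimal Python sketch. It assumes the audio arrives as a list of float samples. The function name `split_on_pauses`, the frame size, and the energy threshold are hypothetical choices for this example, not values taken from any particular engine.

```python
def split_on_pauses(samples, frame_size=4, threshold=0.01, min_pause_frames=2):
    """Return (start, end) sample indices of loud segments separated by pauses."""
    segments = []
    start = None        # start index of the segment being built, if any
    quiet_run = 0       # consecutive quiet frames seen inside a segment
    last_loud_end = 0   # end index of the most recent loud frame
    for i in range(0, len(samples), frame_size):
        frame = samples[i:i + frame_size]
        # Average energy of the frame: quiet frames score close to zero.
        energy = sum(s * s for s in frame) / len(frame)
        if energy >= threshold:
            if start is None:
                start = i
            quiet_run = 0
            last_loud_end = i + len(frame)
        elif start is not None:
            quiet_run += 1
            # A long enough quiet stretch closes the current segment.
            if quiet_run >= min_pause_frames:
                segments.append((start, last_loud_end))
                start = None
                quiet_run = 0
    if start is not None:
        segments.append((start, last_loud_end))
    return segments


# Toy signal: two bursts of "speech" separated by silence.
loud = [0.5, -0.5, 0.5, -0.5]
quiet = [0.0, 0.0, 0.0, 0.0]
signal = loud * 2 + quiet * 3 + loud * 3
segments = split_on_pauses(signal)
print(segments)
```

Real speech rarely has clean silences between every word, so production engines generally need more than a simple energy threshold. The sketch only shows the basic idea of finding boundaries in a waveform.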

The most advanced engines on the market today use recurrent neural networks (RNNs). Unlike feedforward neural networks, RNNs keep a memory of what came earlier in the sequence, so each prediction can draw on the context of the sounds before it.
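To make the idea of "memory" more concrete, here is a minimal sketch of a single recurrent unit in Python. The weights are hand-picked, hypothetical values rather than trained ones, and a real engine would use many units and learned parameters. The point is only that the same input can produce a different result depending on what came before it.

```python
import math


def rnn_step(x, h_prev, w_in=0.8, w_rec=0.5, bias=0.0):
    """One step of a single-unit recurrent cell.

    The new state mixes the current input with the previous state,
    which is how earlier inputs influence later predictions.
    """
    return math.tanh(w_in * x + w_rec * h_prev + bias)


def run_rnn(inputs, h0=0.0):
    """Feed a sequence through the cell; return the state after each step."""
    states = []
    h = h0
    for x in inputs:
        h = rnn_step(x, h)
        states.append(h)
    return states


# The final input is 1.0 in both sequences, but the histories differ.
after_quiet = run_rnn([0.0, 0.0, 1.0])
after_loud = run_rnn([1.0, 1.0, 1.0])
print(after_quiet[-1], after_loud[-1])
```

The two final states differ even though the last input is identical, because the hidden state carries information forward from earlier steps.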

RNNs need to process audio inputs (audio wave samples) as numeric data. To make this possible, the engine converts the raw samples into numeric features, typically by applying a mathematical transform such as a Fourier transform.
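The article does not say which calculation a given engine uses. One common choice is to apply a Fourier transform to short frames of audio, which turns each frame into a list of numbers describing how much energy sits at each frequency. The sketch below assumes that approach and uses a deliberately naive transform for readability. The sample rate, frame size, and test tone are made-up values for illustration.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Naive discrete Fourier transform; returns magnitudes for bins 0..N/2."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(
            frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)
        )
        mags.append(abs(total))
    return mags


SAMPLE_RATE = 8000  # samples per second (assumed for this example)
FRAME_SIZE = 64     # samples per frame (assumed for this example)
TONE_HZ = 1000      # frequency of the synthetic test tone

# One frame of a pure 1000 Hz tone standing in for real audio.
frame = [
    math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE) for t in range(FRAME_SIZE)
]

mags = dft_magnitudes(frame)
# The strongest bin tells us which frequency dominates the frame.
peak_bin = max(range(len(mags)), key=lambda k: mags[k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
print(peak_bin, peak_hz)
```

A production system would use a fast Fourier transform and usually further processing on top of it, but the output has the same shape: a sequence of numeric vectors, one per frame, that a neural network can consume.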

What happens next depends on the type of ASR system. Some engines rely on a phonetics-based system, which means that the audio wave samples registered by the software won’t be converted and shown as a set of letters. The output will simply be a set of sounds and phonemes. 

By contrast, a full-text system like Voci goes one step further and compares the phonemes to a dictionary of known sounds or letters to determine what is being said in each sample. In addition, a language model is used to test the sequence of phonemes, words, and sentences. This is to make sure everything is as accurate and realistic as possible. 
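The dictionary lookup and language model check can be sketched in a few lines of Python. This is a toy illustration, not a description of how Voci or any other vendor implements it. The pronunciation dictionary, the phoneme symbols, and the bigram scores below are all invented for the example.

```python
from itertools import product

# Hypothetical pronunciation dictionary: phoneme sequence -> candidate words.
PRONUNCIATIONS = {
    ("AY",): ["I", "eye"],
    ("R", "AY", "T"): ["write", "right"],
    ("T", "UW"): ["to", "two", "too"],
    ("Y", "UW"): ["you"],
}

# Hypothetical language model: how plausible it is for one word to follow another.
BIGRAM_SCORES = {
    ("<s>", "I"): 0.6,
    ("I", "write"): 0.3,
    ("write", "to"): 0.5,
    ("to", "you"): 0.4,
    ("right", "to"): 0.2,
    ("eye", "right"): 0.01,
}
DEFAULT_SCORE = 0.001  # score given to word pairs the model has never seen


def score(words):
    """Multiply bigram scores along the sentence, starting from <s>."""
    total = 1.0
    prev = "<s>"
    for word in words:
        total *= BIGRAM_SCORES.get((prev, word), DEFAULT_SCORE)
        prev = word
    return total


def decode(phoneme_groups):
    """Look up candidate words for each group, then keep the likeliest sentence."""
    candidates = [PRONUNCIATIONS[tuple(group)] for group in phoneme_groups]
    return max(product(*candidates), key=score)


best = decode([["AY"], ["R", "AY", "T"], ["T", "UW"], ["Y", "UW"]])
print(" ".join(best))
```

The dictionary alone cannot choose between "write" and "right" because they sound the same. The language model settles it by preferring the word sequence that is more likely to occur.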

What are the key features of an effective ASR engine?

There are many ASR vendors on the market, so choosing the right ASR engine can be a real challenge. Here are some of the most important features to look out for. 


Speed

An effective ASR engine needs to be fast. ASR engines have a knock-on effect on the technology solutions they power: if speech data takes a long time to be converted into text, organizations have to wait longer before they see the benefits. A real-time engine with a high processing speed is therefore ideal.


Scalability

As contact centers grow, they need an engine that can scale with them. Most ASR vendors don’t openly disclose how much their deployments actually cost, so it’s key that prospects do their research and compare efficiency, total cost of ownership (TCO), and on-premises versus cloud deployments.


Accuracy

In this field, accuracy means how precisely the ASR engine converts human speech into text. One metric that organizations commonly use to evaluate an engine is word error rate (WER): the share of words the engine gets wrong when its output is compared with a reference transcript. WER is expressed as a percentage; the lower it is, the more accurate the engine.
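WER is conventionally calculated as the number of word substitutions, deletions, and insertions needed to turn the engine's output into the reference transcript, divided by the number of words in the reference. The sketch below is one straightforward way to compute it in Python. The function name and the sample sentences are made up for illustration, and real evaluations usually normalize punctuation and casing first.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # dist[i][j] = fewest edits to turn the first i reference words
    # into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[len(ref)][len(hyp)] / len(ref)


wer = word_error_rate(
    "please cancel my order today",  # what the customer actually said
    "please cancel the order",       # what the engine transcribed
)
print(f"WER: {wer:.0%}")
```

Here the engine substituted one word and dropped another, so two of the five reference words are wrong and the WER is 40%.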


Flexibility

Language is never uniform; contact center agents will inevitably come across differences in customers’ accents and dialects. Some customers may even switch between multiple languages in the same conversation. Because of this, the ideal ASR engine should be able to recognize and accurately transcribe different linguistic varieties.

Voci: the world’s most accurate, scalable, and intelligent ASR engine

The field of ASR is transforming rapidly, and Voci is at the forefront of it. 

Built for scale, Voci is the world’s most efficient transcription engine, guaranteeing the lowest total cost of ownership (TCO) of any provider, as well as some of the fastest, smartest, and most accurate transcription available. With support for 30 languages, contact centers can transcribe human voices with confidence and unlock the full potential of customer data from all over the world.

Key takeaways

  • Automatic speech recognition (ASR) is the use of software and hardware that turns speech data into text. 
  • ASR can be used by all customer-facing organizations. Contact centers, in particular, can benefit from solutions that use ASR engines.
  • ASR engines work by converting audio wave samples into a set of numbers. The numbers are then matched with a set of words or phrases, producing the output. Language models are used to verify the accuracy and likelihood of the output. 
  • When choosing between ASR engines, look for key factors such as speed, scalability, accuracy, and flexibility. 

Jack Maden

Jack Maden is Director of Content Marketing at Medallia, with a focus on producing educational materials for the Automatic Speech Recognition (ASR) space.
