Audio Quality and Word Error Rate: How to Get the Best From Your Speech Recognition System
In the digital audio domain, audio quality is an assessment of the accuracy, fidelity, and intelligibility of audio output from an electronic device. Given that audio is the primary input to an Automatic Speech Recognition (ASR) system, audio quality is undeniably important to maximize the performance of such systems. The poorer the audio quality, the more difficult it is for ASR systems to transcribe, which will lead to a less accurate transcript.
Part of my role at Voci is assessing audio quality for our customers and making individualized recommendations that help them optimize their audio and generate the most accurate transcripts.
So, in this first blog, I’m going to be talking about the best audio-quality practices to ensure that your ASR system performs at its best.
If you're recording and producing music, no amount of post-processing can fix the problems arising from poorly recorded instruments. Whatever the cause (e.g., poorly placed microphones, background noise, reverberation), if the original recording is poor, very little can be done in post-production to repair it. This translates almost exactly to speech-to-text technology. Poor-quality source audio impacts word error rate (WER) more critically than any other factor.
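For readers who haven't met WER before, it is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the ASR output into a reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that standard definition using a word-level edit distance. The function name and the lowercase/whitespace normalization are assumptions made for the example, not a description of any particular scoring tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    # Assumed normalization for this sketch: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of four reference words.
wer = word_error_rate("please hold the line", "please hold line")
```

Here `wer` is 0.25: a single deletion against a four-word reference. Note that WER can exceed 1.0 when the output contains many inserted words.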
Here’s why. The human brain possesses the ability to focus auditory attention on a particular stimulus while filtering out a range of other stimuli. For example, a partygoer can focus on a single conversation in a noisy room. (Given the prevalence of this kind of example, the ability is often referred to as the “cocktail party effect”.)
ASR systems can't do what we can and focus attention on a specific stimulus. They treat all stimuli as acoustic input. Introducing distorting acoustic backgrounds, such as in the cocktail-party example above, will significantly degrade ASR effectiveness. For some sources of background noise (TVs, radio/music, traffic noise, etc.), the recognition engine may be able to tune the noise out or work around it.
However, sources that share the same frequency range as speech are significantly more challenging. For example, consider nearby call center agents whose voices are picked up by the microphone. A human could easily distinguish between the voices; an ASR system will find it very difficult. Any background noise that the ASR system cannot quickly and reliably filter out can adversely impact WER. Therefore, high-quality recordings are important.
Transcoding is the direct digital-to-digital conversion of one form of encoding to another – that is, changing a file from one format to another. Many call recording systems will do this to maximize the use of digital storage space.
In audio transcoding, there are four transcode types (lossless-to-lossless, lossless-to-lossy, lossy-to-lossy, and lossy-to-lossless), each of which has different implications for the success of ASR transcription:
Lossless-to-lossless transcoding is the only safe and recommended form of transcoding because no audio information is lost during the process. For example, transcoding a .wav file to a .flac file applies lossless compression, which is commonly used to save disk space without compromising quality. A 10-minute, mono .wav file at 8-bit/16 kHz is ~9.8 MB on disk, whereas the same file after flac compression is ~5.6 MB.
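As a rough sanity check on the .wav figure, the size of uncompressed PCM audio is simply duration × sample rate × channels × bytes per sample. The sketch below applies that arithmetic to the 10-minute mono example; the helper name is made up for illustration, and the calculation ignores the small file header.

```python
def pcm_size_bytes(duration_s: int, sample_rate_hz: int,
                   bit_depth: int, channels: int = 1) -> int:
    """Size of raw (uncompressed) PCM audio, excluding any file header."""
    return duration_s * sample_rate_hz * channels * bit_depth // 8


# 10 minutes, mono, 8-bit samples at 16 kHz, as in the example above.
size = pcm_size_bytes(duration_s=600, sample_rate_hz=16_000, bit_depth=8)
size_mb = size / 1_000_000  # decimal megabytes

# Compression ratio implied by the quoted on-disk sizes (~5.6 MB vs ~9.8 MB).
flac_ratio = 5.6 / 9.8
```

This gives 9,600,000 bytes, or about 9.6 MB, which is in the same range as the ~9.8 MB quoted above. The remaining gap presumably comes from the exact duration or from how the size was measured, which is an assumption on my part. The quoted sizes imply that flac kept roughly 57% of the original size in this case; actual ratios depend on the audio content.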
By contrast, both forms of to-lossy transcoding (lossless-to-lossy and lossy-to-lossy) reduce quality. To make matters worse, compression artifacts are cumulative. This means that to-lossy transcoding causes a progressive loss of quality with each successive transcoding pass, which is known as “digital generation loss”. This process is irreversible and is thus also called “destructive transcoding”. For this reason, transcoding between or into lossy formats is strongly discouraged, and will probably create problems with the automatic transcription process.
Lossy-to-lossless transcoding (sometimes loosely called upsampling, although strictly speaking that term refers to raising the sample rate) is even worse. It combines the worst of both worlds. This process starts with the already poor audio quality of a lossy file and then adds the larger file size of a lossless file. Because the information lost during the (destructive) transcoding that created the lossy file in the first place is gone for good, transcoding the file to a lossless format just increases the file size without any improvement in quality.
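The point about lossy-to-lossless transcoding can be illustrated with a toy experiment. In the sketch below, the "lossy codec" is a deliberately crude stand-in that keeps only the top 4 bits of each 16-bit sample; it is hypothetical and not how any real codec works. The "lossless" step writes the decoded audio out as raw 16-bit PCM and reads it back. The error against the original is identical before and after, while the storage grows fourfold.

```python
import math
import struct

SAMPLE_RATE = 8000  # Hz, assumed for illustration

# One second of a 440 Hz tone as 16-bit PCM sample values.
original = [round(32767 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
            for n in range(SAMPLE_RATE)]


def toy_lossy_encode(samples):
    """Keep only the top 4 bits of each sample (hypothetical lossy codec)."""
    return [s >> 12 for s in samples]


def toy_lossy_decode(codes):
    """Expand the 4-bit codes back to 16-bit values; the low bits stay lost."""
    return [c << 12 for c in codes]


def rms_error(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


lossy_decoded = toy_lossy_decode(toy_lossy_encode(original))

# "Lossy-to-lossless": store the decoded lossy audio as raw 16-bit PCM.
fmt = f"<{len(lossy_decoded)}h"
pcm_bytes = struct.pack(fmt, *lossy_decoded)
reloaded = list(struct.unpack(fmt, pcm_bytes))

error_before = rms_error(original, lossy_decoded)
error_after = rms_error(original, reloaded)
lossy_size = len(original) * 4 // 8  # 4 bits per sample
lossless_size = len(pcm_bytes)       # 16 bits per sample
```

The lossless container faithfully preserves whatever it is given, including the damage already done by the lossy step.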
The choice of speech codec is a bit technical, but it's an important feature when looking at call recording technology.
The goal in speech coding is to minimize the distortion at a given bit rate or to minimize the bit rate at an acceptable level of distortion. However, the signal-to-noise ratio (SNR), an objective measure of this distortion, does not correlate well with perceived speech quality (https://ecs.utdallas.edu/loizou/cimplants/quality_assessment_chapter.pdf).
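For reference, SNR is usually expressed in decibels as ten times the base-10 logarithm of signal power over noise power. The sketch below computes it by treating the difference between a clean signal and its coded version as the noise; the function name and the tiny sample lists are assumptions for illustration only.

```python
import math


def snr_db(clean, degraded):
    """SNR in dB, treating (clean - degraded) as the noise."""
    signal_power = sum(x * x for x in clean)
    noise_power = sum((x - y) ** 2 for x, y in zip(clean, degraded))
    if noise_power == 0:
        return math.inf  # identical signals: no measurable distortion
    return 10 * math.log10(signal_power / noise_power)


clean = [1.0, -1.0, 1.0, -1.0]
degraded = [1.1, -0.9, 1.1, -0.9]  # every sample is off by 0.1
value = snr_db(clean, degraded)
```

With an error one tenth the size of the signal, the power ratio is 100, so `value` comes out at roughly 20 dB. A number like this says how large the distortion is, but nothing about how objectionable it sounds, which is why a subjective measure is used instead.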
So, speech coder performance is typically measured using a subjective scoring method called Mean Opinion Score (MOS). MOS is measured on a scale from 1 to 5. A value of 4.0-4.5 is referred to as “toll-quality” and represents complete satisfaction for the user. This is the normal value for the Public Switched Telephone Network (PSTN, the standard telephone network we all know). It is also the benchmark for most VoIP telephony service providers.
A MOS at or below 3.6, even though the speech is still intelligible, is considered unacceptable by many users.
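As the name suggests, a MOS is just the arithmetic mean of individual listener ratings. The sketch below assumes the conventional five-point listening scale from 1 (bad) to 5 (excellent) and uses the two thresholds mentioned above; the ratings, function names, and category labels are invented for illustration.

```python
from statistics import mean

TOLL_QUALITY_MIN = 4.0  # lower bound of the toll-quality range quoted above
UNACCEPTABLE_MAX = 3.6  # at or below this, many users consider it unacceptable


def mean_opinion_score(ratings):
    """Average listener ratings given on a 1-to-5 scale."""
    if not ratings or any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be non-empty and lie between 1 and 5")
    return mean(ratings)


def describe(mos):
    if mos >= TOLL_QUALITY_MIN:
        return "toll quality"
    if mos <= UNACCEPTABLE_MAX:
        return "unacceptable to many users"
    return "below toll quality"


ratings = [4, 5, 4, 4, 5, 4]  # hypothetical panel of six listeners
mos = mean_opinion_score(ratings)
label = describe(mos)
```

For this made-up panel the MOS is about 4.33, which falls in the toll-quality range.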
G.711 is the ideal codec for optimum ASR performance. In theory, no telephony codec performs better than G.711, as it offers the best quality, no perceptual or predictive compression (only logarithmic companding), and the lowest algorithmic delay. In cases where G.711 cannot be used, choose codecs that are consistently rated above 4.0 on MOS evaluations, such as G.726, G.722.1, and GSM-EFR.
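G.711 stores each 8 kHz sample in 8 bits (64 kbit/s) using logarithmic companding, in either a μ-law or an A-law variant. The sketch below shows the continuous μ-law curve with μ = 255 to give a feel for how quiet samples are given more of the available range. It is only the textbook formula, not the bit-exact segmented encoding that the standard specifies, and the function names are illustrative.

```python
import math

MU = 255  # parameter of the mu-law variant


def mu_law_compress(x: float) -> float:
    """Map a sample in [-1, 1] onto the logarithmic mu-law curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_law_expand(y: float) -> float:
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


quiet = mu_law_compress(0.01)  # a sample at 1% of full scale
loud = mu_law_compress(1.0)    # a full-scale sample
round_trip = mu_law_expand(mu_law_compress(-0.5))
```

A sample at 1% of full scale lands above 20% of the companded range, so quiet speech is quantized much more finely than it would be with uniform 8-bit steps.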