Clear Communication for Speech Recognition: The Importance of Information Representation

Kathy Brigham November 11, 2019

One of the most fascinating aspects of life is the act of communication. Without it, we could not interact with each other. And communication can occur in so many forms, from a simple wink to a loud cry. Even a long pause can convey some meaning. Communication can also take on many subtle qualities – it can be enlightening, deceptive, persuasive, offensive.

But at its core, communication is about the transfer of information. Of particular interest to me is the act of encoding and decoding the information that is transferred during acts of communication. Specifically, what’s the best way to represent the information that we want to communicate? What words should we use? Should we use any words at all, or communicate non-verbally?

And, on the other side of the communication, how does the receiver (i.e., the person we are talking to) then extract the information we were trying to convey? Have we chosen the best way to encode the information so that the receiver can easily decode it? Have we created confusion or been unclear?

One of the most prevalent ways to encode information is, of course, with language. For example, I can encode my thoughts into spoken words for an intended receiver, and this person will hear them, hopefully understand the meaning of these words, and then react accordingly. Already, this is a lot: thoughts are being encoded into speech, that speech is being received and encoded back into thoughts, and then action is being taken on the basis of those thoughts. Every step in the process requires lots of cognitive processing and neural activity, which can be beyond what even the most advanced computer can do. (This is an enormously complex topic. See past blog posts from Catherine, Gabe and Neil for further discussion.)

And in this act of verbal communication, there is also a lot of additional information being communicated through means other than my spoken words, including nonverbal communications during our interaction. This could include details of my posture, my facial expressions, gestures, and even the volume of my voice — all of which may aid in facilitating the successful communication of my thoughts.

In short, how information is encoded in communication makes a difference to whether or not the communication is successful. Spoken words are a huge part of how we successfully communicate with each other, on a one-on-one basis.

However, that’s not the only kind of communication we engage in. A lot of communication, especially in the business environment, is many-to-one — such as when customers talk to a company. Suppose, for example, that there are a million of us who want to express our thoughts — we all have tested a particular product and are giving our feedback to the company that made it.

It’s probably both time- and cost-prohibitive for humans to listen to and process the thoughts of every one of these million people. Fortunately, the company can employ software to assist and speed up the process.

We, the million people, encode our thoughts into spoken words and communicate them to the company by calling up and having our voices recorded. Those voice recordings can then be encoded by a software system and converted into a different representation, namely a sequence of numbers. (For example, numbers can be used to represent the sound pressure created by each voice at a particular instance in time.)

The company can then feed all these numbers to a computer, and ask the computer to automatically decode, or extract, information that the sequence of numbers represents. We could, for instance, run ASR and convert this sequence of numbers into a sequence of words and sentences – that is, into readable text.

Text — the words actually said — isn’t the only thing the company might want to know. There’s lots of other information being communicated when the million of us call the company, and the company might want that information as well. For example, the company might also be interested in the gender of the speaker. This could be used for marketing purposes, as marketing would probably want to know if the product did well with males, but wasn’t popular with females.

In order to do this successfully, the gender information has to be encoded in a form that allows the computer to decode it effectively. If the information is encoded in a format which the computer can’t work with, then marketing is going to be disappointed. (Just as, when I’m writing this blog, if I use terms you don’t understand to communicate my thoughts, you won’t decode the information.)

Waveforms are one way to encode speech. So, here’s an example waveform:

Waveform

Is this a male or a female voice? Do you have any idea what the gender is? I can’t tell — and I work on this for a living.

So, waveforms probably aren’t a good way to encode speech if the information we want to extract is gender. A better approach draws from how we humans identify gender based on voice in ordinary circumstances. We usually figure it out based on the pitch of the voice (i.e., the fundamental frequency). And pitch is usually a helpful indicator of gender because women generally produce a higher vocal pitch than men. 

Instead of representing speech as a waveform, then, we could represent it with the following bits of numerical information:

  • Typical pitch values are 120Hz for men and 200Hz for women.
  • The estimated pitch of this voice is 130Hz.

With this set of information, a computer can provide a better guess as to the gender of the speaker – it seems more likely now that the speaker is male.

(It’s important to note that this information isn’t more than what was already in the waveform. The pitch information could be extracted from the waveform, but representing it as a waveform means it’s represented in a way that’s very difficult to see.)

The representation of information is a critically important factor for successfully transferring information. The selection of representations affects whether you are able to convey the information you want, and whether the receiver of the communication can extract the information. When working on ASR software, as we do at Voci, we are always considering whether we are converting the information we have into a format that is most conducive to obtaining the desired result.

Kathy Brigham

Kathy Brigham, M.Eng, PhD, is a senior scientist working in research and development, whose responsibilities include the curation of scientific literature, prototype development (including using 3rd party software) / feasibility studies, and mathematics development / documentation. At Voci, she has primarily been focusing on voice biometrics and exploring speaker diarization technologies. She also has previous experience as a systems engineer in military defense, working in hardware, systems integration & test, system design, and diagnostic software, and also in academia as a teaching fellow in electrical engineering.

Access our ASR API

With up to 1000 hours of audio at no charge