I’ve Transcribed Speech, But What Can I Do With It?

The Intern's Guide to Everything STT - Part 2
Corinne Hebestreit January 05, 2019

There’s a whole set of terms which we use to talk about doing more than just plain transcription. This gets us into a set of features in ASR software that identify the context of speech, as well as identifying speakers and allowing for classification of calls. You should know that a lot of these functions are much easier for full-text ASR systems to do. 

For example, speech transcription systems can perform sentiment labeling. The system can analyze calls to determine whether customers and agents are expressing positive or negative sentiments. This allows companies to quickly identify whether customers calling in are happy or unhappy.

Emotional analysis is similar, but it goes a step further. It allows a company to pinpoint the type of emotion: is the customer experiencing negative or positive emotions, and is the customer’s emotional state worsening or improving over the course of the call? With this information, companies can develop policies to encourage behavior that is good for business, and discourage behavior that isn’t.
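To make the "worsening or improving" idea concrete, here is a minimal sketch. It assumes that each utterance in a call has already been given a score between -1 (very negative) and +1 (very positive). The scores, the function name and the tolerance value are all made up for illustration, and a real system would likely do something more sophisticated than comparing two halves of a call.

```python
from statistics import mean

def emotion_trend(scores, tolerance=0.1):
    """Compare the average score of the first and second half of a call.

    `scores` is a hypothetical list of per-utterance scores, in call order.
    """
    if len(scores) < 2:
        return "unknown"  # not enough utterances to see a trend
    half = len(scores) // 2
    change = mean(scores[half:]) - mean(scores[:half])
    if change > tolerance:
        return "improving"
    if change < -tolerance:
        return "worsening"
    return "steady"

# A caller who starts out unhappy and ends the call in a better mood.
call_scores = [-0.6, -0.4, 0.1, 0.5]
print(emotion_trend(call_scores))
```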

And then there’s gender identification. While it’s obviously not perfect, an ASR system can predict, with a very high degree of accuracy, whether the speaker is male or female, just based on the pitch of the voice.

These forms of analysis can be performed by an ASR system in a couple of ways. The first way is by acoustic analysis. This works by measuring specific features of the audio, such as tone of voice, pitch, volume, intensity and rate of speech. The idea is that by looking at audio features, the system can determine the sentiment, emotion or gender of the speaker. For example, someone who is angry may speak faster or louder or with a higher pitch, which could be identified as negative in the transcript.
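Here is a rough sketch of what "measuring features of the audio" can look like in code. It assumes the audio is already available as a plain list of sample values, and it uses two deliberately simple measurements: root-mean-square amplitude as a stand-in for volume, and a zero-crossing count as a crude pitch estimate. The function names and the synthetic test tones are invented for illustration. Real speech is far messier than a pure tone, so production systems rely on more robust techniques.

```python
import math

def rms_volume(samples):
    """Root-mean-square amplitude, a rough proxy for loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def estimate_pitch_hz(samples, sample_rate):
    """Crude pitch estimate: count sign changes (only sensible for clean tones)."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    duration_seconds = len(samples) / sample_rate
    # A full cycle of a tone crosses zero twice.
    return crossings / (2 * duration_seconds)

def sine_wave(freq_hz, amplitude, sample_rate=8000, seconds=1.0):
    """Generate a synthetic tone to stand in for recorded audio."""
    count = int(sample_rate * seconds)
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
        for i in range(count)
    ]

# Two made-up "voices": a quiet, low one and a loud, higher one.
calm = sine_wave(freq_hz=120, amplitude=0.2)
agitated = sine_wave(freq_hz=220, amplitude=0.8)

for name, audio in [("calm", calm), ("agitated", agitated)]:
    print(name, round(rms_volume(audio), 2), round(estimate_pitch_hz(audio, 8000)))
```

A system could then compare numbers like these against what is typical for the speaker, and treat a jump in volume or pitch as one hint that the speaker is upset.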

The second is by linguistic analysis. This works by focusing on explicit indications and context within the words that were spoken, rather than on how they sounded. That is, speakers have a higher probability of using specific words or phrases in a particular order, depending on their emotional state. For example, “I was confused” and “cancel my account” are phrases that would be flagged as negative.
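A very simple version of this idea is a phrase lookup, sketched below. The two negative phrases come from the example above. The positive phrases and the function name are assumptions added just to round out the illustration. A real system would typically learn which wordings matter from data instead of relying on a hand-written list.

```python
# Phrases from the example above.
NEGATIVE_PHRASES = ["i was confused", "cancel my account"]
# Hypothetical positive phrases, invented for this sketch.
POSITIVE_PHRASES = ["thank you so much", "that was very helpful"]

def label_utterance(text):
    """Label one transcribed utterance by looking for known phrases."""
    lowered = text.lower()
    if any(phrase in lowered for phrase in NEGATIVE_PHRASES):
        return "negative"
    if any(phrase in lowered for phrase in POSITIVE_PHRASES):
        return "positive"
    return "neutral"

transcript = [
    "Hi, I was confused by my last bill.",
    "I would like to cancel my account.",
    "Oh, that was very helpful, thank you so much!",
    "My account number is one two three.",
]

for line in transcript:
    print(label_utterance(line), "|", line)
```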

ASR systems can do a lot more than identify sentiment, emotion and gender. Everyone’s voice has particular speech characteristics which identify who the person is. Have you ever answered the phone and immediately known who the caller was by the sound of his or her voice? Maybe that person talks super-fast or with a very low register. The same applies here. Gender, age, and dialect are made apparent through one’s voice, and software can use the information in the audio data to precisely identify who the speaker is.

This is related to the process of diarization or speaker separation. Phone calls usually involve two people, so the system has to be able to recognize a change in speakers when producing a transcript, in order to identify what was said by the agent and what was said by the customer. The resulting transcripts are easier to read because each stretch of speech is labeled with the person who said it.
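The sketch below shows only the last step of that process: turning speaker-labeled pieces of a call into a readable transcript. It assumes the hard part, deciding who spoke when, has already been done, and that the result is a list of (speaker, text) pairs. That data layout, the sample call and the function name are all hypothetical.

```python
from itertools import groupby

# Hypothetical diarization output: one (speaker, text) pair per segment.
segments = [
    ("agent", "Thank you for calling."),
    ("agent", "How can I help you today?"),
    ("customer", "I want to cancel my account."),
    ("agent", "I can help with that."),
]

def to_turns(labeled_segments):
    """Merge back-to-back segments from the same speaker into one turn."""
    turns = []
    for speaker, group in groupby(labeled_segments, key=lambda seg: seg[0]):
        turns.append((speaker, " ".join(text for _, text in group)))
    return turns

for speaker, text in to_turns(segments):
    print(f"{speaker.upper()}: {text}")
```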

Finally, let’s talk about some additional categorization terms: metadata and call tagging. Metadata is data that isn’t directly part of the file -- it’s not the audio and it’s not the text transcription of the audio. Metadata is information which describes the file, and gives it context. For example, metadata can include the names of the agent and the caller. Depending on the software a call center is using, some metadata, such as names, may be created automatically whenever customers call in.

Event-level metadata is a specific kind of metadata that applies to individual utterances (which are the “events” in the term). One example is a confidence score, which tells you how confident you can be that the transcript matches what was actually said in the utterance -- a higher percentage meaning that the transcript is more likely to be right.
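As a sketch of how the two kinds of metadata might sit side by side, the example below stores call-level metadata in a dictionary and attaches a confidence score to each utterance. The class names, field names, sample values and the 80% review threshold are all assumptions made for illustration, not the format of any particular ASR product.

```python
from dataclasses import dataclass, field

@dataclass
class Utterance:
    speaker: str
    text: str
    confidence: float  # event-level metadata, from 0.0 to 1.0

@dataclass
class CallRecord:
    metadata: dict  # call-level metadata, such as names
    utterances: list = field(default_factory=list)

def needs_review(call, threshold=0.80):
    """Return the utterances whose transcript is less likely to be right."""
    return [u for u in call.utterances if u.confidence < threshold]

call = CallRecord(
    metadata={"agent_name": "Alex", "caller_name": "Jordan"},
    utterances=[
        Utterance("agent", "Thank you for calling.", 0.97),
        Utterance("customer", "I was confused by my bill.", 0.62),
    ],
)

for utterance in needs_review(call):
    print(f"{utterance.confidence:.0%} {utterance.speaker}: {utterance.text}")
```

Keeping the confidence next to each utterance means a reviewer can jump straight to the parts of a call that are most likely to contain transcription mistakes.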

Call tagging is like tagging a blog post (like this one!). Call tags can be manually applied to calls to allow them to be searched later. Maybe your call center agents are supposed to use a specific greeting when customers call in. You could put a tag on all the calls where the agents used the greeting, and a different one where they didn’t, and then use the tags later to quickly search for all examples. Once a call is tagged, you can just search for that tag, rather than having to go through the transcripts all over again.
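The greeting example could be sketched like this. The tag names, call IDs and the TagIndex class are invented for illustration. The point is only that a tag lookup is a quick search over labels, with no need to read the transcripts again.

```python
from collections import defaultdict

class TagIndex:
    """A tiny in-memory index from tag name to the calls carrying that tag."""

    def __init__(self):
        self._calls_by_tag = defaultdict(set)

    def tag(self, call_id, tag):
        self._calls_by_tag[tag].add(call_id)

    def search(self, tag):
        # Sorted so that results come back in a predictable order.
        return sorted(self._calls_by_tag.get(tag, set()))

index = TagIndex()
index.tag("call-001", "greeting-used")
index.tag("call-002", "greeting-missing")
index.tag("call-003", "greeting-used")

print(index.search("greeting-used"))
print(index.search("greeting-missing"))
```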

That covers the basic terms for ASR. If you want to learn more about the technology underneath all of this -- artificial intelligence -- then continue on to Part 3!


Corinne Hebestreit

Corinne's free time consists of boating on Pittsburgh's rivers in the summer months, and keeping her nose in a book no matter what time of year it is. She also uses her love of coffee as a valid excuse to avoid getting enough sleep.
