Transcription 101

Bill von Hagen December 05, 2018

Like any high-tech field, STT (Speech-To-Text) shares many terms with the English language but gives them its own special twist. Take transcription, for example, which basically means writing something down. Writing something down is not that exciting a topic, until you consider that transcription in an ASR (Automatic Speech Recognition) system means writing down sounds after they have been automatically built up into words. The system processes input sound as it is spoken. Each uninterrupted chain of sounds that can be separately identified and analyzed (an utterance) provides one or more words. The words have individual meanings, but can be more accurately selected and understood when combined in context with the other words that are being built up and transcribed.

Understanding words in context is one of the things that you get from the language model used during speech recognition. A language model is a set of complex data structures that identifies the terms and expressions commonly used within the model's target domain, such as a particular industry, along with the relationships between those terms and the probability of each relationship. The words used in a language model provide the basis of a dictionary for that model, which is the set of terms and phrases most likely to be encountered in the domain that the model addresses. Language models also identify how far apart words or phrases can be and still be related, and so on. The combination of identifying words from the sounds in the voice input being processed, identifying other words that provide a context for those words, and using that context to identify the words themselves is one of the core aspects of the AI (artificial intelligence) that supports Voci products such as V-Blaze.
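To make the idea of choosing words in context a little more concrete, here is a minimal sketch of a toy bigram model in Python. It is only an illustration and not how V-Blaze or any Voci language model is implemented: the tiny corpus, the function names (train_bigrams, pick_word), and the add-one smoothing are all assumptions chosen for the example. Given the previous word, it picks the most probable of several sound-alike candidates.

```python
from collections import Counter, defaultdict

# Tiny, made-up corpus standing in for domain text (an assumption for this sketch).
CORPUS = [
    "i would like to check my account balance",
    "please check my account balance today",
    "i want to write a check",
    "there are two charges on my account",
    "i would like to pay my bill",
    "i want to pay too",
]


def train_bigrams(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.split()  # "<s>" marks the sentence start
        for previous_word, word in zip(words, words[1:]):
            counts[previous_word][word] += 1
    return counts


def pick_word(counts, previous_word, candidates):
    """Return the candidate most likely to follow previous_word.

    Uses add-one smoothing over the candidate set so that unseen
    pairs get a small, non-zero probability.
    """
    following = counts[previous_word]
    total = sum(following.values())

    def probability(word):
        return (following[word] + 1) / (total + len(candidates))

    return max(candidates, key=probability)


counts = train_bigrams(CORPUS)
sound_alikes = ["to", "too", "two"]  # words an acoustic model might confuse

for context in ("like", "are", "pay"):
    print(context, "->", pick_word(counts, context, sound_alikes))
```

With this corpus, the context word "like" selects "to", "are" selects "two", and "pay" selects "too". A production language model works with far more data, longer contexts, and richer statistics, but the principle of letting neighboring words decide between candidates is the same.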

ASR systems like V-Blaze can also look beyond what is being said to how things are being said, deriving even more meaning from how sounds and words are being spoken. The tone and volume of adjacent words, the gender of the speaker, and the type of sentiment that terms express are all part of the equation and are optionally available as part of certain types of transcripts. Complex transcripts such as these contain name/value pairs in the open standard JSON (JavaScript Object Notation) format, which can be read and understood by many statistics and analysis applications. One level up from transcription, tools such as V-Spark read the JSON version of your transcripts and let you search them for common expressions and characteristics, effectively acting as an analytical speech browser across those transcripts. We’ll cover more on speech browsing in an upcoming blog post.
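As a rough illustration of why a JSON transcript is easy for other tools to work with, the following Python sketch parses a small transcript and searches it for a phrase. The field names used here (utterances, start, end, gender, sentiment, words, confidence) and the find_phrase helper are hypothetical and chosen for this example; they are not the actual V-Blaze output schema or a V-Spark API.

```python
import json

# A hypothetical transcript. The structure and field names are assumptions
# for illustration, not the real V-Blaze JSON schema.
SAMPLE = """
{
  "utterances": [
    {"start": 0.0, "end": 1.8, "gender": "female", "sentiment": "neutral",
     "words": [{"word": "hello", "confidence": 0.98},
               {"word": "how", "confidence": 0.95},
               {"word": "can", "confidence": 0.96},
               {"word": "i", "confidence": 0.99},
               {"word": "help", "confidence": 0.97}]},
    {"start": 2.0, "end": 4.5, "gender": "male", "sentiment": "negative",
     "words": [{"word": "i", "confidence": 0.99},
               {"word": "want", "confidence": 0.94},
               {"word": "to", "confidence": 0.97},
               {"word": "cancel", "confidence": 0.91},
               {"word": "my", "confidence": 0.98},
               {"word": "account", "confidence": 0.96}]}
  ]
}
"""


def utterance_text(utterance):
    """Join the recognized words of one utterance into a plain string."""
    return " ".join(item["word"] for item in utterance["words"])


def find_phrase(transcript, phrase):
    """Return (start_time, sentiment, text) for utterances containing phrase."""
    phrase = phrase.lower()
    matches = []
    for utterance in transcript["utterances"]:
        text = utterance_text(utterance)
        if phrase in text.lower():
            matches.append((utterance["start"], utterance["sentiment"], text))
    return matches


transcript = json.loads(SAMPLE)
for start, sentiment, text in find_phrase(transcript, "cancel"):
    print(f"{start:.1f}s [{sentiment}] {text}")
```

Because each utterance carries its metadata alongside its words, the same loop could just as easily filter by sentiment or speaker, which is the kind of query a speech browser builds on.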

With the basic definitions of what and how covered, the next few blogs will venture into additional exploration of the technology that supports ASR systems, and specific examples of how Voci ASR can benefit certain applications and industries. These scenarios include transcribing and providing insights into customer service calls, generating subtitles on audio and video content, and much more. Talk to you soon!

Bill von Hagen is a writer who loves Linux, spicy food, and classic AI and UNIX workstations, though not necessarily in that order.
