
Artificial Intelligence Without a Robot Uprising

The Intern's Guide to Everything STT - Part 3
Corinne Hebestreit January 05, 2019

Artificial intelligence, or AI, is an idea that’s been around for a long, long time. Artificial beings that can think like humans have been talked about since ancient times. Mary Shelley’s Frankenstein, published in 1818, is the most well-known historical example, but the stories go back many centuries earlier, including the ancient Greek myth of living golden statues created by the god Hephaestus. But that’s not exactly what we’re talking about when we talk about the AI involved in speech to text. The type of AI that would produce an intelligent robot we could talk to, usually called strong AI, just doesn’t exist. (Yet.)

The kind of AI we have is more focused, and still pretty amazing. Artificial intelligence can be used to quickly perform tasks that humans find difficult or very time-consuming, such as transcribing speech in real time.

To understand how this technology works, we have to start with machine learning. Despite the name, machine learning isn’t something that physical machines do. Instead, it is something that software algorithms do. “Algorithm” is a mathematical term for a specific, step-by-step way of calculating something. In machine learning, algorithms are given training data -- a sample set of relevant data selected by a human, such as a researcher or an engineer -- and build statistical models based on patterns in that data. These statistical models are then used to make predictions about data that is not in the training set. All this without being explicitly programmed with rules for handling the new data! Machines that learn figure out the logic or pattern within the training data, and use that to make future decisions and solve problems.
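To make that a little more concrete, here is a minimal sketch in Python. The numbers are invented for illustration (they are not real measurements), and fitting a straight line is just one very simple kind of statistical model, chosen here because it fits in a few lines. The names `training_data`, `fit_line`, and `predict` are made up for this example.

```python
# Hypothetical training data: (hours of audio, minutes needed to process it).
# These values are invented purely for illustration.
training_data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0)]

def fit_line(points):
    """Learn a slope and intercept from examples using least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# "Training": the pattern is worked out from the data, not typed in by hand.
slope, intercept = fit_line(training_data)

def predict(x):
    """Use the learned model on a value that was never in the training set."""
    return slope * x + intercept

print(predict(10.0))
```

Nobody told the program that processing takes roughly two minutes per hour of audio. It worked that out from the examples, and then applied it to a new input.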

Deep learning is a special kind of machine learning. The idea here is that the software goes through multiple steps, or layers, before delivering the final output of the process. Each step performs a small part of a more complex task. For example, speech-recognition software might start with a layer that encodes the sound wave (transforming it from audio to numerical data), then another layer that matches the sound wave against stored phonemes, then another layer that matches those phonemes against patterns of words and sentences, all to produce a coherent English sentence as output. And each of these layers probably consists of many other, smaller layers -- you can see why this sort of learning is called “deep”.
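The layered idea can be sketched as a chain of small steps, each one feeding the next. This is only a toy: the lookup tables below are written by hand and invented for the example, whereas the layers in a real deep learning system are learned from training data. The names `encode`, `to_phonemes`, `to_words`, and `recognize` are hypothetical.

```python
# Toy lookup tables, invented for this example. Real systems learn these steps.
PHONEMES = {0: "k", 1: "ae", 2: "t"}
WORDS = {("k", "ae", "t"): "cat"}

def encode(samples):
    """Layer 1: turn raw audio samples into coarse numeric codes."""
    return [round(sample) for sample in samples]

def to_phonemes(codes):
    """Layer 2: match each numeric code against a stored phoneme."""
    return [PHONEMES[code] for code in codes]

def to_words(phonemes):
    """Layer 3: match the phoneme sequence against known words."""
    return WORDS.get(tuple(phonemes), "<unknown>")

def recognize(samples):
    """Pass the input through every layer in order."""
    output = samples
    for layer in (encode, to_phonemes, to_words):
        output = layer(output)
    return output

print(recognize([0.1, 0.9, 2.2]))
```

Each function does one small job, and only the last one produces something a person would recognize as an answer.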

A neural network is a form of AI that is capable of learning. It’s called a neural network because it consists of artificial neurons, which are bits of software capable of taking a numerical value, performing some calculations on it, and producing another numerical value. These are somewhat similar to the neurons in your brain, which is why they’re called that! The “network” part comes from the fact that lots of these neurons work together to process lots of numbers very quickly, and very effectively. A deep neural network is the kind of thing we’re talking about when we talk about automatic speech recognition (ASR) software like Voci’s. This sort of network is made up of many layers between the initial input and the final output. In other words, lots of calculating is being done between the first value that the network takes as input -- such as an encoded piece of recorded speech -- and what the network produces -- such as a text transcription of the speech.
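Here is a minimal sketch of an artificial neuron, and of a tiny network built from a few of them. The weights and biases are arbitrary numbers picked for illustration. In a real network they would be learned from training data, and there would be vastly more of them. The sigmoid used here is one common choice of calculation, assumed for the example.

```python
import math

def neuron(inputs, weights, bias):
    """Take numbers in, do a small calculation, and produce one number out."""
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    # The sigmoid squashes any total into a value between 0 and 1.
    return 1.0 / (1.0 + math.exp(-total))

def layer(inputs, weight_rows, biases):
    """A layer is several neurons all looking at the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

def network(inputs):
    """Two layers: the output of the first becomes the input of the second."""
    # Arbitrary, hand-picked weights. A trained network would learn these.
    hidden = layer(inputs, [[0.5, -0.2], [0.1, 0.8]], [0.0, -0.1])
    output = layer(hidden, [[1.0, -1.0]], [0.0])
    return output[0]

print(network([1.0, 2.0]))
```

A “deep” network is the same idea with many more layers stacked between the input and the output.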

There are a lot of different methods people use when they’re trying to teach “machines” (which are really software), such as neural networks, how to do something. Let’s start with supervised learning. Supervised learning is machine learning directed by humans. The algorithm is given training data in which every example comes with its correct answer, or label. It then tries to figure out what makes the right answers right. The important thing here is that the relevant data is already classified as relevant. The software starts with a clear indication of what is important, and what is not.
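As a rough sketch of supervised learning, the example below gives the software a handful of labeled examples and lets it work out what a typical example of each label looks like. The data and labels are invented, and the “average point per label” approach (a nearest-centroid classifier) is just one simple method, picked here because it is short.

```python
# Hypothetical labeled training data: each example is (features, correct answer).
# The two numbers per example are made-up measurements, not real audio features.
labeled_examples = [
    ((1.0, 1.0), "speech"),
    ((1.2, 0.8), "speech"),
    ((5.0, 5.0), "music"),
    ((4.8, 5.2), "music"),
]

def train(examples):
    """Average the examples for each label into one 'typical' point."""
    grouped = {}
    for features, label in examples:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(sum(column) / len(column) for column in zip(*points))
        for label, points in grouped.items()
    }

def classify(model, features):
    """Pick the label whose typical point is closest to the new example."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(model[label], features))
    return min(model, key=squared_distance)

model = train(labeled_examples)
print(classify(model, (1.1, 0.9)))
print(classify(model, (5.1, 4.9)))
```

The human supplied the answers for the training examples. The software only had to figure out what the examples with the same answer have in common.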

Unsupervised learning is the opposite. In this case, the machine learns from training data that has no labels attached to it. The software is given no direction as to what is the correct or incorrect answer, and has to find structure on its own by identifying commonalities in the training data.
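A small sketch of the unsupervised case: the data below has no labels at all, and the software sorts it into two groups purely by which values sit close together. This is a stripped-down version of a clustering method called k-means, with the number of groups fixed at two. The values and the function name `two_means` are invented for the example, and the code assumes the data contains at least two different values.

```python
# Hypothetical unlabeled data: just numbers, with no answers attached.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]

def two_means(data, rounds=10):
    """Split the data into two groups by repeatedly refining two centers."""
    # Start with the smallest and largest values as the first guesses.
    centers = [min(data), max(data)]
    groups = [[], []]
    for _ in range(rounds):
        groups = [[], []]
        for value in data:
            # Put each value in the group whose center is nearest.
            nearest = 0 if abs(value - centers[0]) <= abs(value - centers[1]) else 1
            groups[nearest].append(value)
        # Move each center to the average of its group (keep it if the group is empty).
        centers = [
            sum(group) / len(group) if group else center
            for group, center in zip(groups, centers)
        ]
    return centers, groups

centers, groups = two_means(values)
print(centers)
print(groups)
```

Notice that the software never learns what the two groups mean. It only discovers that there are two clumps of similar values.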

Semi-supervised learning is a mixture of supervised and unsupervised learning: some of the data used is categorized, and some is not. There are a couple of reasons for using this combination of methods. First, labeling all of the data needed to train algorithms takes more time than anyone can realistically provide. Second, labeling everything gives humans too much say over what information the algorithms treat as relevant, which can lead to missing features of the data that humans didn’t think were important (or didn’t notice at all).
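One simple way to combine the two, sketched below, is sometimes called self-training: build a first model from the few labeled examples, let it guess labels for the unlabeled ones, then rebuild the model using everything. This is only one of several semi-supervised approaches, and all of the data and names here are invented for illustration.

```python
# Hypothetical data: only two examples have labels, the rest have none.
labeled = [(1.0, "quiet"), (9.0, "loud")]
unlabeled = [1.5, 2.0, 8.0, 8.5, 2.5, 7.5]

def centroids(examples):
    """Work out the average value for each label."""
    grouped = {}
    for value, label in examples:
        grouped.setdefault(label, []).append(value)
    return {label: sum(vs) / len(vs) for label, vs in grouped.items()}

def nearest_label(model, value):
    """Guess the label whose average is closest to this value."""
    return min(model, key=lambda label: abs(model[label] - value))

def self_train(labeled_examples, unlabeled_values):
    """Train on the labeled data, guess the rest, then train again on all of it."""
    first_model = centroids(labeled_examples)
    guessed = [(v, nearest_label(first_model, v)) for v in unlabeled_values]
    return centroids(labeled_examples + guessed)

model = self_train(labeled, unlabeled)
print(model)
print(nearest_label(model, 4.0))
```

The human only had to label two examples, and the unlabeled data still got to shape the final model.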

That covers the basic terms for artificial intelligence. If you want to learn even more about the technology, and pick up lots of acronyms to confuse and amaze your friends, read on in Part 4!

This blog is part of a series!



Corinne Hebestreit

Corinne's free time consists of boating on Pittsburgh's rivers in the summer months, and keeping her nose in a book no matter what time of year it is. She also uses her love of coffee as a valid excuse to avoid getting enough sleep.
