Do you recognize this sentence? If so, chances are very good that you are a speech technology researcher. Most likely your specialty is speech synthesis, i.e., text-to-speech (TTS). Or perhaps you toil away in the related field of voice transformation, the art of making a speech synthesizer take on the qualities of a different, particular speaker, given some samples of his or her speech.
The quoted sentence about the Danger Trail and the mysterious Philip Steels is the first sentence from the CMU Arctic Speech databases.
This past weekend, I had the good fortune of traveling to Vienna, Austria to attend the 10th Speech Synthesis Workshop put on by ISCA, the International Speech Communication Association. While there attending SSW10, I had the extra good fortune of receiving an award, sixteen years after the fact, as co-creator of the Arctic speech databases.
Gérard Bailly of the GIPSA-lab in France (left) and John Kominek (right)
In the photo, that’s me to the right. On the left is Gérard Bailly, of the GIPSA-lab Laboratoire de Recherche located in Grenoble, France. Gérard is one of the field’s most prominent researchers and is an ever-present, instrumental part of the community. In 1990, he chaired SSW1, the very first ESCA Workshop on Speech Synthesis, held in Autrans, France. It was a real privilege therefore to stand next to him on stage while receiving the certificate for the “Best Paper Award from SSW5” and shake his hand, as well as that of the lead organizer, Michael Pucher of the Acoustics Research Institute in Austria.
Michael Pucher of the Acoustics Research Institute in Austria
To provide a little background: ISCA (once called ESCA, the European Speech Communication Association) organizes a large annual gathering of researchers from around the world at a conference called InterSpeech (once called EuroSpeech). In addition to the main conference, several special interest groups hold satellite sessions. The speech synthesis special interest group meets every three to four years.
This being the 10th edition of the workshop, the organizers thought it appropriate to look back at all the preceding years, selecting the one paper per occasion judged to have had the most longevity and impact on the field. Thus, in total, nine awards were given out.
By my counting, five papers were awarded on grounds of introducing a new important algorithm. Three papers were awarded on grounds of providing a software toolkit useful to researchers. And one paper – mine – was awarded for creating data useful to the research community.
(For the most influential paper of this year’s workshop, they told me: “come back in 30 years and we’ll let you know.”)
Traveling back in time a little, the 5th SSW workshop was held in 2004 and hosted in Voci’s home city of Pittsburgh. Motto of the event: Pittsburgh, the City with TTS at its Heart. It took me a while to spot the embedding: PI-TTS-BURGH.
The 2004 workshop was chaired by professor Alan W Black of the Language Technologies Institute of Carnegie Mellon University, and, being one of his PhD students at the time, I helped organize the event.
I also had what’s called a “poster”. In the poster, I described the Arctic Databases: their purpose, construction, and characteristics. I recorded the initial set of six Arctic databases the previous year in the School of Engineering's Speech Lab soundproof booth, provided by professor Richard M. Stern of the Electrical Engineering department, using Sennheiser MD431 near-field microphones generously loaned by Susanne (Suzi) Burger of the Interactive Systems Lab, also at CMU.
So, what are these databases anyway? The CMU Arctic Speech Synthesis databases are a set of free single-speaker recordings of 1,132 short sentences, designed to provide a phonetically balanced representation of spoken English in a collection of audio of about one hour in length. The audio was recorded using professional-grade sound boards and pre-amps at 32,000 Hz. To ensure free use of the recordings, the sentences were selected from out-of-copyright books.
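The phrase “phonetically balanced” glosses over real work: the sentences were chosen so that, taken together, they cover the sound units of English well. A common way to approach this kind of text selection is a greedy set-cover over diphones (adjacent phone pairs). The sketch below is purely illustrative, not the actual Arctic selection code; the `diphones()` helper, the toy phone sequences, and the scoring are my own simplification.

```python
def diphones(phones):
    """Return the set of adjacent phone pairs (diphones) in a phone sequence."""
    return {(a, b) for a, b in zip(phones, phones[1:])}

def greedy_select(candidates, max_sentences):
    """Greedily pick sentences that each add the most not-yet-covered diphones.

    candidates: list of (text, phone_sequence) pairs.
    Returns the selected texts in the order chosen.
    """
    covered = set()
    selected = []
    pool = list(candidates)
    for _ in range(max_sentences):
        best = max(pool, key=lambda c: len(diphones(c[1]) - covered), default=None)
        if best is None or not (diphones(best[1]) - covered):
            break  # nothing left adds new coverage
        covered |= diphones(best[1])
        selected.append(best[0])
        pool.remove(best)
    return selected

# Toy example with made-up mini "sentences":
cands = [
    ("a", ["k", "ae", "t"]),
    ("b", ["k", "ae", "b"]),  # overlaps heavily with "a"
    ("c", ["d", "ao", "g"]),  # all-new coverage
]
print(greedy_select(cands, 2))  # → ['a', 'c']
```

In practice the candidate pool is far larger than the target (tens of thousands of Gutenberg sentences whittled down to 1,132), and the selection also filters for length and readability.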
And, giving the sessions a unique twist, the speakers were hooked up to an electroglottograph. Basically, two electrodes were affixed to each side of the speaker’s Adam’s Apple to record the actions of the vocal folds by sending a slight electrical current through the larynx.
We built some example voices out of the audio, put the databases online – still available at http://www.festvox.org/cmu_arctic/ – and invited other researchers to try their hand at them. They quickly became popular. Notably, in 2005, they were inaugurated into the first annual Speech Synthesis “Blizzard Challenge”, an event that is still going strong.
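Each database ships with its prompt list in the simple festvox text format, and a few lines of Python are enough to read it. This is an illustrative sketch: the one-utterance-per-line layout shown in the comment matches my recollection of the distribution, but treat it as an assumption and check your copy.

```python
import re

# Festvox-style prompt files list one utterance per line, e.g.:
#   ( arctic_a0001 "Author of the danger trail, Philip Steels, etc." )
# (This layout is an assumption based on the festvox distribution.)
PROMPT_RE = re.compile(r'^\(\s*(\S+)\s+"(.*)"\s*\)\s*$')

def parse_prompts(lines):
    """Map utterance id -> sentence text for festvox-style prompt lines."""
    prompts = {}
    for line in lines:
        m = PROMPT_RE.match(line)
        if m:
            prompts[m.group(1)] = m.group(2)
    return prompts

sample = ['( arctic_a0001 "Author of the danger trail, Philip Steels, etc." )']
print(parse_prompts(sample)["arctic_a0001"])
# → Author of the danger trail, Philip Steels, etc.
```

With the ids in hand, pairing each prompt with its corresponding waveform file is straightforward.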
Alan Black came up with the idea of publishing the recordings as open source data. Having been a leading researcher in the field for years, he sensed that the field could benefit from independent scientists being able to use and publish against the same raw material. He was spot on. Arctic was the right data at the right time.
In a way we didn’t anticipate at the time of creation, the databases have also proved very popular within the voice transformation community. This has given the data a useful lifespan beyond what we had expected, considering that building high-quality TTS typically requires tens of hours of audio from the same speaker – far more than the databases contain.
And, proving that sometimes an apparent weakness is really a strength, the one-hour size was a design choice that has worked in favor of Arctic’s popularity. Since we could not afford to hire professional voice talent, we had to make do with volunteers drawn from the university, so the data set needed to be recordable in a single day’s effort. This kept the barrier to recording additional speakers low.
I am sometimes asked: why the name “Arctic”? And, who is “Philip Steels” and what is he doing on the “Danger Trail”? People have debated how to even parse the first Arctic sentence, quoted above. Who is the author of what, exactly?
Let me put the answer on the record.
I mentioned that the prompts were selected from out-of-copyright books. To find said books I resorted to Project Gutenberg. By happenstance of what was downloaded, I noticed a strong representation of stories of adventure set in the far North, written by such turn-of-the-century authors as Jack London (The Call of the Wild, 1903; To Build a Fire, 1908), and James Oliver Curwood (The Danger Trail, 1910; Philip Steele of the Royal Northwest Mounted Police, 1911; The Flower of the North, 1912, etc.).
So, James Oliver Curwood is the author of The Danger Trail and Philip Steele. (How Steele got mis-transcribed into Steels, I am not sure.) In a nod to the Yukon theme, I named the databases “Arctic”.
Voci, of course, produces fantastic speech recognition products. We do not sell into the near-neighbor market of speech synthesis. But in their respective ways, success in each depends heavily on the quality and quantity of the data from which the AI models are constructed.
Data for AI in industry differs from data in academia in a few ways. For starters, the quantity is much larger. At Voci we utilize, and have to find ways to manage, hundreds of thousands of hours of training audio. Several people maintain the transcripts along the processing path, which works best when the data is treated as an asset as valuable as software, and placed under source control.
In addition, on many occasions we work with major customers to create customized models. Ensuring that privacy is maintained requires putting in place the proper protocols and security controls, as typified by PCI standards. (You can read some of Jay’s thoughts on security and privacy elsewhere on this blog.)
A final thought before I end this blog. It hardly needs to be stated that recording a collection of audio files is not in itself a major advance worthy of an award. But data-driven science requires data, and Arctic was the right data at the right time. I happened to be at the right place at the right time. Its real value lay in how it enabled a community of other researchers to do great work. More than one graduate student, I am sure, has finished their dissertation using data I helped create.
I don’t mind having that legacy at all.
Reflecting on it, that spirit of enabling others is as true for me today as it was then. As the Chief Technology Officer of Voci, I see my role as that of enabling the great engineers in our company to do their best work. To make that happen, I’m as obsessive about speech data as ever.
Having entertained this interlude, I now return you to the regular writers and editors of this blog, who say: “Gad, your letter came just in time.”
And that is Arctic prompt a0008.