Speech recognition has come a long way from its clunky beginnings, and it's now used to interact with everything from cellphones to cars to computers.
But even faster, even more accurate speech recognition could be on the horizon, thanks to researchers from Carnegie Mellon and Google Brain.
The researchers took a new approach to speech recognition, William Chan, a Ph.D. student at Carnegie Mellon University, told a crowd at the GPU Technology Conference Tuesday.
“We threw away the conventional speech recognition pipeline and replaced it with a simple model,” said Chan.
Most speech recognition applications require a complex, multi-step process to turn speech into text. For example, they must include a pronunciation dictionary (and experts to create it) that defines each sound in each word, according to Chan, who is lead author on a paper describing the research.
Although most speech-recognition applications use deep learning — training their neural networks to understand language — the CMU-Google method takes it a step further by removing the expert from the equation.
“Our model is completely data-driven. It learns from the acoustics (the speech) directly,” Chan said. It learns words associated with the sounds from human-created transcriptions. Once it’s trained on enough transcribed text, it can process sound and translate it to words on its own.
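To give a flavor of how a data-driven model can map acoustics directly to characters, here is a minimal NumPy sketch of a dot-product attention step, the mechanism such end-to-end models use to focus on the relevant acoustic frames while emitting each character. All names, dimensions, and values below are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): 20 encoded acoustic frames, 8-dim states.
T, d = 20, 8
encoder_states = rng.normal(size=(T, d))  # encoded speech frames (illustrative)
decoder_state = rng.normal(size=d)        # current character-decoder state (illustrative)

def attend(states, query):
    """Dot-product attention: weight each acoustic frame by its similarity
    to the decoder query, then form a context vector as a weighted sum."""
    scores = states @ query                  # (T,) similarity per frame
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()                 # weights over frames sum to 1
    context = weights @ states               # (d,) context vector
    return weights, context

weights, context = attend(encoder_states, decoder_state)
```

At each decoding step the context vector summarizes the portion of the audio most relevant to the next character, which is how the model can learn the sound-to-spelling mapping directly from transcribed speech instead of from a hand-built pronunciation dictionary.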
Accuracy Rate Equals the Best
In tests, the CMU-Google tool topped or equaled the accuracy rate of state-of-the-art speech-recognition systems at the time, according to the paper.
Because the CMU-Google tool doesn’t require data-heavy elements, it’s ideal for mobile use, Chan said.
“Our goal is to directly turn acoustics into English characters,” Chan said. “It’s a simple, direct model.”
The other authors on the paper were Navdeep Jaitly, Quoc Le and Oriol Vinyals, all of Google Brain. Google Brain is one of the many deep learning efforts that rely on the power supplied by GPUs.