Speak Like a Native: NVIDIA Parlays Win in Voice Challenge

Team’s text-to-speech AI model synthesizes a speaker’s voice into any of seven languages using a short clip of them talking and the text of what they want to say.
by Rick Merritt

Thanks to their work driving AI forward, Akshit Arora and Rafael Valle could someday speak to their spouses’ families in their native languages.

Arora and Valle, along with colleagues Sungwon Kim and Rohan Badlani, won the LIMMITS ’24 challenge, which asks contestants to recreate a speaker’s voice in real time in English or any of six languages spoken in India, with the appropriate accent. Their novel AI model required only a three-second speech sample.

The NVIDIA team advanced the state of the art in an emerging field of personalized voice interfaces for more than a billion native speakers of Bengali, Chhattisgarhi, Hindi, Kannada, Marathi and Telugu.

Making Voice Interfaces Realistic

The technology for personalized text-to-speech translation is a work in progress. Existing services sometimes fail to accurately reflect the accents of the target language or nuances of the speaker’s voice.

The challenge judged entries by listening for the naturalness of models’ resulting speech and its similarity to the original speaker’s voice.

The latest improvements promise personalized, realistic conversations and experiences that break language barriers. Broadcasters, telcos and universities, as well as e-commerce and online gaming services, are eager to deploy such technology to create multilingual movies, lectures and virtual agents.

“We demonstrated we can do this at a scale not previously seen,” said Arora, who has two uses close to his heart.

Breaking Down Linguistic Barriers

A senior data scientist who supports one of NVIDIA’s biggest customers, Arora speaks Punjabi, while his wife and her family are native Tamil speakers.

It’s a gulf he’s long wanted to bridge for himself and others. “I had classmates who knew their native languages much better than the Hindi and English used in school, so they struggled to understand class material,” he said.

The gulf crosses continents for Valle, a native of Brazil whose wife and family speak Gujarati, a language popular in western India.

“It’s a problem I face every day,” said Valle, an AI researcher with degrees in computer music and machine listening and improvisation. “We’ve tried many products to help us have clearer conversations.”

Badlani, an AI researcher, said living in seven different Indian states, each with its own popular language, inspired him to work in the field.

A Race to the Finish Line

The initiative started nearly two years ago when Arora and Badlani formed the four-person team to work on a very different version of the challenge, which was held in 2023.

Their efforts generated a working code base for the so-called Indic languages. But getting to the win announced in January required a full-on sprint because the 2024 challenge didn’t get on the team’s radar until 15 days before the deadline.

Luckily, Kim, a deep learning researcher in NVIDIA’s Seoul office, had been working for some time on an AI model well suited to the challenge.

A specialist in text-to-speech voice synthesis, Kim was designing a so-called P-Flow model prior to starting his second internship at NVIDIA in 2023. P-Flow models borrow the prompting technique that large language models employ: they take short voice samples as prompts, so they can respond to new inputs without retraining.
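For readers who want a concrete picture of the prompting idea, here is a minimal sketch of how a three-second clip might be cut from a reference recording and paired with the text to be spoken. It is not P-Flow and does not call any NVIDIA API: the 22,050 Hz sample rate, the `SynthesisRequest` type and the `cut_prompt` and `build_request` helpers are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence

SAMPLE_RATE_HZ = 22_050  # assumed sample rate, not taken from the article
PROMPT_SECONDS = 3.0     # the article cites a three-second speech sample


@dataclass
class SynthesisRequest:
    """Hypothetical bundle of inputs for a prompt-conditioned TTS model."""
    prompt_audio: List[float]  # short clip that carries the speaker's identity
    text: str                  # what the speaker wants to say
    language: str              # target language code, e.g. "hi" for Hindi


def cut_prompt(audio: Sequence[float],
               sample_rate: int = SAMPLE_RATE_HZ,
               seconds: float = PROMPT_SECONDS) -> List[float]:
    """Return the first `seconds` of audio to use as the speaker prompt."""
    n_samples = int(sample_rate * seconds)
    if len(audio) < n_samples:
        raise ValueError("reference recording is shorter than the prompt length")
    return list(audio[:n_samples])


def build_request(audio: Sequence[float], text: str, language: str) -> SynthesisRequest:
    """Pair the voice prompt with new text; nothing is retrained here."""
    return SynthesisRequest(cut_prompt(audio), text, language)


# Five seconds of silence stands in for a real recording.
reference = [0.0] * (SAMPLE_RATE_HZ * 5)
request = build_request(reference, "Hello, how are you?", "hi")
print(len(request.prompt_audio) / SAMPLE_RATE_HZ)  # prompt duration in seconds
```

The point of the sketch is that the speaker's identity arrives as an input alongside the text, so a new voice means a new prompt rather than a new round of training.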

“I created the model for English, but we were able to generalize it for any language,” he said.

“We were talking and texting about this model even before he started at NVIDIA,” said Valle, who mentored Kim in two internships before he joined full time in January.

Giving Others a Voice

P-Flow will soon be part of NVIDIA Riva, a framework for building multilingual speech and translation AI software, included in the NVIDIA AI Enterprise software platform.

The new capability will let users deploy the technology inside their data centers, on personal systems or in public or private cloud services. Today, voice translation services typically run on public cloud services.

“I hope our customers are inspired to try this technology,” Arora said. “I enjoy being able to showcase in challenges like this one the work we do every day.”

The contest is part of an initiative to develop open-source datasets and AI models for nine languages most widely spoken in India.

Hear Arora and Badlani share their experiences in a session at GTC next month.

And listen to the results of the team’s model below, starting with a three-second sample of a native Kannada speaker:

 

Here’s a similar-sounding synthesized voice reading the first sentence of this blog in Hindi:

 

And then in English:


Explore generative AI sessions and experiences at NVIDIA GTC, the global conference on AI and accelerated computing, running March 18-21 in San Jose, Calif., and online.