Riva Delivers Conversational AI on GPUs

Editor’s note: The name of the NVIDIA Jarvis conversational AI framework was changed to NVIDIA Riva in July 2021. All references to the name have been updated in this blog.

When startup Kensho was acquired by S&P Global for $550 million in March 2018, Georg Kucsko felt like a kid in a candy store.

The head of AI research at Kensho and his team had one of Willy Wonka’s golden tickets dropped into their laps: S&P’s 100,000 hours of recorded and painstakingly transcribed audio files.

The dataset helped Kensho build Scribe, considered the most accurate voice recognition software in the finance industry. It transcribes earnings calls and other business meetings fast and at low cost, helping extend S&P’s coverage by 1,500 companies and earning kudos from the company’s CEO in his own quarterly calls.

“We used these transcripts to train speech-recognition models that could do the work faster — that was a new angle no one had thought of. It allowed us to drastically improve the process,” said Kucsko.

It’s one example among many of the power of conversational AI.

What the Buzz Is All About

There are lots of reasons why conversational AI is the talk of the town.

It can turn speech into text that’s searchable. It morphs text into speech you can listen to hands-free while working or driving.

As it gets smarter, it’s understanding more of what it hears and reads, making it even more useful. That’s why the word is spreading fast.

Conversational AI is perhaps best known as the language of Siri and Alexa, but high-profile virtual assistants share the stage with a growing chorus of agents.

Businesses are using the technology to manage contracts. Doctors use it to take notes during patient exams. And a long list of companies is tapping it to improve customer support.

Covering the Waterfront of Words

“There is a huge surface area of conversations between buyers and sellers that we can and should help people navigate,” said Gabor Angeli, an expert in conversational AI at Square Inc., who described his company’s work in a session at GTC Digital.

Deloitte uses conversational AI in its dTrax software that helps companies manage complex contracts. For instance, dTrax can find and update key passages in lengthy agreements when regulations change or when companies are planning a big acquisition. The software, which runs on NVIDIA GPUs, won a smart-business award from the Financial Times in 2019.

China’s largest insurer, Ping An, already uses conversational AI to sell insurance. It’s a performance-hungry application that runs on GPUs because it requires a lot of intelligence to gauge a speaker’s mood and emotion.

In healthcare, Nuance provides conversational AI software, trained with NVIDIA GPUs and software, that most radiologists use to make transcriptions and many other doctors use to document patient exams.

Voca.ai deploys AI models on NVIDIA GPUs because they cut inference latency in half compared with CPUs. That’s key for its service, which automates responses to customer support calls from as many as 10 million people a month for one of its largest users.

Framing the Automated Conversation

The technology is built on a broad software foundation of conversational AI libraries, all accelerated by GPUs. The most popular ones collect lots of “stars” on GitHub, the equivalent of “likes” on Facebook or bookmarks in a browser. They include:

Huggingface, 26.1k stars
Fast.ai, 17.8k stars
spaCy, 16.3k stars
Kaldi, 8.7k stars
DeepPavlov, 4.2k stars
ESPnet, 2.2k stars

To make it easier to get started in conversational AI, NVIDIA provides a growing set of software tools, too.

Kensho and Voca.ai already use NVIDIA NeMo to build state-of-the-art conversational AI algorithms. These machine- and deep-learning models can be fine-tuned on any company’s data to deliver the best accuracy for its particular use case.

When NVIDIA announced NeMo last fall, it also released Jasper, a 54-layer model for automatic speech recognition that can lower word error rates to less than 3 percent. It’s one of several models optimized for accuracy, available from NGC, NVIDIA’s catalog for GPU-accelerated software.
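For readers unfamiliar with the metric, word error rate is commonly defined as the number of word substitutions, deletions and insertions needed to turn a transcript into the reference text, divided by the number of words in the reference. The sketch below assumes that standard definition and simple whitespace tokenization. The function name and sample sentences are illustrative, and this is not the evaluation code behind the Jasper figure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumes whitespace tokenization and exact, case-sensitive word matches;
    real evaluations usually normalize text first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word and one dropped word against a six-word reference.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

By this measure, a word error rate under 3 percent means fewer than three word-level mistakes per 100 reference words.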

Say Hello to Riva, the Valet of Conversational AI

Today we’re rolling out NVIDIA Riva, an application framework for building and deploying AI services that fuse vision, speech and language understanding. The services can be deployed in the cloud, in the data center or at the edge.

Riva includes deep-learning models for building GPU-accelerated conversational AI applications capable of understanding terminology unique to each company and its customers. It includes NeMo to train these models on specific domains and customer data. The models can take advantage of TensorRT to minimize latency and maximize throughput in AI inference tasks.

Riva services can run in 150 milliseconds on an A100 GPU. That’s well under the 300-millisecond threshold for real-time applications, and far below the 25 seconds it would take to run the same models on a CPU.
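To put those figures side by side, here is a small arithmetic sketch. The three latency values come from the paragraph above, while the constant and function names are illustrative and not part of any Riva API.

```python
# Latency figures quoted in the text, in milliseconds.
GPU_LATENCY_MS = 150.0       # Riva services on an A100 GPU
REALTIME_BUDGET_MS = 300.0   # threshold cited for real-time applications
CPU_LATENCY_MS = 25_000.0    # same models on a CPU (25 seconds)


def speedup(slow_ms: float, fast_ms: float) -> float:
    """How many times faster the fast path is than the slow path."""
    return slow_ms / fast_ms


def within_budget(latency_ms: float, budget_ms: float) -> bool:
    """True when a response fits inside the latency budget."""
    return latency_ms <= budget_ms


if __name__ == "__main__":
    print(f"GPU vs. CPU speedup: {speedup(CPU_LATENCY_MS, GPU_LATENCY_MS):.1f}x")
    print("GPU meets real-time budget:", within_budget(GPU_LATENCY_MS, REALTIME_BUDGET_MS))
    print("CPU meets real-time budget:", within_budget(CPU_LATENCY_MS, REALTIME_BUDGET_MS))
```

On these numbers the GPU path is roughly 167 times faster than the CPU path and uses half of the real-time budget, while the CPU path misses the budget by a wide margin.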

Riva Is Ready to Serve Today

Kensho is already testing some of the tools in Riva.

“We are using NeMo a lot, and we like it quite a lot,” said Kucsko. “Insights from NVIDIA, even using different datasets for training at scale, made crucial insights for us.”

For Kensho, using such tools is a natural next step in tuning the AI models inside Scribe. When Kensho was developing the original software, NVIDIA helped train those models on one of its DGX SuperPOD systems.

“We had the data and they had the GPUs, and that led to an amazing partnership with our two labs collaborating,” Kucsko said.

“NVIDIA GPUs are indispensable for deep learning work like that. For anything large scale in deep learning, there’s pretty much not another option,” he added.