Microsoft Teams helps students and professionals worldwide follow along to online meetings with AI-generated live captions and real-time transcription — features that are getting a boost from NVIDIA AI computing technologies for training and NVIDIA Triton Inference Server for inference of speech recognition models.
Teams enables communication and collaboration for nearly 250 million monthly active users worldwide. Teams conversations are captioned and transcribed in 28 languages using Microsoft Azure Cognitive Services, a process that will soon run crucial compute-intensive neural network inference on NVIDIA GPUs.
The live captions feature helps attendees follow the conversation in real time, while transcription provides an easy way to later revisit good ideas or catch up on missed meetings.
Real-time captioning can be especially useful for attendees who are deaf or hard of hearing, or who are non-native speakers of the language used in a meeting.
Teams uses Cognitive Services, which optimizes its speech recognition models with NVIDIA Triton open-source inference serving software.
Triton enables Cognitive Services to support advanced language models, delivering highly accurate, personalized speech-to-text results in real time with very low latency. Adopting Triton also ensures that the NVIDIA GPUs running these speech-to-text models are used to their full potential, reducing cost by giving customers higher throughput with fewer computational resources.
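The throughput gain can be illustrated with a toy cost model (the numbers below are purely illustrative, not measurements of Teams or Triton): if each inference call carries a fixed launch overhead, grouping requests into batches amortizes that overhead across the batch.

```python
# Toy model: fixed per-inference overhead amortized across a batch.
# All numbers are illustrative placeholders, not real measurements.

def time_for_requests(n_requests, batch_size, overhead_ms=5.0, per_sample_ms=1.0):
    """Total time (ms) to serve n_requests grouped into batches."""
    n_batches = -(-n_requests // batch_size)  # ceiling division
    return n_batches * (overhead_ms + batch_size * per_sample_ms)

unbatched = time_for_requests(64, batch_size=1)  # 64 batches * (5 + 1) ms
batched = time_for_requests(64, batch_size=8)    # 8 batches * (5 + 8) ms
print(unbatched, batched)  # 384.0 104.0
```

In this sketch, batching eight requests at a time serves the same 64 requests in roughly a quarter of the time, which is the intuition behind serving more users on fewer GPUs.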
The underlying speech recognition technology is available as an API in Cognitive Services. Developers can use it to customize and run their own applications for customer service call transcription, smart home controls or AI assistants for first responders.
AI That Hangs Onto Every Word
Teams’ transcriptions and captions, generated by Cognitive Services, convert speech to text as well as identify the speaker of each statement. The model recognizes jargon, names and other meeting context to improve caption accuracy.
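As a rough sketch of what speaker-attributed output looks like downstream (the data shapes here are hypothetical, not the actual Cognitive Services schema), recognized segments tagged with a speaker can be folded into a readable transcript:

```python
# Hypothetical (speaker, text) segments, as a diarizing recognizer
# might emit them; this is NOT the real Cognitive Services format.
segments = [
    ("Speaker 1", "Let's review the launch checklist."),
    ("Speaker 1", "First item is the rollout date."),
    ("Speaker 2", "We moved it to Thursday."),
]

def to_transcript(segments):
    """Merge consecutive segments from the same speaker into one line."""
    lines = []
    for speaker, text in segments:
        if lines and lines[-1][0] == speaker:
            lines[-1] = (speaker, lines[-1][1] + " " + text)
        else:
            lines.append((speaker, text))
    return "\n".join(f"{s}: {t}" for s, t in lines)

print(to_transcript(segments))
```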
“AI models like these are incredibly complex, requiring tens of millions of neural network parameters to deliver accurate results across dozens of different languages,” said Shalendra Chhabra, principal PM manager for Teams Calling and Meetings and Devices at Microsoft. “But the bigger a model is, the harder it is to run cost-effectively in real time.”
Using NVIDIA GPUs and Triton software helps Microsoft achieve high accuracy with powerful neural networks without sacrificing low latency: the speech-to-text conversion still streams in real time.
And when transcription is enabled, individuals can easily catch up on missed material after a meeting has concluded.
Trifecta of Triton Features Drives Efficiency
NVIDIA Triton helps streamline AI model deployment and unlock high-performance inference. Users can even develop custom backends tailored to their applications. Some of the software’s key capabilities that enable the Microsoft Teams captions and transcription features to scale to a larger number of meetings and users include:
- Streaming inference: NVIDIA and Azure Cognitive Services worked together to customize the speech-to-text application with a novel stateful streaming inference feature that keeps track of prior speech context, improving caption accuracy in this latency-sensitive application.
- Dynamic batching: Batch size is the number of input samples a neural network processes simultaneously. With dynamic batching in Triton, single inference requests are automatically combined to form a batch, better using GPU resources without impacting model latency.
- Concurrent model execution: Real-time captions and transcriptions require running multiple deep learning models at once. Triton enables developers to do this concurrently on a single GPU, even with models that use different deep learning frameworks.
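In Triton, capabilities like these are enabled declaratively in a model's `config.pbtxt`. The fragment below is an illustrative sketch only; the model name, platform, batch sizes and instance counts are made-up placeholders, not Microsoft's production configuration.

```protobuf
# Illustrative Triton model configuration -- all values are placeholders.
name: "speech_to_text"
platform: "onnxruntime_onnx"
max_batch_size: 8

# Dynamic batching: combine individual requests into a batch, waiting at
# most 100 microseconds so model latency is not noticeably impacted.
# (A stateful streaming model would use Triton's sequence_batching
# scheduler instead; the two schedulers are mutually exclusive.)
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}

# Concurrent model execution: run two copies of this model on GPU 0,
# alongside any other models the server is hosting.
instance_group [
  { count: 2, kind: KIND_GPU, gpus: [ 0 ] }
]
```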