Finding NeMo: Sensory Taps NVIDIA AI for Voice and Vision Applications

Company uses hybrid models to improve things like wake words, speech-to-text and security on devices.
by Scott Martin

You may not know of Todd Mozer, but it’s likely you have experienced his company: It has enabled voice and vision AI for billions of consumer electronics devices worldwide.

Sensory, founded in Silicon Valley in 1994, is a pioneer of the compact AI models used in mobile devices from the industry's giants. Today Sensory brings interactivity to all kinds of voice-enabled electronics. LG and Samsung have used Sensory not just in their mobile phones but also in refrigerators, remote controls and wearables.

“What if I want my talking microwave to get me any recipe on the internet, to walk me through the recipe? That’s where the hybrid computing approach can come in,” said Mozer, CEO and founder.

Hybrid computing is the combined use of on-device (or on-premises) processing and cloud computing resources.
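The hybrid split described above can be sketched in a few lines: a compact local model handles the always-on wake word, and only after it fires does the request escalate to a larger cloud model. This is a minimal illustration, not Sensory's implementation; every function name and the stand-in logic are hypothetical.

```python
from typing import Optional

def on_device_wake_word(audio: bytes) -> bool:
    """Tiny local model: cheap, private, always available.
    (Stand-in check; a real system runs a compact neural net here.)"""
    return audio.startswith(b"WAKE")

def cloud_transcribe(audio: bytes) -> str:
    """Large cloud model: more accurate, needs connectivity.
    (Stand-in for a network call to a hosted speech-to-text service.)"""
    return "get me a recipe"

def handle_audio(audio: bytes) -> Optional[str]:
    if not on_device_wake_word(audio):
        return None  # nothing detected; audio never leaves the device
    return cloud_transcribe(audio)  # escalate only after local detection
```

The design point is that the cheap, private check runs locally on every frame, while the expensive, higher-accuracy model is invoked only when needed.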

The company’s latest efforts rely on NVIDIA NeMo — a toolkit to build state-of-the-art conversational AI models — and Triton Inference Server for its Sensory Cloud hybrid computing unit.

Making Electronic Devices Smarter

Devices are getting ever more powerful. While special-purpose inference accelerators are hitting the market, better models tend to be bigger and require even more memory, so edge-based processing is not always the best solution.

A cloud connection can boost the performance of these compact on-device models, and over-the-air updates can reach wearable devices, mobile phones, cars and much more, said Mozer.

“Having a cloud connection offers updates for smaller, more accurate on-device models,” he said.

This pays off across many device features. Sensory offers its customers speech-to-text, text-to-speech, wake word verification, natural language understanding, facial ID recognition, and speaker and sound identification.

Sensory is also working with NVIDIA Jetson edge AI modules to bring the power of its Sensory Cloud to the larger on-device implementations.

Tapping Triton for Inference

The company’s Sensory Cloud runs voice and vision models with NVIDIA Triton. Sensory’s custom cloud model management infrastructure built around Triton allows different customers to run different model versions, deploy custom models, enable automatic updates, and monitor usage and errors.
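Per-customer model versions of the kind described above map naturally onto Triton's documented model repository layout, where each model directory holds numbered version subdirectories that the server can load side by side. The model names below are hypothetical; the directory convention is Triton's own.

```
model_repository/
├── wakeword/
│   ├── config.pbtxt
│   ├── 1/
│   │   └── model.onnx
│   └── 2/                 # newer version; Triton serves versions per its version policy
│       └── model.onnx
└── speech_to_text/
    ├── config.pbtxt
    └── 1/
        └── model.plan
```

Rolling out an update then amounts to adding a new numbered directory, which supports the automatic-update and per-customer-version behavior the article describes.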

It’s deployable as a container by Sensory customers for on-premises or cloud-based implementations. It can also be used entirely privately, with no data going to Sensory.

Triton gives Sensory a special-purpose machine learning task library that handles all Triton communications and enables rapid deployment of new models with minimal coding. It also enables an asynchronous actor pipeline that makes new pipelines easy to assemble and scale. Triton's dynamic batching raises GPU throughput, and its performance analysis aids inference optimization.
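Dynamic batching is enabled per model in Triton's `config.pbtxt`, where the server is told it may group individual requests into larger batches before running inference. The snippet below follows Triton's documented configuration syntax; the model name and the specific values are illustrative, not Sensory's settings.

```
name: "speech_to_text"
platform: "onnxruntime_onnx"
max_batch_size: 16
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```

The `max_queue_delay_microseconds` field bounds how long a request may wait for batch-mates, trading a little latency for higher GPU throughput.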

Sensory is a member of NVIDIA Inception, a global program designed to support cutting-edge startups.

Enlisting NeMo for Hybrid Cloud Models  

Sensory has expanded on NVIDIA NeMo to deliver improvements in accuracy and functionality for all of its cloud technologies.

NeMo-enhanced functions include its proprietary feature extractor, audio streaming optimizations, customizable vocabularies, multilingual models and much more.

The company's NeMo models now support 17 languages. And with proprietary Sensory improvements, their word error rates consistently beat the best speech-to-text systems, according to the company.
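Word error rate, the metric cited above, is the standard accuracy measure for speech-to-text: the word-level edit distance (substitutions, insertions, deletions) between a reference transcript and the system's output, divided by the number of reference words. A self-contained version for illustration:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("walk me through the recipe", "walk me thru the recipe"))  # 0.2
```

One substituted word out of five reference words gives a WER of 0.2, i.e. 20 percent.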

“Sensory is bringing about enhanced features and functionality with NVIDIA hardware and with Triton and NeMo software,” said Mozer. “This type of hybrid-cloud setup offers customers new AI-driven capabilities.”


Image credit: Sensory