Triton Inference Server Accelerates Microsoft Translator

When your software can evoke tears of joy, you spread the cheer.

So, Translator, a Microsoft Azure Cognitive Service, is applying some of the world’s largest AI models to help more people communicate.

“There are so many cool stories,” said Vishal Chowdhary, development manager for Translator.

Like the five-day sprint to add Haitian Creole to power apps that helped aid workers after Haiti suffered a 7.0 earthquake in 2010. Or the grandparents who choked up in their first session using the software to speak live with remote grandkids who spoke a language they did not understand.

An Ambitious Goal

“Our vision is to eliminate barriers in all languages and modalities with this same API that’s already being used by thousands of developers,” said Chowdhary.

With some 7,000 languages spoken worldwide, it’s an ambitious goal.

So, the team turned to a powerful, and complex, tool — a mixture of experts (MoE) AI approach.

It’s a state-of-the-art member of the class of transformer models driving rapid advances in natural language processing. And with 5 billion parameters, it’s 80x larger than the biggest model the team has in production for natural-language processing.
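For readers new to the idea, here is a minimal sketch of mixture-of-experts routing. The article does not describe how Translator's model routes its inputs, so everything in it is an assumption for illustration: the top-1 routing rule, the names `softmax` and `moe_forward`, and the toy experts. The point it shows is that a small gate scores every expert, but only the winning expert runs for a given input.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw gate scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(
    x: Vector,
    gate_weights: Sequence[Sequence[float]],
    experts: Sequence[Callable[[Vector], Vector]],
) -> Tuple[int, Vector]:
    """Hypothetical top-1 MoE layer: route x to the single best-scoring expert.

    gate_weights holds one row per expert; the gate score is a dot product.
    Returns the chosen expert index and its output scaled by the gate probability.
    """
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(scores)
    best = max(range(len(experts)), key=lambda i: probs[i])
    expert_out = experts[best](x)  # only the selected expert is evaluated
    return best, [probs[best] * v for v in expert_out]


# Two toy experts standing in for large feed-forward blocks.
toy_experts = [
    lambda v: [2.0 * a for a in v],
    lambda v: [-a for a in v],
]
toy_gate = [[2.0, 0.0], [0.0, 2.0]]

chosen, output = moe_forward([1.0, 0.0], toy_gate, toy_experts)
```

Because only one expert runs per input, the parameter count can grow much faster than the compute spent on each sentence, though the irregular routing is part of what makes such models hard to serve efficiently.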

MoE models are so compute-intensive, it’s hard to find anyone who’s put them into production. In an initial test, CPU-based servers couldn’t meet the team’s requirement of translating a document in one second.

A 27x Speedup

Then the team ran the test on accelerated systems with NVIDIA Triton Inference Server, part of the NVIDIA AI Enterprise 2.0 platform announced this week at GTC.

“Using NVIDIA GPUs and Triton we could do it, and do it efficiently,” said Chowdhary.

In fact, the team was able to achieve up to a 27x speedup over non-optimized GPU runtimes.

“We were able to build one model to perform different language understanding tasks — like summarizing, text generation and translation — instead of having to develop separate models for each task,” said Hanny Hassan Awadalla, a principal researcher at Microsoft who supervised the tests.

How Triton Helped

Microsoft’s models break down a big job like translating a stack of documents into many small tasks of translating hundreds of sentences. Triton’s dynamic batching feature pools these many requests to make best use of a GPU’s muscle.
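The pooling idea can be sketched in a few lines. This is a simplified illustration, not Triton's scheduler: the `Request` type, the `dynamic_batches` function and its parameters are hypothetical names, and the closing rule (a batch closes when it is full or when its oldest request has waited too long) is an assumption chosen to keep the example small.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    arrival_us: int  # hypothetical arrival time, in microseconds
    sentence: str    # one small translation task


def dynamic_batches(
    requests: List[Request],
    max_batch_size: int,
    max_queue_delay_us: int,
) -> List[List[Request]]:
    """Group requests into batches in arrival order.

    A batch is closed when it already holds max_batch_size requests, or when
    the next request arrives more than max_queue_delay_us after the oldest
    request in the open batch.
    """
    batches: List[List[Request]] = []
    current: List[Request] = []
    for req in sorted(requests, key=lambda r: r.arrival_us):
        too_full = len(current) >= max_batch_size
        too_old = bool(current) and (
            req.arrival_us - current[0].arrival_us > max_queue_delay_us
        )
        if current and (too_full or too_old):
            batches.append(current)
            current = []
        current.append(req)
    if current:
        batches.append(current)
    return batches


incoming = [
    Request(0, "Bonjour"),
    Request(10, "Merci"),
    Request(20, "Bonsoir"),
    Request(500, "Salut"),
    Request(510, "Au revoir"),
]
# Three sentences arrive close together and share one GPU pass;
# the two late arrivals form a second batch.
batch_sizes = [len(b) for b in dynamic_batches(incoming, 4, 100)]
```

The trade-off is between throughput and latency: a longer queue delay fills batches more fully, while a shorter one returns individual sentences sooner.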

The team praised Triton’s ability to run any model in any mode using CPUs, GPUs or other accelerators.

“It seems very well thought out with all the features I wanted for my scenario, like something I would have developed for myself,” said Chowdhary, whose team has been developing large-scale distributed systems for more than a decade.

Under the hood, two software components were key to Triton’s success. NVIDIA extended FasterTransformer — a software layer that handles inference computations — to support MoE models. CUTLASS, an NVIDIA math library, helped implement the models efficiently.

Proven Prototype in Four Weeks

Though the tests were complex, the team worked with NVIDIA engineers to get an end-to-end prototype with Triton up and running in less than a month.

“That’s a really impressive timeline to make a shippable product — I really appreciate that,” said Awadalla.

And though it was the team’s first experience with Triton, “we used it to ship the MoE models by rearchitecting our runtime environment without a lot of effort, and now I hope it becomes part of our long-term host system,” Chowdhary added.

Taking the Next Steps

The accelerated service will arrive in judicious steps, initially for document translation in a few major languages.

“Eventually, we want our customers to get the goodness of these new models transparently in all our scenarios,” said Chowdhary.

The work is part of a broad initiative at Microsoft. It aims to fuel advances across a wide sweep of its products such as Office and Teams, as well as those of its developers and customers from small one-app companies to Fortune 500 enterprises.

Paving the way, Awadalla’s team published research in September on training MoE models with up to 200 billion parameters on NVIDIA A100 Tensor Core GPUs. Since then, the team has accelerated that work another 8x by using 80GB versions of the A100 GPUs on models with more than 300 billion parameters.

“The models will need to get larger and larger to better represent more languages, especially for ones where we don’t have a lot of data,” Awadalla said.