Accelerating AI Inference Performance in the Data Center and Beyond

As deep learning continues to evolve rapidly, so do the kinds of challenges businesses seek to solve with it.
by Dave Salvator

Inference is the technology that puts sophisticated neural networks — trained on powerful GPUs — into use solving problems for everyday users.

Most inference work has been focused on “after hours” large-batch, high-throughput work done on large numbers of CPU servers. But that’s changing, fast.

Going forward, the trends in inference are toward sophisticated real-time services based on more complex models for speech recognition, natural language processing and translation — areas where low latency is critical.

The NVIDIA Tesla platform can deliver significant performance upsides — for both training and inference, the two critical, compute-intensive operations at the core of deep learning — as well as massive acquisition and energy cost savings.

As deep learning spreads, CPU-only servers can't scale to meet this demand. For deep learning inference, a single GPU-equipped server can replace as many as 160 CPU-only servers, delivering higher inference throughput at low latency not only to meet the demands of this trend, but to accelerate it.

Introducing TensorRT 3

Inference performance is about more than just speed. In fact, there are four key aspects that must be considered to get a complete inference performance picture: throughput, energy efficiency, latency and accuracy.

To maximize the inference performance and efficiency of NVIDIA deep learning platforms, we’re now offering TensorRT 3, the world’s first programmable inference accelerator. It compresses, optimizes and deploys a trained neural network as a runtime to deliver accurate, low-latency inference, without the overhead of a framework.

TensorRT features include:

Weight & Activation Precision Calibration: Significantly improves inference performance of models trained in FP32 full precision by quantizing them to INT8, while minimizing accuracy loss

Layer & Tensor Fusion: Improves GPU utilization and optimizes memory storage and bandwidth by combining successive nodes into a single node, for single kernel execution

Kernel Auto-Tuning: Optimizes execution time by choosing the best data layout and best parallel algorithms for the target platform: Jetson, Tesla or DRIVE PX GPU

Dynamic Tensor Memory: Reduces memory footprint and improves memory re-use by allocating memory for each tensor only for the duration of its usage

Multi-Stream Execution: Scales to multiple input streams by processing them in parallel using the same model and weights
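The fusion idea above can be illustrated with a toy sketch. This is not TensorRT's internal implementation, just a minimal example of the principle: three successive elementwise nodes (a scale, a bias-add and a ReLU) collapse into a single fused node, so the intermediate tensors never have to be written to and read back from memory.

```python
# Toy illustration of layer fusion (not TensorRT internals): three
# successive elementwise nodes -- scale, bias-add, ReLU -- are collapsed
# into one fused node, so intermediate results never touch memory.

def make_fused_scale_bias_relu(scale, bias):
    """Return one function equivalent to relu(x * scale + bias)."""
    def fused(x):
        # One pass over the data instead of three separate kernels.
        return [max(v * scale + bias, 0.0) for v in x]
    return fused

# Unfused reference: each "kernel" reads and writes a full tensor.
def scale_op(x, s):
    return [v * s for v in x]

def bias_op(x, b):
    return [v + b for v in x]

def relu_op(x):
    return [max(v, 0.0) for v in x]

x = [-1.0, 0.5, 2.0]
fused = make_fused_scale_bias_relu(2.0, -1.0)
# The fused node computes the same result as the three-kernel pipeline.
assert fused(x) == relu_op(bias_op(scale_op(x, 2.0), -1.0))  # [0.0, 0.0, 3.0]
```

On a GPU, each unfused node would launch its own kernel and round-trip its output through device memory; the fused version does the same arithmetic in a single kernel, which is where the utilization and bandwidth savings come from.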

Throughput/Latency/Accuracy: Low latency is critical to delivering real-time inference-based services. A CPU-only server cannot deliver acceptable inference throughput within a 7ms latency budget; in this comparison, its latency is 14ms. Tesla GPUs deliver massive inference speedups, up to 40x, while staying within that 7ms budget. In addition, TensorRT 3 offers optimized precision to deliver inference at INT8 and FP16 with near-zero accuracy loss.
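The idea behind INT8 inference can be sketched in a few lines. The following is a simplified symmetric-quantization toy, not TensorRT's actual entropy-based calibration: a scale is chosen from representative calibration data, and FP32 values are mapped onto the signed 8-bit range so that arithmetic can run in narrow integers.

```python
# Simplified symmetric INT8 quantization (a toy sketch, not TensorRT's
# entropy-based calibration): choose a scale from calibration data so
# FP32 values map onto the signed 8-bit range [-127, 127].

def calibrate_scale(samples):
    """Pick the quantization scale from representative activations."""
    return max(abs(v) for v in samples) / 127.0

def quantize(values, scale):
    """FP32 -> INT8: round to the nearest step and clamp to range."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize(ints, scale):
    """INT8 -> approximate FP32."""
    return [q * scale for q in ints]

activations = [0.02, -1.4, 0.8, 3.0, -2.9]
scale = calibrate_scale(activations)   # 3.0 / 127
q = quantize(activations, scale)
approx = dequantize(q, scale)
# Quantization error is bounded by half a quantization step.
assert all(abs(a - b) <= scale / 2 + 1e-12
           for a, b in zip(activations, approx))
```

Real calibration is subtler: clipping the range below the maximum absolute value (as TensorRT's calibrator does, guided by calibration data) can trade a little clipping error for much finer resolution, which is how near-zero accuracy loss is achieved in practice.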

For speech-based usages, researchers have recently described a higher latency threshold, around 200ms, as acceptable. One example workload is OpenNMT, a recurrent neural network (RNN) for translation, in this case from English to German.

Speech Throughput and Latency: NVIDIA inference platforms deliver up to 150x more throughput with less than half the latency versus CPU-only servers, and bring all this well under the targeted latency threshold of 200ms.

A platform built for deep learning must have three distinct qualities. It must have a processor custom-built for deep learning. It must be software programmable. And industry frameworks must be optimized for it, powered by a developer ecosystem that is accessible and adopted around the world.

The NVIDIA deep learning platform is designed around these three qualities and is the only end-to-end deep learning platform. From training to inference. From data center to the network’s edge.

Learn more about NVIDIA inference platforms.

Get started with TensorRT 3 today.

Performance comparison based on ResNet-50 ingested from a TensorFlow-trained network using the ImageNet dataset. NVIDIA Tesla V100 GPU running TensorRT 3 RC vs. Intel Xeon-D 1587 Broadwell-E CPU and Intel DL SDK. Score doubled to account for Intel's stated claim of a 2x performance improvement on Skylake with AVX512.