Accelerating AI Inference Performance in the Data Center and Beyond
September 25, 2017
Inference is the technology that puts sophisticated neural networks — trained on powerful GPUs — into use solving problems for everyday users.
To date, most inference has been “after hours” large-batch, high-throughput work running on large numbers of CPU servers. But that’s changing, fast.
Going forward, the trends in inference are toward sophisticated real-time services based on more complex models for speech recognition, natural language processing and translation — areas where low latency is critical.
The NVIDIA Tesla platform can deliver significant performance upsides — for both training and inference, the two critical, compute-intensive operations at the core of deep learning — as well as massive acquisition and energy cost savings.
As deep learning spreads, CPU-only servers can’t scale to meet the demand. A single GPU-equipped server can replace 160 CPU-only servers, delivering higher inference throughput at the low latencies these services require, not only meeting the demands of this trend but accelerating it.
Introducing TensorRT 3
Inference performance is about more than just speed. In fact, there are four key aspects that must be considered to get a complete inference performance picture: throughput, energy efficiency, latency and accuracy.
To maximize the inference performance and efficiency of NVIDIA deep learning platforms, we’re now offering TensorRT 3, the world’s first programmable inference accelerator. It compresses, optimizes and deploys a trained neural network as a runtime to deliver accurate, low-latency inference, without the overhead of a framework.
TensorRT features include:
Weight & Activation Precision Calibration: Significantly improves inference performance of models trained in FP32 full precision by quantizing them to INT8, while minimizing accuracy loss
Layer & Tensor Fusion: Improves GPU utilization and optimizes memory storage and bandwidth by combining successive nodes into a single node, for single kernel execution
Dynamic Tensor Memory: Reduces memory footprint and improves memory reuse by allocating memory for each tensor only for the duration of its usage
Multi-Stream Execution: Scales to multiple input streams by processing them in parallel using the same model and weights
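Concretely, these features are exercised through TensorRT’s builder when a trained model is imported and compiled into a runtime engine. The following is a minimal C++ sketch, assuming the TensorRT 3-era API and its UFF parser for TensorFlow models; the file name (resnet50.uff), tensor names, and dimensions are hypothetical placeholders, and the INT8 calibrator is only indicated in a comment.

```cpp
#include <iostream>
#include "NvInfer.h"
#include "NvUffParser.h"

using namespace nvinfer1;
using namespace nvuffparser;

// Minimal logger required by the TensorRT builder.
class Logger : public ILogger
{
    void log(Severity severity, const char* msg) override
    {
        if (severity != Severity::kINFO)
            std::cout << msg << std::endl;
    }
} gLogger;

int main()
{
    // Import a TensorFlow-trained network that has been converted to UFF.
    // The file name, tensor names, and dimensions here are hypothetical.
    IUffParser* parser = createUffParser();
    parser->registerInput("input", DimsCHW(3, 224, 224));
    parser->registerOutput("prob");

    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();
    parser->parse("resnet50.uff", *network, DataType::kFLOAT);

    // Builder settings: batch size, scratch memory, and INT8 mode.
    builder->setMaxBatchSize(8);
    builder->setMaxWorkspaceSize(1 << 30);  // 1 GB of scratch space
    builder->setInt8Mode(true);
    // An IInt8Calibrator (not shown) feeds representative input batches
    // so TensorRT can choose quantization ranges with minimal accuracy loss:
    // builder->setInt8Calibrator(&calibrator);

    // Layer/tensor fusion and kernel selection happen inside this call.
    ICudaEngine* engine = builder->buildCudaEngine(*network);

    // Serialize the optimized runtime for framework-free deployment.
    IHostMemory* plan = engine->serialize();
    // ... write plan->data(), plan->size() to disk ...

    plan->destroy();
    engine->destroy();
    network->destroy();
    builder->destroy();
    parser->destroy();
    return 0;
}
```

The expensive optimization work is done once at build time; the serialized plan can then be loaded on the target machine and executed without the original training framework.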
For speech-based applications, researchers have recently described a more generous latency threshold of around 200 ms as acceptable. OpenNMT, a recurrent neural network (RNN) used here for English-to-German translation, is one example of such a latency-sensitive workload.
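Latency budgets like this can be checked directly against the deployed engine. Below is a sketch, again assuming the TensorRT 3-era C++ API, that times a single inference call with CUDA events; it reuses an engine built as in the previous sketch, and the buffer sizes and batch size of 1 are hypothetical.

```cpp
#include <cuda_runtime.h>
#include "NvInfer.h"

using namespace nvinfer1;

// Times one inference call with CUDA events. Assumes `engine` was built
// as in the earlier sketch and has exactly two bindings (one input, one
// output); the buffer sizes and batch size below are placeholders.
float timeInferenceMs(ICudaEngine& engine)
{
    IExecutionContext* context = engine.createExecutionContext();

    void* buffers[2];
    cudaMalloc(&buffers[0], 1 << 20);  // input buffer (placeholder size)
    cudaMalloc(&buffers[1], 1 << 20);  // output buffer (placeholder size)

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Bracket the asynchronous inference call with events on the stream.
    cudaEventRecord(start, stream);
    context->enqueue(/*batchSize=*/1, buffers, stream, nullptr);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);  // wait for the inference to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    cudaFree(buffers[0]);
    cudaFree(buffers[1]);
    context->destroy();
    return ms;
}
```

Because enqueue() is asynchronous, the same pattern extends naturally to multi-stream execution: launch batches on several CUDA streams against one engine, and the GPU overlaps them.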
A platform built for deep learning must have three distinct qualities. It must have a processor custom-built for deep learning. It must be software programmable. And industry frameworks must be optimized for it, powered by a developer ecosystem that is accessible and adopted around the world.
The NVIDIA deep learning platform is designed around these three qualities and is the only end-to-end deep learning platform. From training to inference. From data center to the network’s edge.
Learn more about NVIDIA inference platforms.
Get started with TensorRT 3 today.
Performance comparison based on ResNet-50, ingested from a TensorFlow-trained network using the ImageNet dataset: NVIDIA Tesla V100 GPU running TensorRT 3 RC vs. Intel Xeon D-1587 Broadwell-E CPU running the Intel DL SDK. CPU score doubled to account for Intel’s stated claim of a 2x performance improvement on Skylake with AVX-512.