AI Drives the Rise of Accelerated Computing in Data Centers

The recent TPU paper by Google draws a clear conclusion – without accelerated computing, the scale-out of AI is simply not practical.

Today’s economy runs in the world’s data centers, and data centers are changing dramatically. Not so long ago, they served up web pages, advertising and video content. Now, they recognize voices, detect images in video streams and connect us with information we need exactly when we need it.

Increasingly, those capabilities are enabled by a form of artificial intelligence called deep learning. Deep learning is an algorithm that learns from massive amounts of data to create software that can tackle such challenges as translating languages, diagnosing cancer and teaching autonomous cars to drive. The change brought about by AI is accelerating at a pace never seen before in our industry.

A pioneering researcher of deep learning, Geoffrey Hinton, told The New Yorker recently, “Take any old classification problem where you have a lot of data, and it’s going to be solved by deep learning. There’s going to be thousands of applications of deep learning.”

Unreasonably Effective Results

Take Google. Its application of groundbreaking work in deep learning has captured the world’s attention: the startling precision of its Google Now service, the landmark victory over the world’s greatest Go player, and Google Translate’s ability to operate in 100 different languages.

Deep learning has achieved unreasonably effective results. But the approach demands that computers process vast seas of data at precisely the time when Moore’s law is slowing. Deep learning is a new computing model that has required the invention of a new computing architecture.

This changing architecture of the AI compute model has occupied NVIDIA for some time. In 2010, Dan Ciresan, a researcher at Professor Juergen Schmidhuber’s Swiss AI Lab, discovered that NVIDIA GPUs can be used to train deep neural networks and achieved a speedup of 50 times over CPUs. A year later, Schmidhuber’s lab used GPUs to develop the first pure deep neural networks that won international contests in handwriting recognition and computer vision.

In 2012, Alex Krizhevsky, then a grad student at the University of Toronto, won the now-famous annual ImageNet large-scale image recognition competition using a pair of GPUs. (Schmidhuber has chronicled a comprehensive history of the impact of GPU deep learning on modern computer vision.)

Optimizing for Deep Learning

AI researchers all over the world have discovered that the GPU-accelerated computing model NVIDIA had pioneered for computer graphics and supercomputing applications is ideal for deep learning. Deep learning – like 3D graphics, medical imaging, molecular dynamics, quantum chemistry and weather simulations – is a linear-algebra algorithm that requires massively parallel computation on tensors, or multi-dimensional arrays. And while NVIDIA’s Kepler-generation GPU, architected in 2009, helped awaken the world to the possibility of using GPU-accelerated computing in deep learning, it was never specifically optimized for that task.
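To make the linear-algebra point concrete, here is a minimal sketch of one fully connected neural-network layer written with NumPy. The function names and the tensor shapes are illustrative assumptions, not code from any NVIDIA or Google library. The loop version shows that each output element depends only on one row of the input and one column of the weights, which is why the work can be spread across many parallel processors.

```python
import numpy as np

def dense_layer(x, w, b):
    """Fully connected layer: matrix multiply, bias add, then ReLU."""
    return np.maximum(x @ w + b, 0.0)

def dense_layer_loops(x, w, b):
    """The same computation spelled out element by element.

    Each out[i, j] uses only row i of x and column j of w, so every
    output element could be computed independently and in parallel.
    """
    batch, n_in = x.shape
    n_out = w.shape[1]
    out = np.zeros((batch, n_out), dtype=x.dtype)
    for i in range(batch):
        for j in range(n_out):
            acc = b[j]
            for k in range(n_in):
                acc += x[i, k] * w[k, j]
            out[i, j] = max(acc, 0.0)
    return out

# Hypothetical sizes: a batch of 4 inputs, 8 features in, 3 features out.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w = rng.standard_normal((8, 3))
b = rng.standard_normal(3)

y_fast = dense_layer(x, w, b)
y_slow = dense_layer_loops(x, w, b)
assert np.allclose(y_fast, y_slow)
```

The vectorized and loop versions agree; an accelerator gets its speedup by performing the independent multiply-accumulate steps of the inner loops simultaneously.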

We got to work, developing new generations of GPU architecture, first Maxwell, and then Pascal, which included many architecture advances specifically for deep learning. Introduced just four years after the Kepler-based Tesla K80, our Pascal-based Tesla P40 Inferencing Accelerator delivers 26x the K80’s deep-learning inferencing performance, far outstripping Moore’s law.

During this time, Google designed a custom accelerator chip called the tensor processing unit, or TPU, specifically to handle inferencing, which it deployed in 2015.

Google’s team released technical information about the benefits of TPUs this past week. It asserts, among other things, that the TPU has 13x the inferencing performance of the K80. However, it doesn’t compare the TPU to the current-generation Pascal-based P40.

Updating Google’s Comparison

To update Google’s comparison, we created the chart below to quantify the performance leap from K80 to P40, and to show how the TPU compares to current NVIDIA technology.

The P40 balances computational precision and throughput, on-chip memory and memory bandwidth to achieve unprecedented performance for training as well as inferencing. For training, the P40 has 10x the memory bandwidth of the TPU and 12 teraflops of 32-bit floating-point performance. For inferencing, the P40 offers high-throughput 8-bit integer arithmetic and high memory bandwidth.
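As a rough illustration of what 8-bit integer inferencing involves, the sketch below quantizes floating-point activations and weights to 8-bit integers, multiplies them with 32-bit accumulation, and rescales the result. It is a generic symmetric-quantization scheme written with NumPy under assumed shapes and helper names; it is not the specific method used by the P40 or the TPU.

```python
import numpy as np

def quantize_symmetric(t):
    """Map a float tensor to int8 using a single scale for the whole tensor."""
    max_abs = float(np.max(np.abs(t)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(x, w):
    """Approximate x @ w using 8-bit operands and 32-bit accumulation."""
    qx, sx = quantize_symmetric(x)
    qw, sw = quantize_symmetric(w)
    # Widen to int32 before multiplying so the running sums cannot overflow.
    acc = qx.astype(np.int32) @ qw.astype(np.int32)
    # Undo both scales to return to floating-point units.
    return acc.astype(np.float32) * np.float32(sx * sw)

# Hypothetical layer: 4 inputs with 64 features, 16 output features.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
w = rng.standard_normal((64, 16)).astype(np.float32)

exact = x @ w
approx = int8_matmul(x, w)
rel_err = float(np.linalg.norm(approx - exact) / np.linalg.norm(exact))
print(f"relative error of int8 result: {rel_err:.4f}")
```

The trade-off is a small loss of precision in exchange for operands that are a quarter the size of 32-bit floats, which reduces memory traffic and lets hardware pack more multiply-accumulate units into the same area.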

Data based on “In-Datacenter Performance Analysis of a Tensor Processing Unit,” Jouppi et al. [Jou17], and NVIDIA internal benchmarking. K80 to TPU performance ratios are based on the average of CNN0 and CNN1 acceleration ratios from [Jou17], which compared performance to a half-enabled K80. K80 to P40 performance ratios are based on GoogLeNet, a publicly available CNN model with similar performance properties.

While Google and NVIDIA chose different development paths, there were several themes common to both our approaches. Specifically:

  • AI requires accelerated computing. Accelerators provide the significant data processing necessary to keep up with the growing demands of deep learning in an era when Moore’s law is slowing.
  • Tensor processing is at the core of delivering performance for deep learning training and inference.
  • Tensor processing is a major new workload enterprises must consider when building modern data centers.
  • Accelerating tensor processing can dramatically reduce the cost of building modern data centers.

The technology world is in the midst of a historic transformation already being referred to as the AI Revolution. The place where its impact is most obvious today is in the hyperscale data centers of Alibaba, Amazon, Baidu, Facebook, Google, IBM, Microsoft, Tencent and others. They need to accelerate AI workloads without having to spend billions of dollars building and powering new data centers with CPU nodes. Without accelerated computing, the scale-out of AI is simply not practical.

  • Frank Busborg

    Thanks for the update. I am looking forward to a future Volta comparison.
    Awesome work Nvidia.

  • Mike Lee

    well said, thank you!!

  • Wei Tan

    What is the relation between Inference/Sec and Inference TOPS? Why is the TPU lower in Inference/Sec but higher in Inference TOPS?

  • Simon Gu

    That means the TPU is limited by bandwidth or latency, or you could say Google over-designed the ALUs, so it is not a balanced system.

  • Wei Tan

    Thanks. So Inference TOPS is the peak bare-metal performance and Inference/Sec is the performance actually observed. I can see that the TPU over-provisioned its ALUs. To hit the plateau of the roofline, the operation density has to be as high as 1350 (versus 10-20 for GPUs).

  • John Weber

    So what I essentially gather from this is that Nvidia has higher memory bandwidth – why is this a strong argument against TPUs? Is there a belief that the “concept” of TPUs is bound by this? It is not as though Nvidia holds some unique IP/advantage for increasing memory bandwidth. Sure, Nvidia has NVLink, but are we ever going to see the most important players support this so that it can actually be taken advantage of? With Google’s support of OpenCAPI and the existence of CCIX, we may yet see another take on this.

    An important but unmentioned characteristic is the power consumption of these platforms. Even looking at this “revised” comparison, one can still see the advantage that the TPU brings to the table. The GPU might currently double the inference speed, but at over 3x the power consumption. Power might not be an issue for small academic research teams, but when you look at the scales of Google, Baidu, etc., this matters – and it matters a lot.

    I believe there is a deeper struggle that is surfacing and that is the nature of what acceleration needs to be. How complex are the tasks they deal with and how general purpose do they need to be? This is an interesting time for acceleration and I’m excited to see what’s coming next.
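The roofline reasoning raised in the comment thread above can be sketched in a few lines of Python. The peak-throughput and bandwidth figures are assumptions taken from [Jou17] rather than from this post, the function names are illustrative, and the workload intensities in the loop are hypothetical.

```python
def attainable_rate(peak_rate, bytes_per_sec, ops_per_byte):
    """Roofline model: the lower of the compute roof and the memory roof."""
    return min(peak_rate, bytes_per_sec * ops_per_byte)

def ridge_point(peak_rate, bytes_per_sec):
    """Operational intensity at which the two roofs meet."""
    return peak_rate / bytes_per_sec

# Assumed TPU figures from [Jou17]:
# 92 TOPS peak, counting one multiply-accumulate (MAC) as two operations.
TPU_PEAK_MACS = 92e12 / 2
# 34 GB/s of weight-memory bandwidth.
TPU_BANDWIDTH = 34e9

ridge = ridge_point(TPU_PEAK_MACS, TPU_BANDWIDTH)
print(f"TPU ridge point: about {ridge:.0f} MACs per byte")

# Hypothetical workloads below, near and above the ridge point.
for intensity in (100, 1000, 2000):
    rate = attainable_rate(TPU_PEAK_MACS, TPU_BANDWIDTH, intensity)
    print(f"{intensity:>5} MACs/byte -> {rate / TPU_PEAK_MACS:.0%} of peak")
```

Under these assumptions the ridge point lands close to the 1350 figure quoted in the thread: a workload that performs fewer multiply-accumulates per byte fetched is limited by memory bandwidth and cannot reach the chip's peak TOPS.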