The recent TPU paper by Google draws a clear conclusion: without accelerated computing, the scale-out of AI is simply not practical.
Today’s economy runs in the world’s data centers, and data centers are changing dramatically. Not so long ago, they served up web pages, advertising and video content. Now, they recognize voices, detect images in video streams and connect us with information we need exactly when we need it.
Increasingly, those capabilities are enabled by a form of artificial intelligence called deep learning. Deep learning is an algorithm that learns from massive amounts of data to create software that can tackle such challenges as translating languages, diagnosing cancer and teaching autonomous cars to drive. The change brought about by AI is accelerating at a pace never seen before in our industry.
A pioneering researcher of deep learning, Geoffrey Hinton, told The New Yorker recently, “Take any old classification problem where you have a lot of data, and it’s going to be solved by deep learning. There’s going to be thousands of applications of deep learning.”
Unreasonably Effective Results
Take Google. Its application of groundbreaking work in deep learning has captured the world’s attention: the startling precision of its Google Now service, the landmark victory over the world’s greatest Go player and Google Translate’s ability to operate in 100 different languages.
Deep learning has achieved unreasonably effective results. But the approach demands that computers process vast seas of data at precisely the time when Moore’s law is slowing. Deep learning is a new computing model that has required the invention of a new computing architecture.
This changing architecture of the AI compute model has occupied NVIDIA for some time. In 2010, Dan Ciresan, a researcher at Professor Juergen Schmidhuber’s Swiss AI Lab, discovered that NVIDIA GPUs could be used to train deep neural networks, achieving a 50x speedup over CPUs. A year later, Schmidhuber’s lab used GPUs to develop the first pure deep neural networks that won international contests in handwriting recognition and computer vision.
Then, in 2012, Alex Krizhevsky, then a grad student at the University of Toronto, won the now-famous annual ImageNet large-scale image recognition competition using a pair of GPUs. (Schmidhuber has chronicled a comprehensive history of the impact of GPU deep learning on modern computer vision.)
Optimizing for Deep Learning
AI researchers all over the world have discovered that the GPU-accelerated computing model NVIDIA pioneered for computer graphics and supercomputing applications is ideal for deep learning. Deep learning, like 3D graphics, medical imaging, molecular dynamics, quantum chemistry and weather simulation, is built on linear algebra and requires massively parallel computation on tensors, or multi-dimensional arrays. And while NVIDIA’s Kepler-generation GPU, architected in 2009, helped awaken the world to the possibility of using GPU-accelerated computing in deep learning, it was never specifically optimized for that task.
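To make the linear-algebra point concrete, here is a minimal Python sketch of a single fully connected neural-network layer. It is illustrative only: the function name `dense_layer` and the choice of a ReLU activation are assumptions for this example, not part of any NVIDIA or Google API. The key observation is that every output element is an independent dot product, so the work can be spread across many parallel threads.

```python
# Illustrative sketch (hypothetical helper, not from the source): one fully
# connected layer is a matrix-vector product plus a bias, followed by a
# nonlinearity. Each output element is an independent dot product.
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def dense_layer(weights: Matrix, inputs: Vector, bias: Vector) -> Vector:
    """Compute relu(W @ x + b) for a single input vector."""
    outputs = []
    for row, b in zip(weights, bias):
        # None of these dot products depends on another, so on a GPU
        # each one could be computed by its own thread.
        acc = sum(w * x for w, x in zip(row, inputs)) + b
        outputs.append(max(0.0, acc))  # ReLU activation
    return outputs


if __name__ == "__main__":
    W = [[1.0, -2.0, 0.5],
         [0.0, 1.0, 1.0]]
    x = [2.0, 1.0, 4.0]
    b = [0.5, -1.0]
    print(dense_layer(W, x, b))  # prints [2.5, 4.0]
```

Stacking many input vectors into a batch turns the matrix-vector product into a matrix-matrix product on larger tensors, which is the shape of work that both GPUs and TPUs are built to accelerate.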
We got to work, developing new generations of GPU architecture, first Maxwell and then Pascal, which included many architecture advances specifically for deep learning. Introduced just four years after the Kepler-based Tesla K80, our Pascal-based Tesla P40 Inferencing Accelerator delivers 26x the K80’s deep-learning inferencing performance, far outstripping Moore’s law.
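A rough way to see what “far outstripping Moore’s law” means is to compare the stated 26x gain with what a simple doubling trend would predict over the same four years. The two-year doubling period below is an assumption (it is a common statement of Moore’s law), and treating transistor doubling as a proxy for performance is a simplification; the helper `moores_law_factor` is hypothetical.

```python
# Back-of-the-envelope comparison of a Moore's-law trend with the stated
# K80 -> P40 inferencing gain. The doubling period is an ASSUMPTION.

def moores_law_factor(years: float, doubling_period_years: float = 2.0) -> float:
    """Growth factor if capability doubles every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    years = 4.0           # K80 -> P40 interval, as stated in the text
    reported_gain = 26.0  # P40 vs. K80 inferencing speedup, as stated in the text

    trend = moores_law_factor(years)
    print(f"Moore's-law trend over {years:.0f} years: {trend:.1f}x")
    print(f"Reported gain: {reported_gain:.0f}x, "
          f"about {reported_gain / trend:.1f}x above the trend")
```

With a two-year doubling period the trend predicts 4x, so the stated 26x is 6.5 times higher; even a more aggressive 18-month doubling period predicts only roughly 6x.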
During this time, Google designed a custom accelerator chip called the tensor processing unit, or TPU, specifically to handle inferencing, which it deployed in 2015.
Its team released technical information about the benefits of TPUs this past week. The paper asserts, among other things, that the TPU has 13x the inferencing performance of the K80. However, it doesn’t compare the TPU to the current-generation, Pascal-based P40.
Updating Google’s Comparison
To update Google’s comparison, we created the chart below to quantify the performance leap from K80 to P40, and to show how the TPU compares to current NVIDIA technology.
The P40 balances computational precision and throughput, on-chip memory and memory bandwidth to achieve unprecedented performance for training as well as inferencing. For training, P40 has 10x the memory bandwidth of the TPU and 12 teraflops of 32-bit floating-point performance. For inferencing, P40 offers high-throughput 8-bit integer math and high memory bandwidth.
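Inferencing in 8-bit integers works because a trained network’s weights and activations can usually be mapped onto a small integer range with little loss of accuracy. The sketch below uses symmetric, per-tensor scaling, which is one common scheme and an assumption here; it is not a description of how the P40 or NVIDIA’s software stack quantizes models. The helper names `scale_for`, `quantize` and `int8_dot` are hypothetical.

```python
# Minimal sketch of an 8-bit integer dot product with symmetric per-tensor
# quantization (an ASSUMED scheme, chosen for illustration).
from typing import List


def scale_for(values: List[float]) -> float:
    """Scale that maps the largest magnitude in `values` to 127."""
    peak = max(abs(v) for v in values)
    return peak / 127 if peak > 0 else 1.0


def quantize(values: List[float], scale: float) -> List[int]:
    """q = clamp(round(v / scale), -127, 127), i.e. values that fit in int8."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def int8_dot(a: List[float], b: List[float]) -> float:
    """Approximate dot(a, b) using only integer multiply-accumulates."""
    scale_a, scale_b = scale_for(a), scale_for(b)
    qa, qb = quantize(a, scale_a), quantize(b, scale_b)
    # Hardware typically accumulates int8 products in 32-bit integers;
    # Python ints are unbounded, so overflow is not modeled here.
    acc = sum(x * y for x, y in zip(qa, qb))
    # A single floating-point rescale recovers the real-valued result.
    return acc * scale_a * scale_b


if __name__ == "__main__":
    a = [0.5, -1.0, 0.25]
    b = [2.0, 0.5, -4.0]
    exact = sum(x * y for x, y in zip(a, b))
    print(f"floating-point result: {exact:.4f}")
    print(f"int8 approximation:    {int8_dot(a, b):.4f}")
```

The integer multiply-accumulates dominate the cost, which is why high INT8 throughput matters for inferencing, while memory bandwidth determines how quickly weights and activations can be fed to those units.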
While Google and NVIDIA chose different development paths, there were several themes common to both our approaches. Specifically:
- AI requires accelerated computing. Accelerators provide the data-processing throughput needed to keep up with the growing demands of deep learning in an era when Moore’s law is slowing.
- Tensor processing is at the core of delivering performance for deep learning training and inference.
- Tensor processing is a major new workload enterprises must consider when building modern data centers.
- Accelerating tensor processing can dramatically reduce the cost of building modern data centers.
The technology world is in the midst of a historic transformation already being referred to as the AI Revolution. The place where its impact is most obvious today is in the hyperscale data centers of Alibaba, Amazon, Baidu, Facebook, Google, IBM, Microsoft, Tencent and others. They need to accelerate AI workloads without having to spend billions of dollars building and powering new data centers with CPU nodes. Without accelerated computing, the scale-out of AI is simply not practical.
GPU-accelerated computing has enabled deep learning and ignited modern AI. Come to our GPU Technology Conference, on May 8-11, in San Jose, California. You’ll hear AI pioneers talk about their groundbreaking discoveries, and learn about the latest advances in GPU computing and how they are revolutionizing one industry after another.