Correcting Intel’s Deep Learning Benchmark Mistakes
August 16, 2016
Benchmarks are an important tool for measuring performance, but in a rapidly evolving field it can be difficult to keep up with the state of the art. Recently Intel published some incorrect “facts” about their long-promised Xeon Phi processors.
Few fields are moving faster right now than deep learning. Today’s neural networks are 6x deeper and more powerful than just a few years ago. There are new techniques in multi-GPU scaling that offer even faster training performance.
In addition, our architecture and software have improved neural network training time by over 10x in a year by moving from Kepler to Maxwell to today’s latest Pascal-based systems, like the DGX-1 with eight Tesla P100 GPUs.
So it’s understandable that newcomers to the field may not be aware of all the developments that have been taking place in both hardware and software.
For example, Intel recently published some out-of-date benchmarks to make three claims about deep learning performance with Knights Landing Xeon Phi processors:
- Xeon Phi is 2.3x faster in training than GPUs(1)
- Xeon Phi offers 38% better scaling than GPUs across nodes(2)
- Xeon Phi delivers strong scaling to 128 nodes while GPUs do not(3)
We’d like to address these claims and correct some misperceptions that may arise.
Fresh vs Stale Caffe
Intel used Caffe AlexNet data that is 18 months old, comparing a system with four Maxwell GPUs to four Xeon Phi servers. With the more recent implementation of Caffe AlexNet, publicly available here, Intel would have discovered that the same system with four Maxwell GPUs delivers 30% faster training time than four Xeon Phi servers.
38% Better Scaling
Intel is comparing Caffe GoogLeNet training performance on 32 Xeon Phi servers to 32 servers from Oak Ridge National Laboratory’s Titan supercomputer. Titan uses four-year-old GPUs (Tesla K20X) and an interconnect technology inherited from the prior Jaguar supercomputer. Xeon Phi results were based on recent interconnect technology.
Using more recent Maxwell GPUs and interconnect, Baidu has shown that their speech training workload scales almost linearly up to 128 GPUs.
Scalability relies on the interconnect and architectural optimizations in the code as much as the underlying processor. GPUs are delivering great scaling for customers like Baidu.
Strong-Scaling to 128 Nodes
Intel claims that 128 Xeon Phi servers deliver 50x faster performance compared with a single Xeon Phi server, while no such scaling data exists for GPUs. As noted above, Baidu already published results showing near-linear scaling up to 128 GPUs.
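To put the 50x figure in context, here is a minimal sketch of the usual strong-scaling efficiency calculation (achieved speedup divided by node count). The function name `scaling_efficiency` is ours for illustration, and the only inputs are the 50x and 128-node numbers quoted above.

```python
def scaling_efficiency(speedup: float, nodes: int) -> float:
    """Return the fraction of ideal (linear) speedup actually achieved."""
    if nodes <= 0:
        raise ValueError("nodes must be a positive integer")
    return speedup / nodes


# Figures quoted above: 128 Xeon Phi servers, 50x faster than one server.
xeon_phi_efficiency = scaling_efficiency(50.0, 128)
print(f"{xeon_phi_efficiency:.0%}")  # 39%
```

Perfectly linear scaling would give 100%, so a 50x speedup on 128 nodes leaves roughly three fifths of the added hardware unaccounted for.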
For strong-scaling, we believe strong nodes are better than weak nodes. A single strong server with numerous powerful GPUs delivers better performance than many weak nodes, each with one or two sockets of less-capable processors, like Xeon Phi. For example, a single DGX-1 system offers better strong-scaling performance than at least 21 Xeon Phi servers (DGX-1 is 5.3x faster than 4 Xeon Phi servers, and 5.3 x 4 is about 21).
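The arithmetic behind the "at least 21" figure can be sketched as follows. The helper name `equivalent_servers` is hypothetical, and the calculation assumes the Xeon Phi servers scale perfectly linearly, which is the most favorable case for them; any sub-linear scaling would push the real number higher, hence "at least".

```python
import math


def equivalent_servers(speedup_vs_group: float, group_size: int) -> float:
    """Servers needed to match one system that is `speedup_vs_group` times
    faster than `group_size` servers, ASSUMING perfectly linear scaling."""
    return speedup_vs_group * group_size


# Figures quoted above: DGX-1 is 5.3x faster than 4 Xeon Phi servers.
needed = equivalent_servers(5.3, 4)
print(round(needed, 1))    # 21.2
print(math.floor(needed))  # 21
```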
Era of AI
Deep learning has the potential to revolutionize computing, improve our lives, improve the efficiency and intelligence of our business systems, and deliver advancements that will help humanity in profound ways. That’s why we’ve been enhancing the design of our parallel processors and creating software and technologies to accelerate deep learning for many years.
Our dedication to deep learning is deep and broad. Every framework has NVIDIA-optimized support, and every major deep learning researcher, laboratory and company is using NVIDIA GPUs.
While we can correct each of their wrong claims, we think that testing deep learning against old Kepler GPUs and outdated software versions is a mistake that is easily fixed in order to keep the industry up to date.
It’s great that Intel is now working on deep learning. With the era of AI upon us, this is the most important computing revolution, and deep learning is too big to ignore. But they should get their facts straight.