Blood, Software and 120 Billion Transistors: How NVIDIA Built DGX-1
July 11, 2016
Duct tape. Plexiglass. Plastic ties. Ear plugs. Band-Aids.
If you’re building something for the first time — whether it’s a spaceship or a sports car — this is where you start. And if you look through the expense reports of the engineering teams that put together our NVIDIA DGX-1 deep learning system, that’s what you’ll find.
The result: an immaculate $129,000 jewel box compact enough to slide into a display case at Tiffany’s. Yet it delivers up to 170 teraflops of computing power. That’s equal to 250 x86 servers — performance one news outlet described as “insane.”
Introduced at our GPU Technology Conference in April, the DGX-1 is loaded with deep learning software that spits out results with super-human capabilities. It’s all wired into cloud-based services for quick deployment and instant system updates. No assembly required.
DGX-1: Fuel for the AI Boom
That makes it perfect for anyone looking to take deep learning out of the research lab and put it to work quickly and more easily. Over the past five years, researchers have built systems that began to rival — and soon exceeded — what humans could do on tasks once thought impossible for computers to handle.
Hundreds of millions of people now rely on services powered by deep learning for speech recognition, real-time voice translation and video discovery. More’s coming. But it takes time and people to build deep learning systems from disparate parts. This is where DGX-1 comes in, opening new opportunities for every industry and for our partner ecosystem.
Blood, Duct Tape, Software
The story behind DGX-1 is the tale of a race — fueled by blood, duct tape and software — by interconnected teams of engineers to finish each piece of a radical new deep learning system the moment the next team would need it.
“This isn’t just a piece of hardware, this isn’t just a piece of software,” says Mike, one of the key engineers involved in the project. “Just click three UI buttons, and you get all these new capabilities.”
Speed of Light
The race began a year ago. In March 2015, NVIDIA CEO Jen-Hsun Huang promised attendees at that year’s GPU Technology Conference that Pascal, our upcoming GPU architecture, would deliver 10x better performance on key deep learning tasks in one year. The problem: it would take weeks — even months — for researchers and companies to build machines around these new GPUs and put them to work.
A couple of months later, at a meeting of company leaders, Huang challenged NVIDIA’s engineering teams to build a server around Pascal in time for GTC 2016, in April. It would allow researchers and businesses to flip a switch and put the power of eight of these GPUs to work for deep learning.
This would take much more than just hardware built around chips that didn’t even exist yet. Twenty-five separate pieces of the DGX-1’s software “stack” — from the open-source Ubuntu operating system to our DIGITS deep-learning training system to our CUDA Deep Neural Network (cuDNN) GPU-accelerated library of primitives and an array of NVIDIA drivers — would need to work in harmony.
Jen-Hsun challenged them to bring all these pieces together at the “speed of light”: to think about the fundamental limits of what’s possible, and push to those extremes.
Roughly a dozen separate engineering teams swung into action. “There is no other company that knows how to swarm like we do,” says John, the product architect and engineering lead, summing up the project. “You just identify a handful of leaders and they pull on everyone they need to make it happen.”
Here’s how it happened:
- May 2015 — A team of engineers sketches out a radical new topology that will yoke together the eight GPUs inside DGX-1, each with 15 billion transistors. The solution: a cube mesh. The design will let users throw eight GPUs at deep learning tasks, or split the system into two separate subsystems to tackle more traditional high performance computing work. But they won’t know if it will work for another seven months. The first samples of Pascal, the first GPU to use NVLink — our high-speed interconnect technology that will power the mesh — won’t arrive until the last quarter of 2015 (see “What Is NVLink?”).
- September 2015 — Teams of software engineers begin building system software called NCCL, for NVIDIA Collective Communication Library, which will run on top of DGX-1’s cube mesh topology. Other teams begin tuning a software stack that will run on top of NCCL that includes the most used deep learning and high performance computing tools — such as Caffe, Theano, Torch, TensorFlow and CNTK.
- November 2015 — Engineers begin the painstaking process of “bringing up” the first samples of Pascal from the chip fab, or factory. This is no ordinary bring-up. For Pascal, NVIDIA’s GPU designers created a new architecture that includes features that will help users tear through deep learning problems. They’re also the first GPUs built with features as little as 16 nm wide, a quarter of the length your fingernails grow every minute (see “NVIDIA Delivers Massive Performance Leap for Deep Learning, HPC Applications, With Tesla P100 Accelerators”).
- December 2015 — With the new Pascal GPUs running, engineers begin putting them into a working system. The catch: The first chassis for DGX-1 won’t be ready until the end of January. So engineers use metal, duct tape and plexiglass to improvise a rig, with cuts and scrapes ensuing. They then connect two GPUs, then three, but — in a potential showstopper — can’t connect a fourth. Turns out two parentheses are missing from a key piece of code. Two keystrokes later, the network springs to life. “It was like it was just meant to be,” one engineer says.
- January 2016 — Once the insides of DGX-1 are finished, NVIDIA’s industrial design team begins using their new digital rendering tool called Iray to build precise models of DGX-1’s bezel and machined aluminum enclosure. In March, they choose a metalized foam — a lightweight, super-strong material used in planes — to give the machine the ability to suck in cool air faster than conventional perforated metal.
- March 29, 2016 — The enclosure for the final server prototype is hand-carried on a plane back from a modeling shop in South Korea. With less than a week until GTC, all the pieces of DGX-1 come together for the first time. Within days, the system — powered by eight soon-to-be-announced Tesla P100 GPUs — delivers 10x performance improvements on the AlexNet deep learning benchmark, completing tasks in two hours that once took more than 20.
- April 3, 2016 — The day before GTC’s opening, DGX-1 clocks a 12x performance improvement on AlexNet.
- April 5, 2016 — Huang presents the first DGX-1 server to the world. Camera shutters click. Enthusiasts gawk. Reporters write articles.
- May 30, 2016 — NVIDIA engineers ready the first batch of DGX-1 systems for customers. One customer, though, couldn’t wait. The first unit, shown off at GTC, is now whirring away on a server rack deep inside our Silicon Valley headquarters. It’s crunching through data collected by our autonomous driving team in New Jersey for our DRIVE PX autonomous driving platform. “After all,” one NVIDIAN muses as he looks over the machine, “why slow down now?”
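For readers curious about what a cube mesh and a collective communication library actually do, here is a rough sketch in Python. It is not NVIDIA's code and does not use the NCCL API. The link layout is an assumption (two groups of four fully connected GPUs, with each GPU also linked to its counterpart in the other group), and `cube_mesh_links`, `ring_all_reduce` and `RING` are hypothetical names made up for illustration. The toy all-reduce simply passes numbers around a ring until every "GPU" holds the same sum, which is the general idea behind combining results from eight GPUs during training.

```python
from itertools import combinations

NUM_GPUS = 8  # from the article: eight GPUs per DGX-1


def cube_mesh_links(num_gpus=NUM_GPUS):
    """Return the GPU pairs joined by a link in the assumed mesh layout."""
    half = num_gpus // 2
    links = set()
    # Assumption: each GPU links to the other three in its group of four.
    for group in (range(0, half), range(half, num_gpus)):
        links.update(combinations(group, 2))
    # Assumption: each GPU also links to its counterpart in the other group.
    links.update((i, i + half) for i in range(half))
    return links


def ring_all_reduce(values):
    """Toy ring all-reduce: each rank forwards what it last received."""
    n = len(values)
    totals = list(values)
    in_flight = list(values)
    for _ in range(n - 1):
        # Rank r receives the value that rank r - 1 held on the previous step.
        in_flight = [in_flight[(r - 1) % n] for r in range(n)]
        totals = [t + v for t, v in zip(totals, in_flight)]
    return totals


links = cube_mesh_links()
# Number of links per GPU under the assumed layout.
degree = {g: sum(g in pair for pair in links) for g in range(NUM_GPUS)}

# A hypothetical ring that only uses links present in the assumed mesh.
RING = [0, 1, 2, 3, 7, 6, 5, 4]
ring_hops = [
    tuple(sorted((RING[i], RING[(i + 1) % NUM_GPUS]))) for i in range(NUM_GPUS)
]
ring_is_valid = all(hop in links for hop in ring_hops)

if __name__ == "__main__":
    print("links per GPU:", degree)
    print("ring uses only mesh links:", ring_is_valid)
    print("all-reduce result:", ring_all_reduce([1, 2, 3, 4, 5, 6, 7, 8]))
```

Under these assumptions, dropping the four cross-group links leaves two independent groups of four GPUs, which matches the article's description of splitting the system into two subsystems.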
Learn More About DGX-1