And you think you’ve mastered social distancing.
Selene is at the center of some of NVIDIA’s most ambitious technology efforts.
Selene sends thousands of messages a day to colleagues on Slack.
Selene’s wired into GitLab, a key industry tool for tracking the deployment of code, providing instant updates to colleagues on how their projects are going.
One of NVIDIA’s best resources works just a block from NVIDIA’s Silicon Valley, Calif., campus, but during the pandemic Selene can be visited only with the aid of a remote-controlled robot.
Selene is, of course, a supercomputer.
The world’s fastest commercial machine, Selene was named the fifth-fastest supercomputer in the world on November’s closely watched TOP500 list.
Built with new NVIDIA A100 GPUs, Selene achieved 63.4 petaflops on HPL, a key benchmark for high-performance computing, on that same TOP500 list.
While the TOP500 benchmark, originally launched in 1993, continues to be closely watched, a more important metric today is peak AI performance.
By that metric, using the A100’s third-generation Tensor Cores, Selene delivers over 2,795 petaflops*, or nearly 2.8 exaflops, of peak AI performance.
Learn more: NVIDIA DGX SuperPOD solution for enterprise – the fastest path to innovation at scale.
The new version of Selene doubles the performance over the prior version, which holds all eight performance records on MLPerf AI Training benchmarks for commercially available products.
But what’s remarkable about this machine isn’t its raw performance. Or how long it takes the two-wheeled, NVIDIA Jetson TX2 powered robot, dubbed “Trip,” tending Selene to traverse the co-location facility — a kind of hotel for computers — housing the machine.
Or even the quiet (by supercomputing standards) hum of the fans cooling its 555,520 computing cores and 1,120,000 gigabytes of memory, all connected by NVIDIA Mellanox HDR InfiniBand networking technology.
It’s how closely it’s wired into the day-to-day work of some of NVIDIA’s top researchers.
That’s why — with the rest of the company downshifting for the holidays — Mike Houston is busier than ever.
Houston, who holds a Ph.D. in computer science from Stanford and is a recent winner of the ACM Gordon Bell Prize, is NVIDIA’s AI systems architect, coordinating time on Selene among more than 450 active users at the company.
Sorting through proposals to do work on the machine is a big part of his job. To do that, Houston says he aims to balance research, advanced development and production workloads.
NVIDIA researchers such as Bryan Catanzaro, vice president for applied deep learning research, say there’s nothing else like Selene.
“Selene is the only way for us to do our most challenging work,” said Catanzaro, whose team will be putting the machine to work the week of the 21st. “We would not be able to do our jobs without it.”
Catanzaro leads a team of more than 40 researchers who are using the machine to help advance their work in large-scale language modeling, one of the toughest AI challenges.
His words are echoed by researchers across NVIDIA vying for time on the machine.
Built in just three weeks this spring, Selene’s capacity has more than doubled since it was first turned on. That makes it the crown jewel in an ever-growing, interconnected complex of supercomputing power at NVIDIA.
In addition to large-scale language modeling, and, of course, performance runs, NVIDIA’s computing power is used by teams working on everything from autonomous vehicles to next-generation graphics rendering to tools for quantum chemistry and genomics.
Having the ability to scale up to tackle big jobs, or tear off just enough power to tackle smaller tasks, is key, explains Marc Hamilton, vice president for solutions architecture and engineering at NVIDIA.
Hamilton matter-of-factly compares it to moving dirt. Sometimes a wheelbarrow is enough to get the job done. But for other jobs, where you need more dirt, you can’t get the job done without a dump truck.
“We didn’t do it to say it’s the fifth-fastest supercomputer on Earth, but because we need it, because we use it every day,” Hamilton says.
The Fast and the Flexible
It helps that the key component Selene is built with, NVIDIA DGX SuperPOD, is incredibly efficient.
A SuperPOD achieved a power efficiency of 26.2 gigaflops/watt during its 2.4 petaflops HPL performance run, placing it atop the latest Green500 list of the world’s most efficient supercomputers.
That efficiency is a key factor in its ability to scale up, or carry bigger computing loads, by merely adding more SuperPODs.
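To put that efficiency figure in perspective, here is a rough back-of-the-envelope sketch of the power draw it implies. It assumes the "2.4" quoted for the HPL run is in petaflops and simply divides performance by efficiency; the variable names are illustrative, and the result is an estimate rather than a measured figure.

```python
# Rough power estimate from the two figures quoted in the article.
HPL_PETAFLOPS = 2.4        # assumption: the "2.4" HPL figure is in petaflops
GIGAFLOPS_PER_WATT = 26.2  # Green500 efficiency quoted in the article

# 1 petaflop = 1,000,000 gigaflops
hpl_gigaflops = HPL_PETAFLOPS * 1_000_000

# Efficiency is performance per watt, so power = performance / efficiency.
power_watts = hpl_gigaflops / GIGAFLOPS_PER_WATT
power_kw = power_watts / 1000

print(f"Implied power during the run: about {power_kw:.0f} kW")
```

Under those assumptions, a single SuperPOD's benchmark run works out to something on the order of 90 kilowatts.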
Each SuperPOD, in turn, is composed of compact, pre-configured DGX A100 systems, which are built using the latest NVIDIA Ampere architecture A100 GPUs and NVIDIA Mellanox InfiniBand for the compute and storage fabric.
Continental, Lockheed Martin and Microsoft are among the businesses that have adopted DGX SuperPODs.
The University of Florida’s new supercomputer, expected to be the fastest in academia when it goes online, is also based on SuperPOD.
Selene is now composed of four SuperPODs, each with a total of 140 nodes, each an NVIDIA DGX A100, giving Selene a total of 560 nodes, up from 280 earlier this year.
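The node count lines up with the headline numbers quoted earlier, as this short sketch shows. The per-node and per-GPU constants are assumptions taken from published DGX A100 and A100 specifications, not from this article: eight GPUs and 128 CPU cores per node, 108 streaming multiprocessors per GPU, and 624 teraflops of peak FP16/BF16 tensor throughput per GPU with structural sparsity. Counting CPU cores plus GPU multiprocessors is one plausible way to reach the core total, not a confirmed methodology.

```python
# Back-of-the-envelope check of Selene's headline figures.
SUPERPODS = 4                # from the article
NODES_PER_SUPERPOD = 140     # from the article

# Assumptions from published DGX A100 / A100 specs (not stated in the article):
GPUS_PER_NODE = 8
CPU_CORES_PER_NODE = 128     # two 64-core CPUs per DGX A100
SMS_PER_GPU = 108            # streaming multiprocessors per A100
SPARSE_TFLOPS_PER_GPU = 624  # peak FP16/BF16 tensor, structural sparsity on

nodes = SUPERPODS * NODES_PER_SUPERPOD
gpus = nodes * GPUS_PER_NODE

# Peak AI performance: per-GPU teraflops summed, converted to petaflops.
peak_ai_petaflops = gpus * SPARSE_TFLOPS_PER_GPU / 1000

# One plausible accounting of "computing cores": CPU cores plus GPU SMs.
cores = nodes * (CPU_CORES_PER_NODE + GPUS_PER_NODE * SMS_PER_GPU)

print(f"nodes: {nodes:,}")
print(f"GPUs: {gpus:,}")
print(f"peak AI performance: {peak_ai_petaflops:,.1f} petaflops")
print(f"cores (CPU cores + GPU SMs): {cores:,}")
```

Under these assumptions the arithmetic reproduces the 560 nodes, the roughly 2,795 petaflops of peak AI performance, and the 555,520 cores cited in the article.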
A Need for Speed
That’s all well and good, but Catanzaro wants all the computing power he can get.
Catanzaro, who holds a doctorate in computer science from UC Berkeley, helped pioneer the use of GPUs to accelerate machine learning a decade ago by swapping out a 1,000-CPU system for three off-the-shelf NVIDIA GeForce GTX 580 GPUs, letting him work faster.
It was one of a number of key developments that led to the deep learning revolution. Now, nearly a decade later, Catanzaro figures he has access to roughly a million times more power thanks to Selene.
“I would say our team is being really well supported by NVIDIA right now, we can do world-class, state-of-the-art things on Selene,” Catanzaro says. “And we still want more.”
That’s why — while NVIDIANs have set up Microsoft Outlook to respond with an away message as they take the week off — Selene will be busier than ever.
Click here to learn more about SuperPOD for enterprises.
*2,795 petaflops FP16/BF16 with structural sparsity enabled.