Amped Up: HPC Centers Ride A100 GPUs to Accelerate Science

Supercomputers put AI in the loop, moving into the exascale era with the NVIDIA Ampere architecture.
by Dion Harris

Six supercomputer centers around the world are among the first to adopt the NVIDIA Ampere architecture. They’ll use it to bring science into the exascale era in fields from astrophysics to virus microbiology.

The high-performance computing centers, spread across the U.S. and Germany, will use a total of nearly 13,000 A100 GPUs.

Together these GPUs pack more than 250 petaflops in peak performance for simulations that use 64-bit floating point math. For AI inference jobs that use mixed precision math and leverage the A100 GPU’s support for sparsity, they deliver a whopping 8.07 exaflops.
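As a rough cross-check of these totals, the sketch below works backward from the 8.07-exaflops figure. The per-GPU rates are assumptions taken from NVIDIA's published A100 specifications rather than from this article: 19.5 teraflops for FP64 Tensor Core math and 624 teraflops for FP16 Tensor Core math with sparsity. Treating the AI figure as FP16 with sparsity is also an assumption, and the function names are illustrative.

```python
# Per-GPU peak rates: ASSUMED from NVIDIA's published A100 specifications,
# not stated in the article.
FP64_TENSOR_TFLOPS = 19.5    # double-precision Tensor Core peak
FP16_SPARSE_TFLOPS = 624.0   # FP16 Tensor Core peak with sparsity (assumed precision)

# Aggregate mixed-precision figure quoted in the article.
AGGREGATE_AI_EXAFLOPS = 8.07


def implied_gpu_count(aggregate_exaflops, per_gpu_tflops):
    """Number of GPUs needed to reach an aggregate peak at a given per-GPU rate."""
    return aggregate_exaflops * 1e18 / (per_gpu_tflops * 1e12)


def aggregate_petaflops(gpu_count, per_gpu_tflops):
    """Aggregate peak in petaflops for gpu_count GPUs at a given per-GPU rate."""
    return gpu_count * per_gpu_tflops * 1e12 / 1e15


gpus = implied_gpu_count(AGGREGATE_AI_EXAFLOPS, FP16_SPARSE_TFLOPS)
fp64_pf = aggregate_petaflops(gpus, FP64_TENSOR_TFLOPS)

print(f"Implied GPU count: {gpus:,.0f}")
print(f"Implied FP64 peak: {fp64_pf:.1f} petaflops")
```

Under these assumptions the implied GPU count lands just under 13,000 and the implied FP64 peak just over 250 petaflops, which is consistent with the figures quoted above.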

Researchers will harness that horsepower to drive science forward in many dimensions. They plan to simulate larger models, train and deploy deeper networks, and pioneer an emerging hybrid field of AI-assisted simulations.

Argonne deployed one of the first NVIDIA DGX A100 systems. Photo courtesy of Argonne National Laboratory.

For example, Argonne’s researchers will seek a COVID-19 vaccine by simulating a key part of a protein spike on a coronavirus that’s made up of as many as 1.5 million atoms.

The molecule “is a beast, but the A100 lets us accelerate simulations of these subsystems so we can understand how this virus infects humans,” said Arvind Ramanathan, a computational biologist at Argonne National Laboratory, which will use a cluster of 24 NVIDIA DGX A100 systems.

In other efforts, “we will see substantial improvement in drug discovery by scanning millions and billions of drugs at a time. And we may see things we could never see before, like how two proteins bind to one another,” he said.

A100 Puts AI in the Scientific Loop

“Much of this work is hard to simulate on a computer, so we use AI to intelligently guide where and when we will sample next,” said Ramanathan.

It’s part of an emerging trend of scientists using AI to steer simulations. The GPUs will then cut the time to process biological samples by “at least two orders of magnitude,” he added.
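The general shape of such a steered loop can be sketched in a few lines. This is a toy illustration only, not Argonne's method: the expensive_simulation function is a hypothetical stand-in for a costly simulation, and a hand-written novelty score takes the place of the trained model that would guide sampling in practice.

```python
import math


def expensive_simulation(x):
    """Hypothetical stand-in for a costly simulation run at parameter x."""
    return math.sin(3.0 * x) + 0.5 * x


def novelty(x, sampled_xs):
    """Distance to the nearest point already simulated; larger means less explored.

    A real workflow would replace this score with a learned model.
    """
    return min(abs(x - s) for s in sampled_xs)


def steer(candidates, initial_xs, budget):
    """Spend a fixed budget of simulation runs on the highest-scoring candidates."""
    results = {x: expensive_simulation(x) for x in initial_xs}
    for _ in range(budget):
        remaining = [c for c in candidates if c not in results]
        if not remaining:
            break
        # Pick the candidate the score ranks as most worth simulating next.
        next_x = max(remaining, key=lambda c: novelty(c, results))
        results[next_x] = expensive_simulation(next_x)
    return results


candidates = [i / 20 for i in range(21)]
results = steer(candidates, initial_xs=[0.0, 1.0], budget=5)
print(sorted(results))
```

The point of the pattern is that the scoring step is cheap compared with the simulation, so only the most informative candidates consume the expensive runs.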

Across the country, the National Energy Research Scientific Computing Center (NERSC) is poised to become the largest of the first wave of A100 users. The center in Berkeley, Calif., is working with Hewlett Packard Enterprise to deploy 6,200 of the GPUs in Perlmutter, its pre-exascale system.

“Across NERSC’s science and algorithmic areas, we have increased performance by up to 5x when comparing a single V100 GPU to a KNL CPU node on our current-generation Cori system, and we expect even greater gains with the A100 on Perlmutter,” said Sudip Dosanjh, NERSC’s director.

Exascale Computing Team Works on Simulations, AI

A team dedicated to exascale computing at NERSC has defined nearly 30 projects for Perlmutter that use large-scale simulations, data analytics or deep learning. Some projects blend HPC with AI, such as one using reinforcement learning to control light source experiments. Another employs generative models to reproduce expensive simulations at high-energy physics detectors.

Two of NERSC’s HPC applications have already prototyped the use of the A100 GPU’s double-precision Tensor Cores. They’re seeing significant increases in performance over previous-generation Volta GPUs.

Software optimized for the 10,000-way parallelism Perlmutter’s GPUs offer will be ready to run on future exascale systems, Christopher Daley, an HPC performance engineer at NERSC, said in a talk at GTC Digital. NERSC supports nearly a thousand scientific applications in areas such as astrophysics, Earth science, fusion energy and genomics.

“On Perlmutter, we need compilers that support all the programming models our users need and expect — MPI, OpenMP, OpenACC, CUDA and optimized math libraries. The NVIDIA HPC SDK checks all of those boxes,” said Nicholas Wright, NERSC’s chief architect.

German Effort to Map the Brain

AI will be the focus of some of the first applications for the A100 on a new 70-petaflops system designed by France’s Atos for the Jülich Supercomputing Center in western Germany.

One, called Deep Rain, aims to make fast, short-term weather predictions, complementing traditional systems that use large, relatively slow simulations of the atmosphere. Another project plans to construct an atlas of fibers in the human brain, assembled with deep learning from thousands of high-resolution 2D brain images.

The new A100 system at Jülich also will help researchers push the edges of understanding the strong forces binding quarks, the sub-atomic building blocks of matter. At the macro scale, a climate science project will model the Earth’s surface and subsurface water flow.

“Many of these applications are constrained by memory,” said Dirk Pleiter, a theoretical physicist who manages a research team in applications-oriented technology development at Jülich. “So, what is extremely interesting for us is the increased memory footprint and memory bandwidth of the A100,” he said.

The new GPU’s ability to accelerate double-precision math by up to 2.5x is another feature researchers are keen to harness. “I’m confident when people realize the opportunities of more compute performance, they will have a strong incentive to use GPUs,” he added.

Data-Hungry System Likes Fast NVLink

Some 230 miles south of Jülich, the Karlsruhe Institute of Technology (KIT) is partnering with Lenovo to build a new 17-petaflops system that will pack 740 A100 GPUs on an NVIDIA Mellanox 200 Gbit/s InfiniBand network. It will tackle grand challenges that include:

  • Atmospheric simulations at the kilometer scale for climate science
  • Research to fight COVID-19, including support for Folding@home
  • Explorations of particle physics beyond the Higgs boson for the Large Hadron Collider
  • Research on next-generation materials that could replace lithium-ion batteries
  • AI applications in robotics, language processing and renewable energy

“We focus on data-intensive simulations and AI workflows, so we appreciate the third-generation NVLink connecting the new GPUs,” said Martin Frank, director of KIT’s supercomputing center and a professor of computational science and math.

“We also look forward to the multi-instance GPU feature that effectively gives us up to 28 GPUs per node instead of four — that will greatly benefit many of our applications,” he added.
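The arithmetic behind that figure is straightforward. The sketch below assumes each A100 can be partitioned into at most seven GPU instances, the limit NVIDIA documents for the multi-instance GPU feature; the four GPUs per node comes from the quote above, and the names are illustrative.

```python
# ASSUMPTION: an A100 can be split into at most seven MIG instances,
# per NVIDIA's documentation; this limit is not stated in the article.
MAX_MIG_INSTANCES_PER_A100 = 7

# From the quote: four physical GPUs per node.
GPUS_PER_NODE = 4


def schedulable_gpus(gpus_per_node, instances_per_gpu):
    """GPU instances a scheduler can hand out on one node."""
    if not 1 <= instances_per_gpu <= MAX_MIG_INSTANCES_PER_A100:
        raise ValueError("instances_per_gpu must be between 1 and 7")
    return gpus_per_node * instances_per_gpu


full_split = schedulable_gpus(GPUS_PER_NODE, MAX_MIG_INSTANCES_PER_A100)
no_split = schedulable_gpus(GPUS_PER_NODE, 1)
print(f"Without partitioning: {no_split} GPUs per node")
print(f"Fully partitioned:    {full_split} GPU instances per node")
```

Each instance gets a smaller slice of the GPU, so the trade-off favors workloads made up of many small jobs that would not fill a whole A100.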

Just outside Munich, the computer center for the Max Planck Institute is working with Lenovo to build a system called Raven-GPU, powered by 768 NVIDIA A100 GPUs. It will support work in fields like astrophysics, biology, theoretical chemistry and advanced materials science. The research institute aims to have Raven-GPU installed by the end of the year and is now taking requests for help porting applications to the A100.

Indiana System Counters Cybersecurity Threats

Finally, Indiana University is building Big Red 200, a 6-petaflops system expected to become the fastest university-owned supercomputer in the U.S. It will use 256 A100 GPUs.

Announced in June, Big Red 200 is among the first systems at an academic center to adopt the Cray Shasta technology from Hewlett Packard Enterprise that others will use in future exascale systems.

Big Red 200 will apply AI to counter cybersecurity threats. It also will tackle grand challenges in genetics to help enable personalized healthcare as well as work in climate modeling, physics and astronomy.

Photo at top: Shyh Wang Hall at UC Berkeley will be the home of NERSC’s Perlmutter supercomputer.