An Engineer Recalls How AI Broke the Exascale Barrier

Hard work in deep learning back in 2018 set the stage for a new era in computing that we celebrate on Exascale Day.
by Dion Harris

Thorsten Kurth still remembers the night he discovered his team broke the exascale barrier.

On the couch at home at 9 p.m., he was poring over the latest results from one of the first big jobs run on Summit, then the world’s top supercomputer, based at Oak Ridge National Laboratory.

The 12-person team had spent nights and weekends seeking a way that AI could track hundreds of hurricanes and atmospheric rivers buried in terabytes of historical climate data.

Only a few weeks earlier, their software had failed to run on more than 64 of the system’s nodes.

But this time — just two days before a paper on the work was due — it exercised 4,560 of Summit’s 4,608 nodes to deliver the results. In the process, it achieved 1.13 exaflops of mixed-precision AI performance.

“That was a good feeling, a lot of hard work paid off,” recalled Kurth of the work he led in 2018 while at Lawrence Berkeley National Laboratory.
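For a rough sense of what that run implied per node, here’s a quick back-of-the-envelope sketch in Python. It isn’t from the team’s code, just simple division using the figures quoted above.

```python
# Back-of-the-envelope: per-node throughput implied by the run above.
# Uses only the figures quoted in this article.
total_ops_per_sec = 1.13e18   # 1.13 exaflops, mixed-precision AI
nodes_used = 4_560            # Summit nodes exercised in the run

per_node = total_ops_per_sec / nodes_used
print(f"~{per_node / 1e12:.0f} teraflops per node")  # prints ~248
```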

Entering the Exascale Era

Today, we celebrate the work of everyone who’s cracked a quintillion operations per second.

That’s a billion billion, or 10 to the 18th power, which is why we mark Exascale Day on Oct. 18 (10/18).
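To get an intuition for the size of that number, here’s a tiny Python sketch, plain arithmetic rather than anything from the original work, showing how long a quintillion operations would take at one operation per second:

```python
# How long would a quintillion (10**18) operations take at
# one operation per second?
ops = 10**18
seconds_per_year = 3600 * 24 * 365.25   # one Julian year in seconds

years = ops / seconds_per_year
print(f"~{years / 1e9:.1f} billion years")  # prints ~31.7 billion years
```

An exascale system gets through the same count in a single second.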

About the same time Kurth’s team was completing its work, researchers at Oak Ridge also entered the exascale era, hitting 1.8, then 2.36 exaflops on Summit while analyzing genomics to better understand the nature of opioid addiction.

COVID-19 Ignites Exascale Work

Since then, many others have pushed the limits of science with GPUs.

In March 2020, the Folding@home project put out a call for donated compute cycles from home computers to support research on the COVID-19 virus.

Ten days later, their virtual, distributed system surpassed 1.5 exaflops, creating a crowdsourced exascale supercomputer fueled in part by more than 356,000 NVIDIA GPUs.

AI Supercomputing Goes Global

Today, academic and commercial labs worldwide are deploying a new generation of accelerated supercomputers capable of exascale-class AI.

The latest is Polaris, a system capable of up to 1.4 exaflops of AI performance that Hewlett Packard Enterprise (HPE) is building at Argonne National Laboratory. Researchers will use it to advance cancer treatments, explore clean energy and push the limits of physics, work that will be accelerated by its 2,240 NVIDIA A100 Tensor Core GPUs.

Another powerful system stands at the National Energy Research Scientific Computing Center at Lawrence Berkeley National Laboratory. Perlmutter uses 6,159 A100 GPUs to deliver nearly 4 exaflops of AI performance for more than 7,000 researchers working on projects that include drawing the largest 3D map of the visible universe to date.

Polaris and Perlmutter also use NVIDIA’s software tools to help researchers prototype exascale applications.

Europe Erects Exascale AI Infrastructure

Atos will build an even larger AI supercomputer for Italy’s CINECA research center. Leonardo will pack 14,000 A100 GPUs on an NVIDIA Quantum 200Gb/s InfiniBand network to hit up to 10 exaflops of AI performance.

It’s one of eight systems in a regional network that backers call “an engine to power Europe’s data economy.”

One of Europe’s largest AI-capable supercomputers is slated to come online in Switzerland in 2023. Alps will be built by HPE at the Swiss National Supercomputing Centre using NVIDIA GPUs and Grace, our first data center CPU. It’s expected to deliver up to 20 exaflops of AI performance.

An Industrial HPC Revolution Begins

The move to high-performance AI extends beyond academic labs.

Advances in deep learning, combined with the simulation technology of accelerated computing, have put us at the beginning of an industrial HPC revolution, said NVIDIA founder and CEO Jensen Huang in a keynote earlier this year.

Selene uses a modular architecture based on the NVIDIA DGX SuperPOD

NVIDIA was an early player in this trend.

In the first days of the pandemic, we commissioned Selene, currently ranked as the world’s fastest industrial supercomputer. It helps train autonomous vehicles, refine conversational AI techniques and more.

In June, Tesla Inc. unveiled its own industrial HPC system to train deep neural networks for its electric cars. It packs 5,760 NVIDIA GPUs to deliver up to 1.8 exaflops.

Beyond the Numbers

Three years after winning a Gordon Bell Prize for breaking the exascale barrier, Kurth, now a senior software engineer at NVIDIA, sees the real fruit of his team’s labors.

Improved versions of the AI model they pioneered are now available online for any climate scientist to use. They handle in an hour what used to take weeks. Governments can use them to plan budgets for disaster response.

In the end, Exascale Day is all about the people, because to succeed at this level, “you need an excellent team with specialists who understand every aspect of what you are trying to do,” Kurth said.