Across Europe and the U.S., HPC developers are supercharging supercomputers with the power of Arm cores and accelerators inside NVIDIA BlueField-2 DPUs.
At Los Alamos National Laboratory (LANL) that work is one part of a broad, multiyear collaboration with NVIDIA that targets 30x speedups in computational multi-physics applications.
LANL researchers foresee significant performance gains using data processing units (DPUs) running on NVIDIA Quantum InfiniBand networks. They will pioneer techniques in computational storage, pattern matching and more using BlueField and its NVIDIA DOCA software framework.
An Open API for DPUs
The efforts will also help further define OpenSNAPI, an application interface anyone can use to harness DPUs. It’s a project of the Unified Communication Framework, a consortium enabling heterogeneous computing for HPC apps; its members include Arm, IBM, NVIDIA, U.S. national labs and U.S. universities.
LANL is already feeling the power of in-network computing, thanks to a DPU-powered storage system it created.
The Accelerated Box of Flash (ABoF) combines solid-state storage with DPU and InfiniBand accelerators to speed up performance-critical parts of a Linux file system. It’s up to 30x faster than similar storage systems and is set to become a key component in LANL’s infrastructure.
ABoF places computation near storage to minimize data movement and improve the efficiency of both simulation and data-analysis pipelines, a researcher said in a recent LANL blog.
Texas Rides a Cloud-Native Super
The Texas Advanced Computing Center (TACC) is the latest to adopt BlueField-2 in Dell PowerEdge servers. It will use the DPUs on an InfiniBand network to make its Lonestar6 system a development platform for cloud-native supercomputing.
TACC’s Lonestar6 serves a wide swath of HPC developers at Texas A&M University, Texas Tech University and the University of North Texas, as well as a number of research centers and faculty.
MPI Gets Accelerated
Twelve hundred miles to the northeast, researchers at Ohio State University showed how DPUs can make one of HPC’s most popular programming models run up to 26 percent faster.
By offloading critical parts of the message passing interface (MPI), they accelerated P3DFFT, a library used in many large-scale HPC simulations.
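The gain comes largely from overlap: nonblocking MPI collectives let the exchange progress, with help from the DPU, while host CPUs keep computing. Below is a minimal sketch of that pattern in C; it uses only stock MPI calls, not the MVAPICH DPU-offload API itself, and the buffer size and the compute_local_work routine are illustrative assumptions.

```c
// Sketch: overlapping an all-to-all exchange with local computation.
// With DPU offload, progress on MPI_Ialltoall can be driven by the
// BlueField DPU instead of burning host CPU cycles.
#include <mpi.h>
#include <stdlib.h>

// Hypothetical placeholder for work that doesn't depend on the exchange.
static void compute_local_work(double *data, int n) {
    for (int i = 0; i < n; i++)
        data[i] *= 2.0;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int chunk = 1024;  // doubles sent to each rank (assumed size)
    double *sendbuf = malloc((size_t)nprocs * chunk * sizeof(double));
    double *recvbuf = malloc((size_t)nprocs * chunk * sizeof(double));
    double *local   = malloc((size_t)chunk * sizeof(double));
    for (int i = 0; i < nprocs * chunk; i++) sendbuf[i] = (double)i;
    for (int i = 0; i < chunk; i++)          local[i]   = (double)i;

    // Start the exchange, then keep the CPU busy while it progresses.
    MPI_Request req;
    MPI_Ialltoall(sendbuf, chunk, MPI_DOUBLE,
                  recvbuf, chunk, MPI_DOUBLE, MPI_COMM_WORLD, &req);

    compute_local_work(local, chunk);   // overlapped computation

    MPI_Wait(&req, MPI_STATUS_IGNORE);  // exchange complete here

    free(sendbuf); free(recvbuf); free(local);
    MPI_Finalize();
    return 0;
}
```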
“DPUs are like assistants that handle work for busy executives, and they will go mainstream because they can make all workloads run faster,” said Dhabaleswar K. (DK) Panda, a professor of computer science and engineering at Ohio State who led the DPU work using his team’s MVAPICH open source software.
DPUs in HPC Centers, Clouds
Double-digit boosts are huge for supercomputers running HPC simulations like drug discovery or aircraft design. And cloud services can use such gains to increase their customers’ productivity, said Panda, who has fielded requests for his code from multiple HPC centers.
Quantum InfiniBand networks with features like NVIDIA SHARP help make his work possible.
“Others are talking about in-network computing, but InfiniBand supports it today,” he said.
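SHARP moves reductions into the InfiniBand switches themselves, so a collective like MPI_Allreduce completes without the hosts shuttling and summing partial results. The minimal sketch below shows the kind of collective that benefits; whether SHARP engages is a matter of the MPI library’s configuration, and the application code is unchanged either way, which is the point of the offload.

```c
// Sketch: the kind of collective SHARP accelerates. With in-network
// computing, the sum is formed in the InfiniBand switches, so the
// result arrives without the hosts reducing the data themselves.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Each rank contributes one value; every rank gets the global sum.
    double local = (double)rank, global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```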
Durham Does Load Balancing
Multiple research teams in Europe are accelerating MPI and other HPC workloads with BlueField DPUs.
For example, Durham University, in northern England, is developing software for load balancing MPI jobs using BlueField DPUs on a 16-node Dell PowerEdge cluster. Its work will pave the way for HPC facilities around the world to run better algorithms more efficiently, said Tobias Weinzierl, the project’s principal investigator.
DPUs in Cambridge, Munich
Researchers in Cambridge, London and Munich are also using DPUs.
For its part, University College London is exploring how to schedule tasks for a host system on BlueField-2 DPUs. It’s a capability that could be used, for example, to move data between host processors so it’s there when they need it.
BlueField DPUs inside Dell PowerEdge servers in the Cambridge Service for Data Driven Discovery offload security policies, storage frameworks and other jobs from host CPUs, maximizing the system’s performance.
Meanwhile, researchers in the computer architecture and parallel systems group at the Technical University of Munich are seeking ways to offload both MPI and operating system tasks with DPUs as part of a EuroHPC project.
Back in the U.S., researchers at Georgia Tech are collaborating with Sandia National Laboratories to speed work in molecular dynamics using BlueField-2 DPUs. A paper describing their work so far shows algorithms can be accelerated by up to 20 percent with no loss in the accuracy of simulations.
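The paper’s specifics aside, a common molecular dynamics pattern this kind of offload accelerates is overlapping a halo exchange of boundary particles with force computation on interior particles. Below is a minimal sketch assuming a 1D ring of ranks; the halo size and force routines are placeholders, not the authors’ code.

```c
// Sketch: nonblocking halo exchange overlapped with interior force
// computation, a common MD pattern that communication offload helps.
// Ranks form a 1D ring; halo sizes and force kernels are illustrative.
#include <mpi.h>
#include <stdlib.h>

#define HALO 256  // particles exchanged with each neighbor (assumed)

static void compute_interior_forces(double *pos, int n) { /* stub */ }
static void compute_boundary_forces(double *pos, int n,
                                    double *left, double *right) { /* stub */ }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    int left  = (rank - 1 + nprocs) % nprocs;
    int right = (rank + 1) % nprocs;

    const int n = 10000;
    double *pos        = calloc(n, sizeof(double));
    double *recv_left  = calloc(HALO, sizeof(double));
    double *recv_right = calloc(HALO, sizeof(double));

    // Post the exchange, then compute on particles that don't need it.
    MPI_Request reqs[4];
    MPI_Irecv(recv_left,  HALO, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recv_right, HALO, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(pos,            HALO, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(pos + n - HALO, HALO, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);

    compute_interior_forces(pos, n);            // overlapped with exchange

    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);  // halos have arrived
    compute_boundary_forces(pos, n, recv_left, recv_right);

    free(pos); free(recv_left); free(recv_right);
    MPI_Finalize();
    return 0;
}
```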
An Expanding Network
Earlier this month, researchers in Japan announced a system using the latest NVIDIA H100 Tensor Core GPUs riding our fastest and smartest network ever, the NVIDIA Quantum-2 InfiniBand platform.
NEC will build the approximately 6 PFLOPS, H100-based supercomputer for the Center for Computational Sciences at the University of Tsukuba. Researchers will use it for climatology, astrophysics, big data, AI and more.
Meanwhile, researchers like Panda are already thinking about how they’ll use the cores in BlueField-3 DPUs.
“It will be like hiring executive assistants with college degrees instead of ones with high school diplomas, so I’m hopeful more and more offloading will get done,” he quipped.