Steel for the AI Age: DGX SuperPOD Reaches New Heights with NVIDIA DGX A100

by Tony Paikeday

Steel has long been a symbol of industrialization. In the age of AI, a new “building material” will serve as the cornerstone of modern data centers: the NVIDIA DGX A100.

Many of the biggest challenges and opportunities organizations now face are rooted in data. The DGX A100, the world’s most advanced AI system, empowers organizations to solve problems in record time while providing revolutionary elasticity and agility in delivering AI computing power across analytics, training and inference.

Last year we introduced the DGX SuperPOD, which combines multiple DGX systems and achieved top 20-class performance at a fraction of the cost and energy usage of typical supercomputers.

Today, we’re lifting the curtain on our second-generation SuperPOD, which offers record-breaking performance and was deployed in just three weeks. It dismantles the notion that it takes many months to build a world-class AI supercomputing cluster.

Built on NVIDIA DGX A100 systems and NVIDIA Mellanox network fabric, SuperPOD shows that it’s possible to deliver a platform that can reduce processing times on the world’s most complex language understanding models from weeks to under an hour.

Rethinking Infrastructure Scaling

Whether you need a supercomputing cluster to solve huge, monolithic problems, or a center of excellence to democratize access to resources across all of your researchers and developers, AI is a significant infrastructure commitment.

A big part of that has traditionally included pre-planning how big you’ll have to scale, and then laying down the network fabric to support that end goal from day one. This approach was needed to make growth possible, but it created significant upfront costs.

With NVIDIA Mellanox technology, we’re redefining the data center with an architecture that can parallelize the most complex problems and solve them as fast as possible. The DGX A100 comes with new Mellanox ConnectX-6 VPI network adapters supporting 200Gbps HDR InfiniBand, with up to nine interfaces per system. We’re taking advantage of Mellanox switching to make it easier to interconnect systems and achieve SuperPOD scale.
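For a rough sense of what that means per system, here’s a quick back-of-the-envelope calculation in Python. It simply multiplies out the figures above (up to nine interfaces at 200Gbps each), so it’s an upper bound on fabric bandwidth rather than a measured number:

```python
# Rough peak fabric bandwidth per DGX A100, using the figures above:
# up to nine ConnectX-6 interfaces, each running 200Gbps HDR InfiniBand.
interfaces = 9
gbps_per_interface = 200

total_gbps = interfaces * gbps_per_interface   # 1,800 Gb/s
total_gbytes_per_sec = total_gbps / 8          # ~225 GB/s
print(f"Peak fabric bandwidth per DGX A100: {total_gbps} Gb/s "
      f"(~{total_gbytes_per_sec:.0f} GB/s)")
```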

With DGX SuperPOD and DGX A100, we’ve designed the AI network fabric to make growth easier with a pay-as-you-grow model, while minimizing impact to operations along the way.

Deployment end states no longer need to be the starting point: we’ve modularized SuperPOD into scalable groups of 20 DGX A100 systems. Each group is supported by a two-tiered fat-tree switch network topology, built on Mellanox HDR InfiniBand, that delivers full bisection bandwidth with no oversubscription. And by adding a third switching tier, you can scale to thousands of systems using Dragonfly+ or fat-tree topologies as part of our extended reference designs.
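To make the topology concrete, here’s an illustrative Python sketch that sizes a generic non-oversubscribed two-tier leaf/spine fat tree for one 20-system group. The eight compute-fabric links per system and the 40-port HDR switch radix are assumptions made for the sketch, not the published SuperPOD reference design:

```python
# Illustrative sizing of a non-oversubscribed two-tier (leaf/spine) fat tree
# for one group of 20 systems. Per-system link count and switch radix are
# assumptions for this sketch, not NVIDIA's published reference architecture.
import math

SYSTEMS_PER_GROUP = 20
LINKS_PER_SYSTEM = 8        # assumed compute-fabric HDR links per DGX A100
LINK_GBPS = 200             # HDR InfiniBand per link
SWITCH_PORTS = 40           # assumed radix of each HDR switch

endpoints = SYSTEMS_PER_GROUP * LINKS_PER_SYSTEM       # 160 host links
# No oversubscription: each leaf splits its ports evenly,
# half down to systems and half up to spines.
down_per_leaf = SWITCH_PORTS // 2                      # 20 host-facing ports
leaves = math.ceil(endpoints / down_per_leaf)          # 8 leaf switches
uplinks = leaves * (SWITCH_PORTS - down_per_leaf)      # 160 uplinks
spines = math.ceil(uplinks / SWITCH_PORTS)             # 4 spine switches

bisection_tbps = endpoints * LINK_GBPS / 2 / 1000      # 16 Tb/s
print(f"{leaves} leaf and {spines} spine switches, "
      f"{bisection_tbps:.0f} Tb/s bisection bandwidth per 20-system group")
```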

With this new unit of scale, organizations can enjoy a more linear approach to growth, with more modest incremental spend associated with the addition of each 20-system module.

Expanding DGX SATURNV with SuperPOD

The DGX SATURNV powers NVIDIA’s most important work, from R&D and autonomous vehicle system development to gaming and robotics. And SATURNV isn’t static — it continually grows in response to business demand. This makes it the perfect proving ground for our new SuperPOD design.

Leading up to our DGX A100 announcement, our engineers deployed our newest SuperPOD to deliver approximately 700 petaflops of AI performance; a quick check of that math follows the list below. This expansion incorporated:

  • 140 DGX A100 systems
  • 1,120 NVIDIA A100 GPUs
  • 170 Mellanox Quantum 200G InfiniBand switches
  • 15km of optical cable
  • 4PB of high-performance storage
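
A back-of-the-envelope check shows how those figures hang together, using NVIDIA’s stated eight A100 GPUs and 5 petaflops of AI performance per DGX A100 system:

```python
# Sanity check of the expansion figures above, assuming NVIDIA's stated
# 8 A100 GPUs and 5 petaflops of AI performance per DGX A100 system.
systems = 140
gpus_per_system = 8
pflops_per_system = 5

print(f"GPUs: {systems * gpus_per_system}")                          # 1,120
print(f"AI performance: ~{systems * pflops_per_system} petaflops")   # ~700
```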

For the storage infrastructure in the expansion, we partnered with DDN. As one of our DGX POD partners, they’re helping us bring the performance and scale needed for our AI infrastructure offerings. SuperPOD let us put DDN technology to work supporting the most challenging workloads we could throw at our most advanced system.

The Best Architecture for Scaling

Not all AI projects require a DGX SuperPOD. But every organization aspiring to infuse their business with AI can leverage the power, agility and scalability of DGX A100 or a DGX POD.

Forward-looking organizations focus on protecting customer loyalty, reducing costs and distancing themselves from competitors. AI is uniquely beneficial in all of these areas.

But AI innovation moves fast, with models and datasets growing exponentially in size. The right architecture enables companies to tackle their biggest AI challenges now and in the future, without disruption along the way.

Learn how to hone your AI infrastructure strategy and about consumption models that make accessing a DGX A100 easier at www.nvidia.com/DGXA100.