Secure AI Data Centers at Scale: Next-Gen DGX SuperPOD Opens Era of Cloud-Native Supercomputing

by Tony Paikeday

As businesses extend the power of AI and data science to every developer, IT needs to deliver seamless, scalable access to supercomputing with cloud-like simplicity and security.

At GTC21, we introduced the latest NVIDIA DGX SuperPOD, which gives business, IT and their users a platform for securing and scaling AI across the enterprise, with the necessary software to manage it as well as a white-glove services experience to help operationalize it.

Solving AI Challenges of Every Size, at Massive Scale

Since its introduction, DGX SuperPOD has enabled enterprises to scale their development on infrastructure that can tackle problems of a size and complexity that were previously unsolvable in a reasonable amount of time. It’s AI infrastructure built and managed the way NVIDIA does its own.

As AI is infused into almost every aspect of modern business, the need for near-limitless access to the computational resources that power development has grown rapidly. This escalation in demand is exemplified by business-critical applications like natural language processing, recommender systems and clinical research.

Organizations often tap into the power of DGX SuperPOD in two ways. Some use it to solve huge, monolithic problems such as conversational AI, where the computational power of an entire DGX SuperPOD is brought to bear to accelerate the training of complex natural language processing models.

Others use DGX SuperPOD to service an entire company, providing multiple teams access to the system to support fluctuating needs across a wide variety of projects. In this mode, enterprise IT is often acting as a service provider, managing this AI infrastructure-as-a-service, with multiple users (perhaps even adversarial ones) who need and expect complete isolation of each other’s work and data.

DGX SuperPOD with BlueField DPU

Increasingly, businesses need to bring high-performance AI supercomputing into an operational mode where many developers can be assured their work is secure and isolated, as it is in the cloud, and where IT can manage the environment much like a private cloud, delivering resources to jobs, right-sized to the task, in a secure, multi-tenant environment.

This is called cloud-native supercomputing and it’s enabled by NVIDIA BlueField-2 DPUs, which bring accelerated, software-defined data center networking, storage, security and management services to AI infrastructure.

With a data processing unit optimized for enterprise deployment and 200 Gbps network connectivity, enterprises gain state-of-the-art, accelerated, fully programmable networking that implements zero-trust security to protect against breaches and isolate users and data, all with bare-metal performance.

Every DGX SuperPOD now has this capability with the integration of two NVIDIA BlueField-2 DPUs in each DGX A100 node within it. IT administrators can use the offload, accelerate and isolate capabilities of NVIDIA BlueField DPUs to implement secure multi-tenancy for shared AI infrastructure without impacting the AI performance of the DGX SuperPOD.

Infrastructure Management with Base Command Manager

Every week, NVIDIA manages thousands of AI workloads executed on our internal DGX SATURNV infrastructure, which includes over 2,000 DGX systems. To date, we’ve run over 1.2 million jobs on it supporting over 2,500 developers across more than 200 teams. We’ve also been developing state-of-the-art infrastructure management software that ensures every NVIDIA developer is fully productive as they perform their research and develop our autonomous systems technology, robotics, simulations and more.

The software supports all this work, simplifies and streamlines management, and lets our IT team monitor health, utilization, performance and more. We’re adding this same software, called NVIDIA Base Command Manager, to DGX SuperPOD so businesses can run their environments the way we do. We’ll continuously improve Base Command Manager, delivering the latest innovations to customers automatically.

White-Glove Services

Deploying AI infrastructure is more than just installing servers and storage in data center racks. When a business decides to scale AI, it needs a hand-in-glove experience that guides it from design to deployment to operationalization, without leaving its IT team to figure out how to run the system once the “keys” are handed over.

With DGX SuperPOD White Glove Services, customers enjoy a full lifecycle services experience that’s backed by proven expertise from install to operations. Customers benefit from pre-delivery performance certified on NVIDIA’s own acceptance cluster, which validates the deployed system is running at specification before it’s handed off.

White Glove Services also include a dedicated multidisciplinary NVIDIA team that covers everything from installation and infrastructure management to workflow, as well as identifying performance bottlenecks and applying optimizations. The services are designed to give IT leaders peace of mind and confidence as they entrust their business to DGX SuperPOD.

DGX SuperPOD at GTC21

To learn more about DGX SuperPOD and how you can consolidate AI infrastructure and centralize development across your enterprise, check out the two GTC sessions in which Charlie Boyle, vice president and general manager of DGX Systems, will cover our DGX SuperPOD news and more.

Register for GTC, which runs through April 16, for free.
