The industrial age was fueled by steam. The digital age brought a shift through software. Now, the AI age is marked by the development of generative AI, agentic AI and AI reasoning, which enables models to process more data to learn and reason to solve complex problems.
Just as industrial factories transform raw materials into goods, modern businesses require AI factories to quickly transform data into insights that are scalable, accurate and reliable.
Orchestrating this new infrastructure is far more complex than building steam-powered factories ever was. State-of-the-art models demand supercomputing-scale resources. Any downtime risks derailing weeks of progress and reducing GPU utilization.
To enable enterprises and developers to manage and run AI factories at light speed, NVIDIA today announced NVIDIA Mission Control at the NVIDIA GTC global AI conference. It is the only unified operations and orchestration software platform that automates the complex management of AI data centers and workloads.
NVIDIA Mission Control enhances every aspect of AI factory operations. From configuring deployments to validating infrastructure to operating developer workloads, its capabilities help enterprises get frontier models up and running faster.
It is designed to easily transition NVIDIA Blackwell-based systems from pretraining to post-training — and now test-time scaling — with speed and efficiency. The software enables enterprises to easily pivot between training and inference workloads on their Blackwell-based NVIDIA DGX systems and NVIDIA Grace Blackwell systems, dynamically reallocating cluster resources to match shifting priorities.
In addition, Mission Control includes NVIDIA Run:ai technology to streamline operations and job orchestration for development, training and inference, boosting infrastructure utilization by up to 5x.
Mission Control’s autonomous recovery capabilities, supported by rapid checkpointing and automated tiered restart features, can deliver up to 10x faster job recovery compared with traditional methods that rely on manual intervention, boosting AI training and inference efficiency to keep AI applications in operation.
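To illustrate the general idea behind checkpoint-based recovery with tiered restarts, the following is a minimal Python sketch. It is not Mission Control code and does not use any NVIDIA API. The function names, the JSON checkpoint file, the simulated faults and the three restart tiers are all assumptions made for illustration only.

```python
import json
import os
import tempfile


class SimulatedFault(Exception):
    """Stands in for a hardware or software failure during a job."""


def save_checkpoint(path, state):
    # Write to a temporary file and rename it, so a crash mid-write
    # never leaves a half-written checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    # Resume from the last saved state, or start fresh if none exists.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run_job(path, total_steps, checkpoint_every, fail_at):
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if state["step"] in fail_at:
            fail_at.discard(state["step"])  # each simulated fault fires once
            raise SimulatedFault(f"fault at step {state['step']}")
        state["total"] += state["step"]  # placeholder for real training work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


# Hypothetical restart tiers, cheapest action first (an assumption).
RESTART_TIERS = ["restart process", "restart node", "reschedule on spare node"]


def supervise(path, total_steps, checkpoint_every, fail_at):
    # Rerun the job after each fault, escalating one tier per failure.
    attempts = []
    tier = 0
    while True:
        try:
            return run_job(path, total_steps, checkpoint_every, fail_at), attempts
        except SimulatedFault:
            if tier >= len(RESTART_TIERS):
                raise  # out of automated options; a human must step in
            attempts.append(RESTART_TIERS[tier])
            tier += 1


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    final_state, actions = supervise(ckpt, 10, 2, {5, 7})
    print(final_state, actions)
```

In this sketch, each restart resumes from the most recent checkpoint rather than from step zero, so only the work done since the last save is repeated. More frequent checkpoints shrink that repeated work at the cost of more writes.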
Built on decades of NVIDIA supercomputing expertise, Mission Control lets enterprises simply run models by minimizing time spent managing AI infrastructure. It automates the lifecycle of AI factory infrastructure for all NVIDIA Blackwell-based NVIDIA DGX systems and NVIDIA Grace Blackwell systems from Dell Technologies, Hewlett Packard Enterprise (HPE), Lenovo and Supermicro to make advanced AI infrastructure more accessible to the world’s industries.
Enterprises can further simplify and speed deployments of NVIDIA DGX GB300 and DGX B300 systems by using Mission Control with the NVIDIA Instant AI Factory service preconfigured in Equinix AI-ready data centers across 45 markets globally.
Advanced Software Provides Enterprises Uninterrupted Infrastructure Oversight
Mission Control automates end-to-end infrastructure management — including provisioning, monitoring and error diagnosis — to deliver uninterrupted operations. Plus, it continuously monitors every layer of the application and infrastructure stack to predict and identify sources of downtime and inefficiency — saving time, energy and costs.
Additional NVIDIA Mission Control software benefits include:
- Simplified cluster setup and provisioning with new automation and standardized application programming interfaces to speed time to deployment with integrated inventory management and visualizations.
- Seamless workload orchestration for simplified Slurm and Kubernetes workflows.
- Energy-optimized power profiles to balance power requirements and tune GPU performance for various workload types with developer-selectable controls.
- Autonomous job recovery to identify, isolate and recover from inefficiencies without manual intervention to maximize developer productivity and infrastructure resiliency.
- Customizable dashboards that track key performance indicators with access to critical telemetry data about clusters.
- On-demand health checks to validate hardware and cluster performance throughout the infrastructure lifecycle.
- Building management integration for enhanced coordination with building management systems to provide more control for power and cooling events, including rapid leakage detection.
Leading System Makers Bring NVIDIA Mission Control to Grace Blackwell Servers
Leading system makers plan to offer NVIDIA GB200 NVL72 and GB300 NVL72 systems with NVIDIA Mission Control.
Dell plans to offer NVIDIA Mission Control software as part of the Dell AI Factory with NVIDIA.
“The AI industrial revolution demands efficient infrastructure that adapts as fast as business evolves, and the Dell AI Factory with NVIDIA delivers with comprehensive compute, networking, storage and support,” said Ihab Tarazi, chief technology officer and senior vice president at Dell Technologies. “Pairing NVIDIA Mission Control software and Dell PowerEdge XE9712 and XE9680 servers helps enterprises scale models effortlessly to meet the demands of both training and inference, turning data into actionable insights faster than ever before.”
HPE will offer the NVIDIA GB200 NVL72 by HPE and GB300 NVL72 by HPE systems with NVIDIA Mission Control software.
“We are helping service providers and cutting-edge enterprises to rapidly deploy, scale, and optimize complex AI clusters capable of training trillion parameter models,” said Trish Damkroger, senior vice president and general manager, HPC & AI Infrastructure Solutions at HPE. “As part of our collaboration with NVIDIA, we will deliver NVIDIA Grace Blackwell rack-scale systems and Mission Control software with HPE’s global services and direct liquid cooling expertise to power the new AI era.”
Lenovo plans to update its Lenovo Hybrid AI Advantage with NVIDIA systems to include NVIDIA Mission Control software.
“Bringing NVIDIA Mission Control software to Lenovo Hybrid AI Advantage with NVIDIA systems empowers enterprises to navigate the demands of generative and agentic AI workloads with unmatched agility,” said Brian Connors, worldwide vice president and general manager of enterprise and SMB segment and AI, infrastructure solutions group, at Lenovo. “By automating infrastructure orchestration and enabling seamless transitions between training and inference workloads, Lenovo and NVIDIA are helping customers scale AI innovation at the speed of business.”
Supermicro plans to incorporate NVIDIA Mission Control software into its SuperCluster systems.
“Supermicro is proud to team with NVIDIA on a Grace Blackwell NVL72 system that is fully supported by NVIDIA Mission Control software,” said Cenly Chen, chief growth officer at Supermicro. “Running on Supermicro’s AI SuperCluster systems with NVIDIA Grace Blackwell, NVIDIA Mission Control software provides customers with a seamless management software suite to maximize performance on both current NVIDIA GB200 NVL72 systems and future platforms such as NVIDIA GB300 NVL72.”
Base Command Manager Offers Free Kickstart for AI Cluster Management
To help enterprises with infrastructure management, NVIDIA Base Command Manager software is expected to soon be available for free for up to eight accelerators per system, for any cluster size, with the option to purchase NVIDIA Enterprise Support separately.
Availability
NVIDIA Mission Control for NVIDIA DGX GB200 and DGX B200 systems is available now. NVIDIA GB200 NVL72 systems with Mission Control are expected to soon be available from Dell, HPE, Lenovo and Supermicro.
NVIDIA Mission Control is expected to become available for the latest NVIDIA DGX GB300 and DGX B300 systems, as well as GB300 NVL72 systems from leading global providers, later this year.
See notice regarding software product information.