NVIDIA Accelerates Science and Engineering With CUDA-X Libraries Powered by GH200 and GB200 Superchips

Libraries on superchip architectures accelerate computational engineering tools by up to 11x and enable calculations up to 5x larger.
by Timothy Costa

Scientists and engineers of all kinds are equipped to solve tough problems a lot faster with NVIDIA CUDA-X libraries powered by NVIDIA GB200 and GH200 superchips.

As announced today at NVIDIA GTC, the global AI conference, developers can now take advantage of tighter automatic integration and coordination between CPU and GPU resources, enabled by CUDA-X libraries working with these latest superchip architectures. The result is up to 11x speedups for computational engineering tools and 5x larger calculations compared with traditional accelerated computing architectures.

This greatly accelerates and improves workflows in engineering simulation, design optimization and more, helping scientists and researchers reach groundbreaking results faster.

NVIDIA released CUDA in 2006, opening up a world of applications to the power of accelerated computing. Since then, NVIDIA has built more than 900 domain-specific NVIDIA CUDA-X libraries and AI models, making it easier to adopt accelerated computing and driving incredible scientific breakthroughs. Now, CUDA-X brings accelerated computing to a broad new set of engineering disciplines, including astronomy, particle physics, quantum physics, automotive, aerospace and semiconductor design.

The NVIDIA Grace CPU architecture delivers a significant boost to memory bandwidth while reducing power consumption. And NVIDIA NVLink-C2C interconnects provide such high bandwidth that the GPU and CPU can share memory, allowing developers to write less-specialized code, run larger problems and improve application performance.
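To make this concrete, here is a minimal sketch of that programming model, using CuPy’s managed-memory allocator as a stand-in for the hardware-coherent NVLink-C2C path on an actual Grace-based system (sizes are illustrative):

```python
# Sketch: a single managed (unified) allocation visible to both CPU and GPU.
# On a Grace superchip, NVLink-C2C provides this coherency in hardware; here
# CuPy's managed allocator approximates the same programming model.
import cupy as cp

# Route CuPy allocations through cudaMallocManaged so they can exceed device
# memory and be paged between CPU and GPU on demand.
cp.cuda.set_allocator(cp.cuda.MemoryPool(cp.cuda.malloc_managed).malloc)

n = 1 << 20                    # kept small for the sketch
x = cp.arange(n, dtype=cp.float32)

x *= 2.0                       # GPU kernel operates on the managed allocation
print(cp.asnumpy(x[:4]))       # CPU reads the same data: [0. 2. 4. 6.]
```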

Accelerating Engineering Solvers With NVIDIA cuDSS

NVIDIA’s superchip architectures allow users to extract greater performance from the same underlying GPU by making more efficient use of CPU and GPU processing capabilities.

The NVIDIA cuDSS library is used to solve large engineering simulation problems involving sparse matrices for applications such as design optimization, electromagnetic simulation workflows and more. cuDSS uses Grace CPU memory and the high-bandwidth NVLink-C2C interconnect to factorize and solve large matrices that normally wouldn’t fit in device memory. This enables users to solve extremely large problems in a fraction of the time.
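cuDSS itself is a C library; as a language-neutral illustration of the factorize-then-solve pattern it accelerates, the sketch below uses SciPy’s sparse LU on the CPU, with a made-up 1D Poisson system standing in for a real engineering matrix:

```python
# CPU sketch of the sparse direct-solver pattern that cuDSS accelerates on GPU:
# factorize a large sparse matrix once, then reuse the factors for many solves.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 10_000
# Sparse tridiagonal system (1D Poisson stencil), a stand-in for the large
# matrices arising in design optimization and electromagnetic simulation.
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")

lu = spla.splu(A)              # factorization: the dominant cost
for k in range(3):             # repeated solves reuse the factors cheaply
    b = np.random.default_rng(k).standard_normal(n)
    x = lu.solve(b)
    print(k, np.linalg.norm(A @ x - b))   # residual should be ~0
```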

The coherent shared memory between the GPU and the Grace CPU minimizes data movement, significantly reducing overhead for large systems. For a range of large computational engineering problems, using cuDSS hybrid memory to tap the Grace CPU memory and superchip architecture accelerated the most demanding solution steps by up to 4x on the same GPU.

Ansys has integrated cuDSS into its HFSS solver, delivering significant performance enhancements for electromagnetic simulations. With cuDSS, HFSS software achieves up to an 11x speed improvement for the matrix solver.

Altair OptiStruct has also adopted the cuDSS Direct Sparse Solver library, substantially accelerating its finite element analysis workloads.

These performance gains are achieved by optimizing key operations on the GPU while intelligently using CPUs for shared memory and heterogeneous CPU and GPU execution. cuDSS automatically detects areas where CPU utilization provides additional benefits, further enhancing efficiency.

Scaling Up at Warp Speed With Superchip Memory

The GB200 and GH200 architectures’ NVLink-C2C interconnects provide CPU and GPU memory coherency, making it possible to scale memory-limited applications on a single GPU.

Many engineering workloads are limited by scale: achieving the resolution necessary to design equipment with intricate components, such as aircraft engines, requires massive simulations. By seamlessly reading and writing between CPU and GPU memories, engineers can easily implement out-of-core solvers that process data larger than GPU memory.

For example, using NVIDIA Warp, a Python-based framework for accelerating data generation and spatial computing applications, Autodesk performed simulations of up to 48 billion cells on eight GH200 nodes. That is more than 5x larger than the simulations possible on eight NVIDIA H100 nodes.
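For a flavor of what Warp code looks like, here is a minimal, self-contained kernel (unrelated to Autodesk’s solver): a function written in Python that Warp JIT-compiles and launches across a grid of threads.

```python
# Minimal NVIDIA Warp example: a Python function compiled into a kernel and
# launched over n threads on the default device (GPU if available, else CPU).
import numpy as np
import warp as wp

wp.init()

@wp.kernel
def scale(values: wp.array(dtype=float), factor: float):
    i = wp.tid()                          # index of this thread
    values[i] = values[i] * factor

n = 1024
data = wp.array(np.ones(n, dtype=np.float32))
wp.launch(scale, dim=n, inputs=[data, 3.0])
print(data.numpy()[:4])                   # [3. 3. 3. 3.]
```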

Powering Quantum Computing Research With NVIDIA cuQuantum

Quantum computers promise to accelerate problems that are core to many science and industry disciplines. Shortening the time to useful quantum computing rests heavily on the ability to simulate extremely complex quantum systems.

Simulations allow researchers to develop new algorithms today that will run at scales suitable for tomorrow’s quantum computers. They also play a key role in improving quantum processors, running complex simulations of performance and noise characteristics of new qubit designs.

So-called state vector simulations of quantum algorithms require matrix operations to be performed on exponentially large vector objects that must be stored in memory. Tensor network simulations, on the other hand, simulate quantum algorithms through tensor contractions and can enable hundreds or thousands of qubits to be simulated for certain important classes of applications.
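The memory pressure of state vector simulation follows directly from that exponential growth: n qubits require 2^n complex amplitudes. A quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope memory cost of state vector simulation:
# n qubits -> 2**n complex amplitudes at 16 bytes each (complex128).
for n in (30, 34, 40):
    amplitudes = 2 ** n
    gib = amplitudes * 16 / 2 ** 30
    print(f"{n} qubits: {amplitudes:,} amplitudes, {gib:,.0f} GiB")
# 30 qubits fit on a single GPU (~16 GiB); 34 qubits need ~256 GiB;
# 40 qubits need ~16 TiB, far beyond any single device's memory.
```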

The NVIDIA cuQuantum library accelerates these workloads. cuQuantum is integrated with every leading quantum computing framework, so all quantum researchers can tap into simulation performance with no code changes.
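For example, cuQuantum’s Python package exposes an einsum-style interface for tensor network contraction. The minimal sketch below, with illustrative tensor shapes, assumes the cuquantum-python package, which plans and executes the contraction with cuTensorNet:

```python
# Minimal tensor network contraction with cuQuantum's Python API: cuTensorNet
# finds a contraction path and executes it on the GPU.
import numpy as np
from cuquantum import contract

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 8))
b = rng.standard_normal((8, 8))
c = rng.standard_normal((8, 8))

# Contract a small three-tensor chain: result[i,l] = sum_jk a[i,j] b[j,k] c[k,l]
result = contract("ij,jk,kl->il", a, b, c)
print(result.shape)                       # (8, 8)
```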

Simulations of quantum algorithms are generally limited in scale by memory requirements. The GB200 and GH200 architectures provide an ideal platform for scaling up quantum simulations, as they enable large CPU memory to be used without bottlenecking performance. A GH200 system is up to 3x faster than an x86-based H100 system on quantum computing benchmarks.

Learn more about CUDA-X libraries, attend the GTC session on how math libraries can help accelerate applications on NVIDIA Blackwell GPUs and watch NVIDIA founder and CEO Jensen Huang’s GTC keynote.