by Will Ramey

The CUDA 4.0 release candidate is exciting because it includes several significant new features that increase the flexibility of the CUDA programming model without sacrificing forward compatibility. This release is all about making it easier than ever for developers to turn existing CPU-only applications into massively parallel applications that make the best use of both the CPU and the GPU.

One of the key innovations in this release is Unified Virtual Addressing.  UVA takes advantage of the 64-bit addressing in Fermi architecture GPUs to provide applications with a single contiguous address space that includes all of the CPU and GPU memory in the system.   Developers can now design and use much simpler interfaces because they don’t have to distinguish between host pointers and device pointers.  Now, even on dual-socket systems with several GPUs, a pointer is a pointer is a pointer.
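To make the "a pointer is a pointer" idea concrete, here is a minimal sketch using the CUDA runtime API. The helper names `copyAnywhere` and `uvaRoundTrip` are made up for illustration, and the sketch assumes a 64-bit application running on a GPU that supports UVA. Host buffers are allocated with `cudaMallocHost` so the runtime knows about them.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: copies n floats between any two buffers the CUDA
// runtime knows about. With UVA, cudaMemcpyDefault lets the runtime infer
// the copy direction from the pointer values themselves.
cudaError_t copyAnywhere(float *dst, const float *src, size_t n) {
    return cudaMemcpy(dst, src, n * sizeof(float), cudaMemcpyDefault);
}

// Hypothetical helper: sends data host -> device -> host through the same
// copyAnywhere() call and reports whether it came back intact.
bool uvaRoundTrip(size_t n) {
    float *hostIn = nullptr, *hostOut = nullptr, *dev = nullptr;
    const size_t bytes = n * sizeof(float);

    bool ok = cudaMallocHost((void **)&hostIn, bytes) == cudaSuccess &&
              cudaMallocHost((void **)&hostOut, bytes) == cudaSuccess &&
              cudaMalloc((void **)&dev, bytes) == cudaSuccess;
    if (ok) {
        for (size_t i = 0; i < n; ++i) hostIn[i] = (float)i;
        // The same interface handles both directions; no host/device flag.
        ok = copyAnywhere(dev, hostIn, n) == cudaSuccess &&
             copyAnywhere(hostOut, dev, n) == cudaSuccess;
        for (size_t i = 0; ok && i < n; ++i) ok = (hostOut[i] == hostIn[i]);
    }

    // Freeing a null pointer is a no-op, so cleanup is safe on any path.
    cudaFree(dev);
    cudaFreeHost(hostIn);
    cudaFreeHost(hostOut);
    return ok;
}
```

A library built this way can expose one copy routine instead of separate host-to-device and device-to-host variants.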

Another very cool, and highly requested, feature in this release is NVIDIA GPUDirect peer-to-peer communication. As systems with multiple GPUs have become increasingly common, more developers have asked for the ability to transfer data directly between GPUs (peer-to-peer) without an intermediate copy to system memory. In this release, GPUDirect v2.0 enables applications to use the shortest possible path to transfer data directly between GPUs, and also enables kernels running on one GPU to directly read or write memory attached to another GPU in the system. I’m looking forward to seeing all the amazing things developers will accomplish with these new capabilities.
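As a rough sketch of what a peer-to-peer transfer can look like with the runtime API, the code below checks whether two devices can reach each other, enables peer access where possible, and then copies with `cudaMemcpyPeer`. The function names `enablePeerAccess` and `copyBetweenGpus` are hypothetical, and error handling is kept minimal. If peer access is not available, `cudaMemcpyPeer` still performs the copy, just not over the direct path.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: lets device `from` access memory on device `to`.
// Returns true if peer access is enabled (or was already enabled).
bool enablePeerAccess(int from, int to) {
    int canAccess = 0;
    if (cudaDeviceCanAccessPeer(&canAccess, from, to) != cudaSuccess || !canAccess)
        return false;

    int previous = 0;
    cudaGetDevice(&previous);
    cudaSetDevice(from);
    cudaError_t err = cudaDeviceEnablePeerAccess(to, 0);  // flags must be 0
    cudaSetDevice(previous);  // restore the caller's current device

    if (err == cudaErrorPeerAccessAlreadyEnabled) {
        cudaGetLastError();  // clear the benign "already enabled" error
        return true;
    }
    return err == cudaSuccess;
}

// Hypothetical helper: copies `bytes` from a buffer on srcDev to a buffer
// on dstDev, enabling peer access in both directions when supported.
cudaError_t copyBetweenGpus(void *dst, int dstDev,
                            const void *src, int srcDev, size_t bytes) {
    enablePeerAccess(dstDev, srcDev);
    enablePeerAccess(srcDev, dstDev);
    return cudaMemcpyPeer(dst, dstDev, src, srcDev, bytes);
}
```

Once peer access is enabled, a kernel launched on one device can also dereference a pointer that was allocated on the other, which is the direct read/write case described above.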

Here’s a quick summary of the key features in CUDA Toolkit v4.0:

Easier Application Porting

  • Share GPUs across multiple host threads (OpenMP, pthreads, etc.)
  • Use all GPUs in the system concurrently from a single host thread
  • Pin the system memory you already have and just copy your data directly to/from the GPU
  • Dynamic GPU memory management using C++ new/delete in device code
  • New libraries in the CUDA Toolkit make it easy to get the benefits of GPU acceleration without having to write all of the code yourself
    • Thrust library of templated performance primitives such as sort, reduce, etc.
    • NVIDIA Performance Primitives (NPP) library for image/video processing
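Two of the porting features above combine naturally: pinning memory the application already owns, and calling Thrust instead of writing kernels by hand. The sketch below is one possible way to do that, not a recommended pattern. The function name `sortAndSumOnGpu` is hypothetical, and the code assumes that falling back to an ordinary pageable copy is acceptable when `cudaHostRegister` fails (for example, because of alignment requirements on a given toolkit version).

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>

// Hypothetical helper: sorts `data` in place on the GPU and stores the sum
// of its elements in `sum`. `data` is ordinary host memory that the
// application already owns. Returns false if a CUDA call fails.
bool sortAndSumOnGpu(std::vector<int> &data, long long &sum) {
    sum = 0;
    if (data.empty()) return true;
    const size_t bytes = data.size() * sizeof(int);

    // Pin the existing allocation rather than allocating a new pinned buffer.
    // If registration fails, the copies below still work on pageable memory.
    bool pinned = cudaHostRegister(data.data(), bytes,
                                   cudaHostRegisterDefault) == cudaSuccess;
    if (!pinned) cudaGetLastError();  // clear the error and carry on

    int *dev = nullptr;
    bool ok = cudaMalloc((void **)&dev, bytes) == cudaSuccess &&
              cudaMemcpy(dev, data.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        // Wrap the raw device pointer so Thrust runs its algorithms on the GPU.
        thrust::device_ptr<int> first(dev);
        thrust::sort(first, first + data.size());
        sum = thrust::reduce(first, first + data.size(), 0LL);  // 64-bit sum
        ok = cudaMemcpy(data.data(), dev, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(dev);
    if (pinned) cudaHostUnregister(data.data());
    return ok;
}
```

The caller keeps working with its own `std::vector`; only the body of the function knows that a GPU is involved.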

Faster Multi-GPU Programming

  • Unified Virtual Addressing
  • GPUDirect v2.0 support for Peer-to-Peer Communication

New & Improved Developer Tools

  • Automated Performance Analysis in Visual Profiler
  • C++ debugging in cuda-gdb
  • GPU binary disassembler for Fermi architecture (cuobjdump)

We’ve also published a presentation with additional details and diagrams at:

So, whether you’re already a seasoned CUDA developer or just getting started porting your applications to the GPU, the new CUDA 4.0 release has some great new features for you.  Please give them a try, and let us know what you’d like us to add or improve next.

The CUDA Toolkit 4.0 release candidate is available today to thousands of registered developers.  If you haven’t signed up yet, please take a few minutes to complete the free application form: