by Will Ramey

The CUDA 4.0 release candidate is exciting because it includes several significant new features that increase the flexibility of the CUDA programming model without sacrificing forward compatibility.  This release is all about making it easier than ever for developers to upgrade existing CPU-only applications to massively parallel applications that make best use of both the CPU and the GPU.

One of the key innovations in this release is Unified Virtual Addressing.  UVA takes advantage of the 64-bit addressing in Fermi architecture GPUs to provide applications with a single contiguous address space that includes all of the CPU and GPU memory in the system.   Developers can now design and use much simpler interfaces because they don’t have to distinguish between host pointers and device pointers.  Now, even on dual-socket systems with several GPUs, a pointer is a pointer is a pointer.
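To make this concrete, here's a minimal sketch of what UVA buys you (assuming a 64-bit application on a Fermi-class GPU with CUDA 4.0 or later). All API names are real CUDA runtime calls; with UVA, `cudaMemcpyDefault` lets the runtime infer the transfer direction from the pointers themselves:

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;
    float *host = NULL, *dev = NULL;
    cudaMallocHost(&host, n * sizeof(float));  // pinned host memory
    cudaMalloc(&dev, n * sizeof(float));       // device memory

    // No cudaMemcpyHostToDevice / cudaMemcpyDeviceToHost needed:
    // the driver resolves each pointer's location from the single
    // unified address space.
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyDefault);
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDefault);

    // If you do need to know where a pointer lives,
    // cudaPointerGetAttributes will tell you.
    cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, dev);

    cudaFreeHost(host);
    cudaFree(dev);
    return 0;
}
```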

Another very cool – and highly requested – feature in this release is NVIDIA GPUDirect peer-to-peer communications.  As systems with multiple GPUs have become increasingly common, more developers asked for the ability to transfer data directly between GPUs (peer-to-peer) without an intermediate copy to system memory.  In this release, GPUDirect v2.0 enables applications to use the shortest possible path to transfer data directly between GPUs, and also enables kernels running on one GPU to directly read or write memory attached to another GPU in the system.  I’m looking forward to seeing all the amazing things developers will accomplish with these new capabilities.
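A hedged sketch of what a peer-to-peer transfer looks like in code (assuming two Fermi-class GPUs in the same system, CUDA 4.0 or later). The API names below are real CUDA runtime calls:

```cuda
#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    // Can device 0 directly access device 1's memory?
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);

    const size_t bytes = 1 << 24;
    float *buf0 = NULL, *buf1 = NULL;
    cudaSetDevice(0);
    cudaMalloc(&buf0, bytes);
    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    if (canAccess) {
        cudaSetDevice(0);
        // Enable access from device 0 to device 1 (flags must be 0).
        cudaDeviceEnablePeerAccess(1, 0);

        // Direct GPU-to-GPU transfer, with no staging copy
        // through system memory.
        cudaMemcpyPeer(buf0, 0, buf1, 1, bytes);
    }

    cudaSetDevice(1);
    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```

Once peer access is enabled, kernels launched on device 0 can also dereference `buf1` directly, which is what makes the "read or write memory attached to another GPU" part possible.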

And, here’s a quick summary of all the key features in CUDA Toolkit v4.0:

Easier Application Porting:

  • Share GPUs across multiple host threads (OpenMP, pthreads, etc.)
  • Use all GPUs in the system concurrently from a single host thread
  • Pin the system memory you already have and just copy your data directly to/from the GPU
  • Dynamic GPU memory management using C++ new/delete in device code
  • New libraries in the CUDA Toolkit make it easy to get the benefits of GPU acceleration without having to write all of the code yourself
    • Thrust library of templated performance primitives such as sort, reduce, etc.
    • NVIDIA Performance Primitives (NPP) library for image/video processing
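As a taste of the library support, here's a short Thrust sketch (assuming the Thrust version bundled with CUDA 4.0): sorting a million random integers on the GPU with no hand-written kernels at all.

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main() {
    thrust::host_vector<int> h(1 << 20);
    for (size_t i = 0; i < h.size(); ++i)
        h[i] = rand();

    thrust::device_vector<int> d = h;   // copies host -> device
    thrust::sort(d.begin(), d.end());   // sorts on the GPU
    thrust::copy(d.begin(), d.end(), h.begin());  // device -> host
    return 0;
}
```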

Faster Multi-GPU Programming:

  • Unified Virtual Addressing
  • GPUDirect v2.0 support for Peer-to-Peer Communication

New & Improved Developer Tools:

  • Automated Performance Analysis in Visual Profiler
  • C++ debugging in cuda-gdb
  • GPU binary disassembler for Fermi architecture (cuobjdump)

We’ve also published a presentation with additional details and diagrams at:

So, whether you’re already a seasoned CUDA developer or just getting started porting your applications to the GPU, the new CUDA 4.0 release has some great new features for you.  Please give them a try, and let us know what you’d like us to add or improve next.

The CUDA Toolkit 4.0 release candidate is available today to thousands of registered developers.  If you haven’t signed up yet, please take a few minutes to complete the free application form:

Comments

  • Will Ramey

A recording of this morning’s webinar is now available. More details here:

  • Arpan Maheshwari

Hi Will,
Will this peer-to-peer communication be possible on all GPUs (Tesla, Quadro) or only on Fermi?

  • Will Ramey

    @Arpan: the peer-to-peer feature requires some hardware features that were first introduced with the Fermi architecture, so P2P is only supported on Fermi-based Tesla and Quadro GPUs.

  • Fernando Arias


    Just wanted to say we’re really impressed with the significant changes and improvements in version 4.0.

We haven’t gone through the release notes yet, but was the limitation on single memory allocations on Win7 and Vista ever addressed? We know it’s a limitation of the Windows Display Driver Model and not the fault of the NVIDIA driver.

The compute-only driver seems like a good workaround, but there is a possibility we will be deploying some code that requires both D3D and compute capability.

  • Arpan

Hi Will,
I have a question related to CUDA in general, but I want to know if there is any help for it in CUDA 4.0.

I want to calculate a 3D FFT (of very large order) using multiple GPUs (let’s say 4). So, my idea is to first create 4 OpenMP CPU threads, divide and send the data to the 4 GPUs, calculate 2D FFTs of the slices, bring the data back to the CPU, do a transposition, send the data to the 4 GPUs again and calculate 1D FFTs, then bring the data back to the CPU and do a final transposition.

According to the above plan, on the GPUs I want to fork some threads that would calculate the 2D FFTs and 1D FFTs. But the problem is that one cannot call CUFFT inside a kernel function (CUFFT functions are only callable from host code).

So, any suggestions?
Thanks in advance for the reply.


  • Will Ramey

    Hi Arpan,

    This would be a great question for the CUDA developer forums:

    You’ll get better information from other programmers than a marketing guy like me. 🙂

    BTW, there’s a copy of the webinar we did a couple weeks ago on CUDA 4.0 features here: