by Stephen Jones

We promised that the Kepler-based NVIDIA Tesla K20 GPU – first unveiled in May at the GPU Technology Conference (GTC) – would be the highest-performance processor the HPC industry has ever seen. One reason: support for a technology called “Dynamic Parallelism,” that can speed up a wide range of tasks.

Dynamic Parallelism allows the GPU to operate more autonomously from the CPU by generating new work for itself at run-time, from inside a kernel.  The concept is simple, but the impact is powerful: it can make GPU programming easier, particularly for algorithms traditionally considered difficult for GPUs such as divide-and-conquer problems.

To showcase its potential, I will use Quicksort—a universal requirement for all Computer Science 101 students—to show how Dynamic Parallelism cuts the lines of code needed for Quicksort in half while improving performance by 2x.

Under the Hood

Let’s begin with a bit of background. On GPUs based on the current Fermi architecture, there exists a one-way, fixed execution flow from the host CPU to the cores in the GPU. This is illustrated on the left side of the chart below.

Without dynamic parallelism on left; with on the right.
(Left): Without Dynamic Parallelism, (Right): With Dynamic Parallelism

With Dynamic Parallelism, the GPU is able to generate new work for itself without involving the CPU at all. This permits dynamic run-time decisions about what to do next, enabling much more complex algorithms than previously were possible (illustrated on the right side of the chart), while simultaneously releasing the CPU to conserve power or perform other work.

To handle this dynamic work, NVIDIA created a new hardware technology in Tesla K20 GPUs called the Grid Management Unit (GMU). This manages the complexities of dynamic execution at hardware speed – launching, suspending and resuming kernels, as well as tracking dependencies from multiple sources. A layer of system software running on the GPU interacts with the GMU, enabling the CUDA Runtime application-programming interface (API) to be used from within a kernel program.

Quick and Dirty with the Quicksort Algorithm

So now let’s move on to the Quicksort algorithm, which provides a great example of the power of Dynamic Parallelism.

First, a quick reminder of how it works. The goal is to sort an array of numbers, and I begin by picking a “pivot” value which I use to partition my array into two smaller arrays: one with values less than the pivot, and one with values equal or greater.

In the diagram below, I’m simply using the first element of each array as its pivot:

Using the first element of each array as its pivot.

After partitioning the initial array, the algorithm then launches two new quick sorts on the two new arrays, producing four sub-arrays and so on until each sub-array contains just a single value; the result is put together and you’re done. It’s a classic “divide-and-conquer” algorithm because it breaks the problem into ever smaller pieces and solves them recursively.

Quicksort Made Easy – Cutting Lines of Code in Half

Now let’s take a look at the actual CUDA code for Quicksort, with and without Dynamic Parallelism.

Quicksort with Dynamic Parallelism

 Quicksort with dynamic parallelism.

Quicksort without Dynamic Parallelism

 Quicksort without dynamic parallelism.

Even if you aren’t a programmer you’ll notice that Quicksort with Dynamic Parallelism is half the size of the code without it. And it’s much easier to follow.  Here’s why.

In Quicksort, the information needed to sort each stage depends on the stage before it.  Without Dynamic Parallelism all of the launches must take place from the CPU, which means that the details of what to launch next must be passed back to the host after each stage. For simplicity, the example encapsulates this communication in a CPU/GPU work stack; this can be highly complex in its own right, requiring atomics, data management, and as much code as the Quicksort algorithm itself.

But, with Dynamic Parallelism the GPU performs its own launches on-the-fly, enabling each Quicksort to launch its two sub-sorts as soon as it has finished. There are no complex overheads like the CPU/GPU stack exchange, and no need for all the host code which manages the launches. The whole thing is shorter, easier to understand and as we shall see next, faster.

Dynamic Parallelism Boosts Performance

We benchmarked the above two approaches on the same Tesla K20 GPU, and the results are shown in the graph below: Quicksort with Dynamic Parallelism delivered a 2x speed-up compared to the code without Dynamic Parallelism.

Quicksort performance results.

The reason for the speedup is closely connected to launch strategy. The CPU-controlled code must wait for each stage to complete before launching into the next stage, requiring a cudaDeviceSynchronize() call at each stage.  Not only is this is a heavy-duty operation, but it forces all sorts in a stage to finish before any sub-sort can begin – in effect, each stage goes as slowly as its longest operation.

By contrast, the dynamic parallel code simply launches work as and when it is needed. There’s no need to transfer data between GPU and CPU. There’s no need to wait for each stage to complete before starting the next. We get much better overlap of work with much lower management overhead.

As you can see, not only is the code much easier to write—and read—but it’s also significantly faster.

Limitless Possibilities

I believe Dynamic Parallelism will revolutionize GPU computing by delivering three powerful benefits:

  1. Programming the GPU will be easier than ever;
  2. Algorithms previously considered difficult will now accelerate easily on GPUs;
  3. The GPU depends significantly less on the CPU, enabling both to operate more efficiently.

With Dynamic Parallelism, the possibilities are endless. Over the next few weeks, I will write about two more of these powerful use cases: implementing complex algorithms by calling parallel libraries directly from the GPU, and maximizing GPU utilization by easily batching lots of small jobs together.

If you have any suggestions for codes that would benefit from Dynamic Parallelism please share them below. We are in the process of putting sample codes into the CUDA Toolkit, and would like to hear your ideas.

For more on Kepler and some of its cool features follow @NVIDIATesla.

Similar Stories

  • rtfss none

    First congratulations for sharing such early experiences on GK110 with all people..

    Hope you have time to answer some questions/suggestions (and hope aren’t much stupid after all):

    *to me the code is not the same as without dyn parallelism as you could also use 2 different streams and then should concurrent kernel execution (which should be working with Hyper-Q avoiding false dependencies).. should with that optimizations be even 2x speedup?

    *Is this sample scaling over GK104 I mean is code without dyn parallelism scaling over GK104 nearly proportionaly to  #SMs GK110 / #SMs GK104 assuming same SM clocks?

    *Seems this “simple quick sort code” is not the most efficient.. I mean it would not achieve peak speed of Duane Merill sort code integrated into Thrust.. Do you expect using dynamic parallelism code in the highest perf codes such as Duane Merill or MSORT to bring also such high speedups (2x) over current code on GK110.. i.e. can we expect thanks to that very high speedups in sorting in GK110 HW..

    *Seems currently best sorting rates for “small” arrays are by a wide margin achieved using HotSort from PixelIO.. Hope Nvidia and PixelIO collaborate to bring even further speed sorting codes using GK110 features..
    which brings to next question (sorry is not the correct location to post hope you can redirect or inform me whre to post while cuda forums are offline):

    *I think NV should should start something as Intel ManyCore Testing Lab where they put very $$$ HW (like 40cores on one host and son they said will have Xeons Phi) for testing remotely code for free to academics.. Hope you can put some Tesla K20s on start some service for students who can’t access  

    *Hope you put this simple on the SDK as you say..
    And for suggestions in SDK using Dyn parrallelism a “simple” multigrid code for say solving Poisson equation using Dyn parrallelism would be okay..

  • @heg53

    I have a new algorithm of my authorship. it is based on the principle of divide and conquer and I have programmed in C language. I think that could be calculated in this architecture. Do you could program for K-20 if I send the code?.

  • nvjones

     Thanks for the detailed response! You’ve asked a lot of questions, so I’ll work through them one at a time (I’m numbering based on your * comments above).

    1. The “without dynamic parallelism” code does use separate streams for each launch of a stage (see the last 3 lines of the code sample), so this should be a fair comparison. The limiter is not concurrency between kernels of a given stage, but rather that each stage must finish before the CPU can launch the next one.

    2. The
    graph shows both runs performed on a K20. I have not compared the host-launch
    algorithm performance between K10 and K20, although I would expect to see
    equivalent behaviour because the limiting factor is the inter-stage
    synchronisation overhead.

    3. Quicksort
    is a comparison sort, which allows sorting of arbitrary data, so it cannot be
    compared with Radix Sort which can only compare bitfield sorts (for example,
    you could not radix-sort complex numbers based on modulus). The time complexity
    of the two algorithms is different because they perform different tasks. There is no need to use dynamic parallelism for bitfield sorts such as
    radix sort, because there is no intrinsic data dependence and the GPU already
    performs very well at these.

    4. I’m afraid I can’t comment on HotSort’s approach, although it appears that they also do a bitfield sort and so again it is not directly comparable.

    5. Thank you for the suggestion.

    6. The
    SDK for CUDA 5.0 will indeed have both a simple and an advanced Quicksort
    sample. These are designed to illustrate the programming model so we’ve kept
    them as simple as possible, but you’ll be able to see how a basic partitioning
    function would work.Thanks again for good feedback!

  • nvjones

    We’re always interested to hear about algorithms where dynamic parallelism might apply – could you be more specific about what you are working on? I’m afraid I won’t personally be able to help with porting your code, but I’ve often found the people on to offer good advice and help.

  • rtfss none

     Really thanks for your time and detailed response.. it shows I have more excitement than knowledge in fast GPU sorters 🙂

    I won’t promise you but I think this is the last mega post here so I won’t take more of your time.. I think I have some good suggestions for 5.0 SDK I have been thinking lately.. perhaps sorry posting here but as said using Nvidia forums is no option now..

    Before that only say I’m happy NV is well recieving the suggestion on starting somthing like Intel Many Core testing lab.. hope it materializes soon after time of Tesla K20s release..

    Here are the suggestions for CUDA 5.0 SDK:
    (mostly I’m interested in graphics/CUDA interop for questions clearer at the bottom)

    *Seems CUDA 5.0RC ships with CUDA BLAS device library for GK110 altough no documentation present currently and perhaps even not support in CUBLAS headers for using it.. Hope this gets fixed and a simple simple using CUBLAS on device shows..

    *Would be good to ship some simple HyperQconckernel example based perhaps on concurrentkernel example in SDK that shows some case where Fermi conc kernel execution isn’t exploitable based false serializations in single HW queue wouldn’t it?

    *Seems new CUDA 5.0 texture object is for exposing Kepler bindless textures.. but if that objects can be created from OpenGL “regular” textures or even OGL bindless texture (via NV_bindless_texture) is a good question.. so would be good:
    ->a sample showing creation of CUDA texture objects from OGL regular and new OGL bindless textures..

    *Seems new CUDA 5.0 texture object (for using bindless tex and surfaces) allows using compressed tex formats (even new BPTC format) using cudaResourceViewDesc option.. Shame is that texture objects is Kepler feature only and I don’t have any Kepler right now but hope that is working.. so here some suggestions on compressed tex support in CUDA:

    ->ship some sample in SDK showing usage of compressed texes via texture objects
    ->in same sample or new show creation of tex object of OpenGL/D3D  compressed textures.. I don’t know if that is posible right now if not for future CUDA versions..

    now for future CUDA versions:

    ->allow compressed textures using “standard” textures.. this would allow working with that on Fermi and older GPUs..

    *In this blog NV has show also good scaling of MPI codes via HyperQ and seems that exploits some multiple host processes via single CUDA context tech using nv-

    proxy-control nv-proxy-server tech..
    make sense to expose on SDK documentation and/or example of exploting that feature on general processes not related to MPI codes i.e. how to exploit bassically

    HyperQ feature and concurrent kernels feature to execute concurrently multplie GPU processes via one CUDA context..

    *Can we expect to see in final SDK a sample exploiting the new H.264 HW encoder in Kepler GPUs.. would be good to expose a sample that shows how to directly

    encode visualization of GPU simulation results in a video stream .. for example extending Nbody sample that compressed an H.264 video of what is shown in screen

    would be a good example of H.264 HW encoder is independent of GPU SMs and don’t affect per of simulation

    Now I finish.. mostly..

    I’m interested in CudaRaster (opengl like via CUDA), Optix (raytracing language via CUDA) and VoxelPipe (3d raster i.e. voxelization via CUDA with shaders) so I think that for exetended programable pipelines to be a success one great step is to expose all fucntionality of graphics shading languages to CUDA so for that cause I’m expecting/suggesting future CUDA versions to expose (I think not supported now):
    *Multisample textures with new cuda functions(tex2DMSAA() and the like)
    *support for creating textures from OpenGL/D3D depth buffer (that is for hybrid raytracing right now requires depth to color copy in graphic APIs)
    *Compressed texes for Fermi and below: not only via object textures..
    *I think new OGL 4/D3D 11 gather4 instuctions aren’t in CUDA also yet..
    Please see for another guy interested in this (altough he is interested in OpenCL support)..

    Many thanks..

  • @heg53

    Very nice for taking the time to answer. With pleasure I send an article and source code in C language for you to see if you can run it in this architecture. I hope I can be useful and we can use it as an example of numerical efficiency. The numerical complexity of this algorithm is exponential and if all goes well may be reduced to polynomial complexity. If everything turned out well, I would expect us to publish something and you could advertise the algorithm as an example.Could you send a mail where to send this safely?


  • Jackson Beatty

    One or more code samples using Thrust in the context of dynamic parallelism would be very helpful.

  • Cristobal Navarro

    one small question, you first call the kernel with one block of 1 thread? not sure about that.

  • Sagar Rawal

    Congratulations on a fantastic demonstration of the hardware prowess of Tesla K20!

    The anticipation builds in everyone for the release of such a groundbreaking product!

  • david macpherson


    Thanks for the quicksort example. Perhaps including a link to the partition function & a makefile would be helpful.

  • Biao Wang

    I have the same question, what is the grid configuration which is substituted by three dot  in each kernel call?

  • Jack Jones

    It would be nice to have the partition source code…

  • Jonh Rain

    Hello. I just got a GTX660 thinking that i could use dynamic parallelism, but it seems only the GTX TITAN can do it? is that correct?

  • tpofofnt


    In a recent talk, you made the comment that the overhead for launching a kernel on the device is precisely the same as the overhead for launching a kernel on the host.  You went on to say that if device kernel launches were batched, say in a batch of 250 launches, the overhead for each kernel launch would be 1/250 the overhead of a  host kernel launch.

    What is this batch kernel launch from the device you speak of?  Are you referring to the situation where every thread from a parent kernel launches the same child kernel?

    Thanks in advance!

    Mitch Horton

  • Dave Ojika

    well said!! exactly what I am looking for. did you find anything on that?

  • Jake

    There are several errors in the Dynamic Parallelism code.

    The logic to check if there is a left sub-array to sort is currently:

    if(left < nright)

    but it should be

    if(left < nleft)

    because when left == nleft, this implies there is no left sub-array to sort, the if-check returns false, and there is no quicksort call on the left sub-array.

    Likewise, the recursive quicksort call for the left sub-array is incorrect. It should be

    quicksort<<>> (data, left, nleft);

    as the left sub-array’s bounds go from [left,….,nleft].

    It’s a bit laughable that you include the partition function with no comment on how it is implemented, as it is far from trivial to implement in an efficient way, especially if trying to avoid breaking out of the kernel and coordinating with the CPU.

    The main issue is how to coordinate the concurrent writes of doing the quicksort in place. I think a good way to go about it is having each block load its portion of the array into shared memory, do an in-place quicksort in shmem, keep track of the number of the values in shmem less than and greater than the pivot, then atomically update a pointer into the global array of where each block can write its left/right values. You would have two atomically updated pointers, a left and a right. Each block updates the pointer based on the size of its left and right portions in shmem so that the next block to update the pointer knows where it can safely write its shmem values to the global array.