We promised that the Kepler-based NVIDIA Tesla K20 GPU – first unveiled in May at the GPU Technology Conference (GTC) – would be the highest-performance processor the HPC industry has ever seen. One reason: support for a technology called “Dynamic Parallelism,” that can speed up a wide range of tasks.
Dynamic Parallelism allows the GPU to operate more autonomously from the CPU by generating new work for itself at run-time, from inside a kernel. The concept is simple, but the impact is powerful: it can make GPU programming easier, particularly for algorithms traditionally considered difficult for GPUs such as divide-and-conquer problems.
To showcase its potential, I will use Quicksort—a universal requirement for all Computer Science 101 students—to show how Dynamic Parallelism cuts the lines of code needed for Quicksort in half while improving performance by 2x.
Under the Hood
Let’s begin with a bit of background. On GPUs based on the current Fermi architecture, there exists a one-way, fixed execution flow from the host CPU to the cores in the GPU. This is illustrated on the left side of the chart below.
With Dynamic Parallelism, the GPU is able to generate new work for itself without involving the CPU at all. This permits dynamic run-time decisions about what to do next, enabling much more complex algorithms than previously were possible (illustrated on the right side of the chart), while simultaneously releasing the CPU to conserve power or perform other work.
To handle this dynamic work, NVIDIA created a new hardware technology in Tesla K20 GPUs called the Grid Management Unit (GMU). This manages the complexities of dynamic execution at hardware speed – launching, suspending and resuming kernels, as well as tracking dependencies from multiple sources. A layer of system software running on the GPU interacts with the GMU, enabling the CUDA Runtime application-programming interface (API) to be used from within a kernel program.
Quick and Dirty with the Quicksort Algorithm
So now let’s move on to the Quicksort algorithm, which provides a great example of the power of Dynamic Parallelism.
First, a quick reminder of how it works. The goal is to sort an array of numbers, and I begin by picking a “pivot” value which I use to partition my array into two smaller arrays: one with values less than the pivot, and one with values equal or greater.
In the diagram below, I’m simply using the first element of each array as its pivot:
After partitioning the initial array, the algorithm then launches two new quick sorts on the two new arrays, producing four sub-arrays and so on until each sub-array contains just a single value; the result is put together and you’re done. It’s a classic “divide-and-conquer” algorithm because it breaks the problem into ever smaller pieces and solves them recursively.
Quicksort Made Easy – Cutting Lines of Code in Half
Now let’s take a look at the actual CUDA code for Quicksort, with and without Dynamic Parallelism.
Quicksort with Dynamic Parallelism
Quicksort without Dynamic Parallelism
Even if you aren’t a programmer you’ll notice that Quicksort with Dynamic Parallelism is half the size of the code without it. And it’s much easier to follow. Here’s why.
In Quicksort, the information needed to sort each stage depends on the stage before it. Without Dynamic Parallelism all of the launches must take place from the CPU, which means that the details of what to launch next must be passed back to the host after each stage. For simplicity, the example encapsulates this communication in a CPU/GPU work stack; this can be highly complex in its own right, requiring atomics, data management, and as much code as the Quicksort algorithm itself.
But, with Dynamic Parallelism the GPU performs its own launches on-the-fly, enabling each Quicksort to launch its two sub-sorts as soon as it has finished. There are no complex overheads like the CPU/GPU stack exchange, and no need for all the host code which manages the launches. The whole thing is shorter, easier to understand and as we shall see next, faster.
Dynamic Parallelism Boosts Performance
We benchmarked the above two approaches on the same Tesla K20 GPU, and the results are shown in the graph below: Quicksort with Dynamic Parallelism delivered a 2x speed-up compared to the code without Dynamic Parallelism.
The reason for the speedup is closely connected to launch strategy. The CPU-controlled code must wait for each stage to complete before launching into the next stage, requiring a cudaDeviceSynchronize() call at each stage. Not only is this is a heavy-duty operation, but it forces all sorts in a stage to finish before any sub-sort can begin – in effect, each stage goes as slowly as its longest operation.
By contrast, the dynamic parallel code simply launches work as and when it is needed. There’s no need to transfer data between GPU and CPU. There’s no need to wait for each stage to complete before starting the next. We get much better overlap of work with much lower management overhead.
As you can see, not only is the code much easier to write—and read—but it’s also significantly faster.
I believe Dynamic Parallelism will revolutionize GPU computing by delivering three powerful benefits:
- Programming the GPU will be easier than ever;
- Algorithms previously considered difficult will now accelerate easily on GPUs;
- The GPU depends significantly less on the CPU, enabling both to operate more efficiently.
With Dynamic Parallelism, the possibilities are endless. Over the next few weeks, I will write about two more of these powerful use cases: implementing complex algorithms by calling parallel libraries directly from the GPU, and maximizing GPU utilization by easily batching lots of small jobs together.
If you have any suggestions for codes that would benefit from Dynamic Parallelism please share them below. We are in the process of putting sample codes into the CUDA Toolkit, and would like to hear your ideas.
For more on Kepler and some of its cool features follow @NVIDIATesla.