# INNER GEEK: A CASUAL PROGRAMMER LOOKS AT CUDA


For the past several months, I have been writing posts about CUDA C programming and general purpose computing on GPUs under the “The World Is Parallel” heading on this blog. A while back, I decided it was time for me to get a better idea of what I was talking about by learning at least the rudiments of CUDA programming. So I got myself a copy of Programming Massively Parallel Processors: A Hands-on Approach by David Kirk and Wen-mei Hwu and set to work.

I should start by saying that although I have worked with computers since the mid-1960s, when I learned to program an IBM 7090 in an Algol variant called the Michigan Algorithm Decoder, I haven’t written anything resembling a serious chunk of code in 25 years. But I have followed developments in programming practice and know just enough to be dangerous. My C is like my French; I understand it a lot better than I speak it. My interest in learning CUDA is academic, and my activities have been limited to messing around and understanding sample programs.

CUDA C, for the uninitiated, is a platform for running general-purpose code on NVIDIA GPUs. It is implemented as a set of extensions to the C programming language, and its software development kit can be used with such standard programming environments as Microsoft Visual Studio and Apple Xcode.

The multiple processor cores in a GPU work differently from multiple CPU cores: GPU cores operate in parallel, each performing the same set of operations on different data. Thus, to take advantage of CUDA C, you need a problem in which the same operation can be carried out independently on many elements of a data set; each of those independent computations is called a “thread” in CUDA. (These are only vaguely related to CPU threads.)

The central example used by Kirk and Hwu is a familiar one, matrix multiplication. If you recall your math, this requires taking the dot product of each possible pair of rows from one matrix and columns from the other. For two NxN square matrices, this requires N multiplications and N-1 additions for each of the N-squared dot products. It’s ideal for CUDA C programming because each dot product is identical except for the numbers involved and all of the steps are independent of each other. The code can be written in a way that I found very easy to understand while still showing the capabilities of CUDA.
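To give a flavor of what this looks like, here is a minimal matrix-multiplication kernel sketch of my own, in the spirit of (but not copied from) the book’s example. Each thread computes one element of the result, that is, one dot product; the function and variable names are my own choices:

```cuda
// Sketch of a CUDA C kernel multiplying two N x N matrices A and B into C.
// Each thread computes a single element of C: the dot product of one row
// of A with one column of B.
__global__ void matMulKernel(const float *A, const float *B, float *C, int N)
{
    // Each thread works out which element it owns from its coordinates.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)   // N multiplications, N-1 additions
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}
```

Note that the kernel contains no loop over rows and columns; the hardware launches one thread per output element, and each thread does only its own dot product.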

If you have a problem that can be readily parallelized in this way, CUDA C does not strike me as especially difficult to learn. There’s some unfamiliar terminology. A function that runs on the GPU is called a “kernel.” Threads are organized into a hierarchy of “warps”, “blocks”, and “grids.” But I found that once I grasped the system by which threads are indexed, it’s pretty straightforward.

CUDA C programming does emphasize some things that I think many older programmers have forgotten and younger ones never learned. Programmers who want highly optimized routines need a certain intimacy with the workings of the computer. When working with matrices, for example, you have to know how the elements are stored internally, and it turns out that C and FORTRAN do it differently. Threads must be arranged into blocks of up to 512 threads, and blocks can be stacked into grids. How many threads can actually be processed simultaneously depends on the particular GPU in the system. Fortunately, CUDA takes care of the housekeeping of managing this at run time, so the programmer doesn’t have to worry about it, but performance can be optimized if you know how many cores will be available on the hardware.

It helps to understand memory management. In the early days of computers, programmers sweated every bit; I think that IBM 7090 had about 128 kilobytes of random-access memory. In recent years, the availability of vast amounts of memory has made everyone a bit sloppy. CUDA C brings back memory discipline. At the risk of oversimplifying the complicated subject of CUDA memory, the programmer has a choice between shared and global memory. Shared memory is fast but tiny. Global memory is huge, but accessing it imposes a steep performance penalty. CUDA C rewards programmers who can figure out clever ways to minimize global memory accesses, perhaps by arranging things so that bits of code that run consecutively use some of the same data. Squeezing out maximum performance requires the programmer to pay close attention to the order in which operations are performed.
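The classic illustration of this discipline is the “tiled” matrix multiplication that Kirk and Hwu build up to. Here is a sketch of the general idea in my own words and naming (it assumes, for simplicity, that N is a multiple of the tile size): each block of threads cooperatively copies a small square patch of each input matrix from slow global memory into fast shared memory, and then every thread in the block reuses those patches, cutting global-memory traffic roughly by a factor of the tile width.

```cuda
#define TILE 16   // 16 x 16 = 256 threads per block, within the 512 limit

// Sketch of a tiled matrix multiply; assumes N is a multiple of TILE.
__global__ void tiledMatMul(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];   // fast on-chip staging areas,
    __shared__ float Bs[TILE][TILE];   // shared by all threads in the block

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread fetches one element of each tile from global memory.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();               // wait until the whole tile is loaded

        for (int k = 0; k < TILE; ++k) // every thread reuses the shared tiles
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // wait before overwriting the tiles
    }
    C[row * N + col] = sum;
}
```

Each element of A and B is now read from global memory once per tile rather than once per dot product, which is exactly the kind of “consecutive bits of code share data” arrangement described above.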

CUDA C is available as a free download for Windows, Mac OS X, and Linux. Even if you are a casual programmer—and they don’t get much more casual than myself—I urge you to give it a look. It’s fascinating in its own right, and it will also give you insight into general-purpose GPU computing, a technology that promises huge performance increases.