by Steve Wildstrom

This post is an introduction to a series of reports on computer scientists and other researchers who are unlocking the high-performance computing potential of parallel programming using large numbers of processor cores. But first some background on the opportunity and the challenge of parallel computing.

Some time around the middle of the last decade, the race to ever-faster computing hit the wall. Until then, designers had delivered soaring performance through three well-understood technologies: shrinking the already microscopic transistors, cramming more of them into each processor, and running them at higher speeds.

The problem was that faster processor performance translated into higher power consumption and more heat, and even if you could find a way to get rid of the excess heat before the chips fried, continuation of the trend posed unacceptable economic and environmental costs.

An alternative route to faster computing had been around for some time. Instead of driving the processors harder, use more of them. Mainframe computers and servers had long used multiple processors to handle heavy loads, but advances in chip technology made it possible to combine multiple processors on a single chip, an approach that is both more efficient and much cheaper. Today, high-performance computing is a story of dividing computational workloads over multiple processor cores. In the case of personal computers, this means both a handful of cores in the CPU and dozens, sometimes hundreds, of cores in the graphics processing unit (GPU).

But multiprocessor hardware brings with it a significant software challenge. From the beginning of modern computing in the 1940s, programs had been designed to work sequentially. Funding, mostly by the Defense Advanced Research Projects Agency, produced some successes in systems with large numbers of processors designed to solve computational problems by breaking them into many pieces that could be run simultaneously, but these massively parallel systems never achieved commercial viability.

One reason is that most common computing problems, and the algorithms used to solve them, are not well suited to this sort of breakup. And sequential thinking seems to be wired into our brains. Neuroscientist Jill Bolte Taylor says the right hemisphere of the brain, which processes sensory signals, does parallel processing but the left hemisphere, which is responsible for analytic thinking, “functions like a serial processor.” For better or worse, programming is a left-brain activity.

The biggest mathematical impediment to parallel approaches is that many processes are recursive: each step depends on the result of previous steps. Consider the simple problem of finding the greatest common divisor of two integers. The standard method of doing this, the Euclidean algorithm, has been known for over 2,000 years and uses repeated subtraction.

For example, if you want to find the greatest common divisor of 2,987 and 1,751, start by subtracting 1,751 from 2,987. Then keep subtracting the smaller of the two remaining numbers from the larger (switching the order if needed to prevent negative numbers) until the result is 0. In this case, the two numbers have a greatest common divisor of 103. It’s a beautiful and efficient process, but it is inherently sequential because each subtraction depends on the previous result.
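The subtraction procedure can be sketched in a few lines of Python. This is only an illustration: the function name `gcd_by_subtraction` is made up for this post, and the loop stops as soon as the two numbers are equal, which is one step before a further subtraction would give 0.

```python
def gcd_by_subtraction(a: int, b: int) -> int:
    """Greatest common divisor of two positive integers by repeated subtraction."""
    if a <= 0 or b <= 0:
        raise ValueError("both inputs must be positive integers")
    while a != b:
        # Each pass needs the values left by the previous pass,
        # so the passes cannot be run at the same time.
        if a > b:
            a -= b
        else:
            b -= a
    return a


print(gcd_by_subtraction(2987, 1751))  # prints 103
```

Every pass through the loop reads the values written by the pass before it, which is exactly the dependency that keeps this kind of computation from being spread across many cores.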

The great exception to the dominance of serial thinking is graphics. A very simple and common need in graphics is to rotate an image. If you remember some trigonometry, you may recall a simple formula to rotate a point (x, y) counterclockwise about the origin through an angle Θ:

x′ = x cos Θ − y sin Θ
y′ = x sin Θ + y cos Θ

The importance of this is that each point can be processed independently of every other point. If you had as many processors as points, the entire transformation could be computed in a single, massively parallel operation. And the same is true of many more complex graphics tasks.
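The independence of the points can be illustrated with a small Python sketch that uses the standard rotation about the origin. The names `rotate_point` and `rotate_all` and the sample `corners` are invented for this example, and the thread pool is here only to show the structure of the work: in standard Python it will not deliver a real speedup, whereas a GPU would hand each point to its own core.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def rotate_point(point, theta):
    """Rotate one (x, y) point counterclockwise by theta radians about the origin."""
    x, y = point
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))


def rotate_all(points, theta, workers=4):
    """Rotate every point; no call needs the result of any other call."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order even if calls finish out of order.
        return list(pool.map(lambda p: rotate_point(p, theta), points))


# Hypothetical sample data: four points on the unit circle.
corners = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
rotated = rotate_all(corners, math.pi / 2)
```

Unlike the subtraction loop in the Euclidean algorithm, nothing in `rotate_point` looks at any other point, so the calls can be made in any order, or all at once.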

The parallel-friendly nature of graphics work led to the early incorporation of a multi-processor architecture into GPUs. NVIDIA’s top-of-the-line Tesla GPUs currently feature 240 processor cores. While these cores are not as flexible as CPU cores, they excel at certain tasks, such as the vector operations that lie at the heart of many intense computational problems.

Software for effective utilization of large numbers of cores, both CPU and GPU, remains a challenge, but things are getting better. NVIDIA helped lead the way with the CUDA parallel programming model, which enabled general-purpose computation on NVIDIA GPUs, and with extensions to the C programming language that made the processors accessible. Developers can thus program NVIDIA’s CUDA GPUs in C and C++ via the CUDA toolkit, in Fortran via PGI’s CUDA Fortran compiler, and through driver-level APIs such as OpenCL and DirectCompute.

One of the biggest challenges facing software developers is that, to get more performance from existing applications and to develop new, more compute-intensive ones, they have no choice but to consider parallelizing their code, whether they choose multi-core CPUs or many-core GPUs. Over the last few years of development, the CUDA parallel programming model has established itself as an “easier” way to do parallel programming (it still isn’t easy, but CUDA does make certain things easier). GPUs can also offer a tremendous performance advantage over CPUs, so combining these two elements offers developers a way to develop more innovative applications.