Building a Supercomputer With a Power Drill and 18,688 GPUs

by Brian Caulfield

Al Enger has been a busy man. Enger is one of a crew of Cray engineers who have been working to assemble a massive new supercomputer at the Oak Ridge National Laboratory in Tennessee. It’s called Titan.

There are a lot of ways to measure Titan’s size. The machine is about as big as a basketball court. It contains 6,329 miles of interconnect cables. It’s cooled with 1,353 gallons of special refrigerant. Data is stored on 21,030 disks.

But the best way to understand Titan’s scale and complexity is to talk to one of the blue-coated engineers who scamper around its 200 towering black cabinets. Enger and his colleagues from supercomputer company Cray work under fluorescent lights as fans pump 1.3 million cubic feet of air per minute through the room. Ear plugs are recommended.

Mixing GPUs and CPUs makes Titan five times as efficient as its predecessor.

Two months ago pallets bearing the 18,688 NVIDIA Tesla GPUs that provide about 90% of the machine’s computing power began to arrive. That’s when Enger picked up his green and black power drill and got to work. It took Enger and 20 colleagues three weeks, working 7 days a week, to bolt all those GPUs into the machine.

The result could be the world’s most powerful computer. It won’t be official until November, when TOP500.org releases its semiannual list of the world’s 500 fastest supercomputers. But there can be no doubt Titan represents a breakthrough. At its peak, Titan cranks out more than 20 petaflops. That’s twenty thousand trillion floating point computations per second (‘floating point’ refers to a format many computers use to represent very small and very big numbers efficiently).
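A short sketch makes the ‘floating point’ idea concrete (the two sample values below are illustrative, not from the article): a double-precision float packs a sign, an exponent, and a fraction into 64 bits, which is how one format covers both very small and very large magnitudes.

```python
# Illustration of the floating point format mentioned above: an IEEE 754
# double packs sign, exponent, and fraction into 64 bits, so the same
# 8-byte format spans tiny and enormous magnitudes.
import struct

tiny = 1.6e-35   # far below 1 (illustrative value)
huge = 2.0e30    # far above 1 (illustrative value)

# Both occupy exactly the same 8-byte format.
assert len(struct.pack('>d', tiny)) == 8
assert len(struct.pack('>d', huge)) == 8

# Titan's peak, written out: 20 petaflops = 20,000 trillion
# floating point operations per second.
peak = 20_000 * 1_000_000_000_000
print(f"{peak:.1e} flops")  # → 2.0e+16 flops
```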

What’s really significant about Titan isn’t how many zeros you need to measure its performance, but how few megawatts Titan needs to do its work. Because it relies on GPUs to do much of the computing — rather than just CPUs — Titan requires only 9 megawatts of power.

Titan represents a step towards even faster ‘exascale’ computing.

Titan is five times as efficient as Jaguar, the 2.3-petaflop computer it replaced at Oak Ridge. That efficiency comes thanks to an idea called ‘heterogeneous computing,’ says Buddy Bland, project director for the Oak Ridge Leadership Computing Facility.

“If this were a machine of the same power and it were using CPUs it would be using about 30 megawatts of power, or about $30 million a year,” says Bland. “So heterogeneous computing really gives us a lot more bang for the buck.”
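Bland’s numbers hold up to back-of-the-envelope arithmetic. The sketch below assumes an industrial electricity rate of roughly $0.11 per kilowatt-hour, a figure not given in the article:

```python
# Back-of-the-envelope check of Bland's figures. The article supplies
# only the 30 MW and ~$30M/year numbers; the rate is an assumption.
HOURS_PER_YEAR = 24 * 365       # 8,760 hours
RATE_PER_KWH = 0.11             # assumed industrial rate, USD

def annual_cost(megawatts):
    """Yearly electricity bill for a machine drawing this many megawatts."""
    kwh = megawatts * 1000 * HOURS_PER_YEAR
    return kwh * RATE_PER_KWH

cpu_only = annual_cost(30)   # the hypothetical CPU-only machine Bland describes
titan = annual_cost(9)       # Titan's actual draw
print(f"${cpu_only / 1e6:.0f}M vs ${titan / 1e6:.0f}M per year")  # → $29M vs $9M per year
```

At that assumed rate, 30 megawatts works out to about $29 million a year, matching Bland’s “about $30 million,” while Titan’s 9 megawatts costs roughly a third of that.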

That’s because GPUs rely on the parallel computing technology long prized by supercomputer engineers. In order to render virtual battlefields or imaginary dragons for video game enthusiasts, GPUs hustle through a number of tasks at the same time, rather than bouncing quickly from one task to another, as CPUs do.
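The contrast can be sketched in a few lines of plain Python. This is a toy stand-in, not real GPU code: `map()` here merely represents the idea that a GPU applies one operation across many data elements at once, where real hardware runs the lanes simultaneously.

```python
# Toy contrast between CPU-style serial work and GPU-style data
# parallelism. On a real GPU, every "lane" of the map runs at the same
# time; here map() only stands in for that idea.
pixels = [0.1 * i for i in range(8)]   # stand-in for pixels or grid cells

def shade(p):
    # the same small computation a game might run per pixel,
    # or a climate model per grid cell
    return p * p + 0.5

# CPU-style: visit one element after another
serial = []
for p in pixels:
    serial.append(shade(p))

# GPU-style: one operation applied over all elements at once (conceptually)
parallel = list(map(shade, pixels))

assert serial == parallel   # same answer; the win is hardware throughput
```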

It turns out that’s a very efficient way to do computing, says Bronson Messer, acting group leader for scientific computing at the Oak Ridge Leadership Computing Facility.

“The kind of physical things that happen in a game, it turns out those things happen in nature as well,” says Messer, who admits to knowing his way around a game controller. “These are exactly the kinds of problems we’re trying to solve in a lot of scientific questions, from combustion to climate.”

Some assembly required: Titan contains more than 18,000 GPUs.

The result is a sort of synergy between gaming and scientific research, with the tens of millions of consumers who rely on GPUs to power their games paying for research on a scale that the supercomputing community could never afford on its own.

Yet the work done by those researchers is increasingly critical. Bland sees the simulations run by powerful machines such as Titan as playing an increasingly important role in scientific research. Titan is an open-science system, which means it can be used by researchers from academia, government labs, and private companies to model physical and biological systems ranging from the earth’s climate to the way engines burn fuel.

More powerful machines are coming. Titan, with its 18,688 GPUs, is a step forward on the path towards a concept Bland calls exascale computing. Titan can generate 20 thousand trillion flops. Exascale machines, by contrast, will generate one million trillion flops.

The U.S. Department of Energy would like to hit that mark by the end of the decade using just 20 megawatts of power. That’s a little more than twice what Titan consumes now.

Al Enger might want to start charging that power drill now.

Photos: Oak Ridge National Laboratory