The recent news and industry reaction regarding Intel’s forthcoming “Many Integrated Core” (MIC) accelerator has been interesting to watch. It appears Intel, like NVIDIA and AMD, has now concluded that hybrid architectures are the proper response to the growing power constraints of high performance computing.

While I agree with this, some of the discussions around programming the upcoming MIC chips leave me scratching my head – particularly the notion that, because MIC runs the x86 instruction set, there’s no need to change your existing code, and your port will come for free.

Power is the Problem

The technology underpinnings responsible for the move toward hybrid computing are pretty compelling, driven by the huge inflection point we experienced in the previous decade. Moore’s Law is alive and well, continuing to dish up more and more transistors per square mm. But Dennard Scaling is not.

We can no longer reduce voltage in proportion to transistor size, so the energy per operation is no longer dropping fast enough to compensate for the increased density. The result is that processors are now totally constrained by power. And it’s getting exponentially worse with subsequent generations of integrated circuits!

Circuit performance per watt is still improving, but now at closer to 20 percent per year instead of the almost 70 percent per year we used to enjoy. So how can we continue to improve performance anywhere close to historic rates, and achieve exascale computing by the end of this decade? Since the underlying technology is going to fall far short of the improvements we need, our only hope is to dramatically reduce the overhead per operation.

Hybrid is the Answer

This is where hybrid architectures come in. NVIDIA GPUs implement hundreds of simple, power-efficient cores that are optimized for high throughput on parallel workloads. Multicore x86 processors implement a handful of complex cores that are optimized for fast single-thread performance, but take many times more energy per operation.

To improve application performance per watt, we have to shift most of the work to the throughput-optimized cores, and just use the fast (but less efficient) CPU cores for the residual serial work. This is a hybrid architecture. Since you can’t optimize a core for both energy-efficiency and fast single-thread performance, the hybrid architecture allows us to concentrate on making the GPU cores more and more energy efficient, while relying on the CPU cores for serial performance.
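
To make the division of labor concrete, here is a minimal sketch in C with OpenACC directives. The function and array names are purely illustrative, not from any real application:

    /* Illustrative hybrid split: the O(n) data-parallel update runs on
     * the throughput-optimized accelerator cores, while the tiny
     * residual serial step stays on the latency-optimized CPU core. */
    void step(double *u, double *u_new, int n, int nsteps)
    {
        for (int t = 0; t < nsteps; ++t) {
            /* Bulk of the work: offloaded to the accelerator. */
            #pragma acc parallel loop copyin(u[0:n]) copy(u_new[0:n])
            for (int i = 1; i < n - 1; ++i)
                u_new[i] = 0.5 * (u[i - 1] + u[i + 1]);

            /* Residual serial work: boundary conditions on the CPU. */
            u_new[0]     = u_new[1];
            u_new[n - 1] = u_new[n - 2];

            /* Swap buffers for the next step (serial, negligible cost). */
            double *tmp = u; u = u_new; u_new = tmp;
        }
    }

As written, this sketch ships the arrays across PCIe on every time step; the OpenACC example later in this post shows how expressing locality keeps the data resident on the accelerator.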

Intel has announced a similar approach with MIC. They don’t really have the equivalent of a throughput-optimized GPU core, but were able to go back to a 15+ year-old Pentium design to get a simpler processor core, and then marry it with a wide vector unit to get higher flops per watt than can be achieved by Xeon processors.

So far, so good. But I’m perplexed when I hear some people say that there’s no need to change your existing code to run on MIC because it uses the x86 instruction set. Just recompile with the -mmic flag, and your existing MPI or OpenMP code will run natively on the MIC processor! (In other words, ignore the Xeon, and just use the MIC chip as a big multi-core processor.)

Native Mode Complications

Functionally, a simple recompile may work, but I’m convinced it’s not practical for most HPC applications and doesn’t reflect the approach most people will need to take to get good performance on their MIC systems.

The idea of running flat MPI code (one rank per core) on a multi-node MIC system seems quite problematic. Whatever memory sits on the MIC PCIe card will be shared by more than 50 cores, leading to very small memory per core. (Even a hypothetical 8 GB card shared by 50 ranks works out to roughly 160 MB per rank.) From what I know of the MPI communication stack, that won’t leave much memory for the actual data – certainly far below the traditional 1-2 GB/core most HPC apps want. And 50+ cores all trying to send messages through the system interconnect NIC seems like a recipe for a network logjam. The other concern is the Amdahl’s Law bottleneck resulting from executing all the per-rank serial code on a lower-performance, Pentium-class scalar core.

The OpenMP approach seems only slightly better. You’d still have the very small per-core memory and the Amdahl’s Law bottleneck, but at least you’d have fewer threads trying to send messages out the NIC.  Perhaps the biggest issue with this approach is that existing OpenMP codes, written for multi-core CPUs, are unlikely to have enough parallelism exposed to profitably occupy over 50 vector cores.

What About MIC Performance?

So far, the discussions of MIC programming have avoided confronting these issues by excluding any talk of performance.

We’ve seen scaling charts for MIC that show performance improving as more cores are used, but there is no absolute performance shown. And the “scaling” results are literally for a single chip (not really scaling at all in the HPC sense). They look eerily similar to the original Larrabee GPU charts from four years back.

To be fair, Knights Ferry is a pre-production prototype, and thus its performance is not supposed to be competitive. But it just doesn’t make sense to talk about ease of programming in the absence of any performance considerations.

The whole point of an accelerator is to accelerate! What programming effort will be necessary on MIC to actually get good performance?

No “Magic” Compiler

The reality is that there is no such thing as a “magic” compiler that will automatically parallelize your code. No future processor or system (from Intel, NVIDIA, or anyone else) is going to relieve today’s programmers from the hard work of preparing their applications for the future.

With clock rates stalled, all future performance increases must come from increased parallelism, and power constraints will actually cause us to use simpler processors at lower clock rates for the majority of our work, further exacerbating this issue.

At the high end, an exaflop computer running at about 1 GHz will require approximately one billion-way parallelism (10^18 operations per second divided by 10^9 cycles per second means a billion operations must be in flight every cycle), and the same logic will drive up required parallelism at all system scales. This means that all HPC codes will need to be cast as throughput problems with massive numbers of parallel threads. Exploiting locality will also become ever more important as the relative cost of data movement versus computation continues to rise. This will have a significant impact on the algorithms and data layouts used to solve many science problems, and is a fundamental issue not tied to any one processor architecture.

Directives: Performance + Portability

Portability across machines is very important for application developers, and directives are a great way to express parallelism in a portable manner. The programmer focuses on exposing the parallelism, while the compiler and runtime focus on mapping it to the underlying hardware (perhaps with the help of auto-tuners). The new OpenACC standard allows users to express locality as well as parallelism, and is particularly well suited to today’s emerging hybrid architectures.
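
As a hedged sketch of what this looks like in practice (revisiting the illustrative relaxation loop from earlier), an OpenACC data region expresses the locality explicitly: both arrays stay resident in accelerator memory across all the time steps, and only the final result crosses the PCIe bus.

    /* Sketch: expressing locality with an OpenACC data region. */
    void relax(double *u, double *u_new, int n, int nsteps)
    {
        #pragma acc data copy(u[0:n]) create(u_new[0:n])
        for (int t = 0; t < nsteps; ++t) {
            #pragma acc parallel loop
            for (int i = 1; i < n - 1; ++i)
                u_new[i] = 0.5 * (u[i - 1] + u[i + 1]);

            #pragma acc parallel loop    /* copy back, still on-device */
            for (int i = 1; i < n - 1; ++i)
                u[i] = u_new[i];
        }
    }

Because the directives describe the parallelism and data movement rather than any particular piece of hardware, the same source can be retargeted by the compiler to different accelerators or back to a multicore CPU.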

Existing OpenMP codes can also be taken forward, but will require some additional work.  Unfortunately, most OpenMP codes today apply the parallel directives to inner loops, which is appropriate for exploiting modest parallelism across a small number of cores. In order to run well on future machines with much more on-node parallelism, however, the directives need to be raised up in the call tree to expose much greater amounts of parallelism.
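
As an illustrative before-and-after sketch (a hypothetical loop nest, with a stand-in for the real computation): the first version below exposes at most NX-way parallelism and forks/joins threads on every outer iteration; raising the directive up the nest exposes the full NX*NY iteration space.

    #define NX 64
    #define NY 64
    double a[NY][NX], b[NY][NX];
    double work(double x) { return 2.0 * x; }  /* stand-in computation */

    /* Typical today: the directive sits on the short inner loop. */
    void update_inner(void)
    {
        for (int j = 0; j < NY; ++j) {
            #pragma omp parallel for
            for (int i = 0; i < NX; ++i)
                a[j][i] = work(b[j][i]);
        }
    }

    /* Raised up the nest: one parallel region over the whole
     * iteration space, enough work to occupy 50+ cores. */
    void update_raised(void)
    {
        #pragma omp parallel for collapse(2)
        for (int j = 0; j < NY; ++j)
            for (int i = 0; i < NX; ++i)
                a[j][i] = work(b[j][i]);
    }

In real applications the directive often has to move even higher, above subroutine boundaries, and that restructuring is where most of the porting effort goes.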

No Free Lunch

This will take effort, but it’s work that makes the applications inherently better suited for future architectures. Initial experience tuning codes for the new NVIDIA GPU-accelerated Titan supercomputer at Oak Ridge National Laboratory has been very positive, providing significant acceleration on key scientific codes.

Using OpenACC to express the parallelism and locality, developers found that as the code was optimized for GPUs, the same code also ran significantly faster on vanilla multicore CPU systems. Tuning HPC codes for accelerators is real work, but it is work that will pay off across machine types, and especially on future machines with increased levels of parallelism.

Which brings me back to the topic of programming for the upcoming MIC processors.

It’s clear to me that hybrid architectures make increasing sense in our power-constrained future, and Intel’s MIC effort shows they think so too. The upcoming Knights Corner processor will reportedly look much like today’s Fermi GPUs: power-efficient accelerators attached to an x86 CPU via PCIe. Programming the two architectures should be very similar: structure applications to expose parallelism and locality, and express via directives; use the multi-core CPUs for serial code, and execute the parallel kernels on the accelerator. The hope that unmodified HPC applications will work well on MIC with just a recompile is not really credible, nor is talking about ease of programming without consideration of performance.

There is no free lunch. Programmers will need to put in some effort to structure their applications for hybrid architectures. But that work will pay off handsomely for today’s, and especially tomorrow’s, HPC systems.

Get Started Today

It remains to be seen how Intel MIC will perform when it eventually arrives. But why wait? Better to get ahead of the game by starting down the hybrid multicore path now.

You can start today with NVIDIA GPUs, and you’ll be that much further ahead regardless of which processor architecture you ultimately choose.

If you currently have access to MIC chips and have been testing real applications, I would love to hear from you. Post a comment below sharing your experiences and results. I’m also interested in your thoughts on the move to hybrid multicore architectures and how we’ll need to program them.

  • Chris UberTiny Adams

    Nice read, thanks!

  • Xclusive Technology

    I have recently gotten into crunching numbers for BOINC (freeDC). I am not running any GPUs, and recently began to learn about the Intel MIC, and have decided to wait and see where the technology goes. I am very curious about these things. I’m waiting to see some hint of retail packaging. I cannot even seem to find out if they will in fact be released only in a standard CPU package (LGA) or if there will be a PCI-e version available like I hope.
    Then there’s the whole issue of the developers of BOINC incorporating proper support for this hardware.
    If the MIC is released in a PCI-e package I will definitely be in the market for one.
    I’m interested in the performance aspect vs. GPU setups, and BOINC projects are a great way to test this kind of thing.
    -Dave

  • Steve Scott

    Thanks for your comment.  We’re looking forward to being able to do head-to-head performance comparisons as well!

  • Kenneth O’Brien

    This is extremely relevant to my research. Good read, thanks. :)

  • Tiziano Diamanti

    This is exactly what I thought. It makes no sense to have easy compilation if the performance is not there. My suspicion is that with MIC they will have to use TBB or directives or even intrinsics, so goodbye easy compiling. At that point we will see what performance MIC will have, but x86 compatibility will not be an advantage; it may actually be a drawback, a waste of silicon that uses power for something not useful.

  • Steve Scott

    Right, what matters is the work it will take to actually achieve good acceleration on real codes. I think directives will be the most common path. Binary compatibility doesn’t really provide an advantage in HPC, where we all recompile our codes for every new platform. The real challenge we’ll all face is preparing codes for a future where computers only get wider, not faster.

  • jeffsci

    Intel MIC has no problem with Fortran OpenMP code or making optimized library calls (e.g. LAPACK, BLAS or FFT) from portable source code (that is, not CUDA).  None of these three features is supported by NVIDIA Fermi.

    NVIDIA doesn’t have their own Fortran compiler and the PGI GPU compilers have trouble with simple loop code.  OpenACC does not support native library calls such as CUBLAS and PGI claims that current-generation GPUs cannot do this in general (http://www.pgroup.com/resources/accel.htm#functions).  For the majority of scientific applications that make smart use of vendor-optimized libraries, Intel MIC has a vastly superior development environment.

    In addition to node-level features like OpenMP and libraries, one must consider distributed-memory parallelism using models such as MPI. How does one run an MPI, UPC or CAF code on NVIDIA Fermi? Intel MIC runs MPI applications within the card, between the card and host, and between nodes. NVIDIA only supports the offload usage, which is not appropriate for many algorithms.

    And to be clear, Intel MIC arrived months ago.  Just because NVIDIA employees can’t get their hands on one does not mean programmers around the world aren’t already using them to experiment with all of the aforementioned features.

  • Steve Scott

    I appreciate the post, Jeff. I’m sure the MIC programming environment will have some nice features. You may be under some misimpressions about NVIDIA’s programming environment, though. OpenACC does support full interoperability with CUDA code such as CUBLAS. And of course, you can call GPU-accelerated libraries from portable code.

    It’s actually scalable, distributed-memory versions of code that I’m particularly thinking of when I advocate for a true hybrid programming model. I don’t think internode MPI message passing from 50+ MIC cores will be an effective approach.

    I’m well aware, of course, that Knights Ferry is available as a software development kit to select partners under NDA. I look forward to the day we can do direct and open comparisons of the programming experience and performance on production systems. Intel’s got a lot of talented engineers, and I expect a good programming environment for MIC.  Of course, our next generation Kepler GPU will be shipping later this year, and you can expect some impressive enhancements of both the programming environment and performance on that platform. We’ll be able to have a grounded conversation about programming for performance at that time.

  • jeffsci

    That’s great that OpenACC supports CUBLAS calls inside of offload regions. PGI’s website did not give me the right information about this the last time I looked.

    There are, of course, shortcomings of MIC, but I feel that it is necessary to be a Devil’s Advocate against all vendors in the face of multi-million dollar marketing campaigns by Intel, NVIDIA, AMD, Cray, IBM, etc. to convince the world that they are selling the greatest product for programmers since Mountain Dew. I’m glad you’re willing to be a good sport about this.

    One aspect of NVIDIA’s HPC offerings that is discussed in expert circles as a big win is that CUDA has a scalable memory model – specifically, strict coherence and strong coherence have never been part of the plan. I would be interested in seeing some back-and-forth between NVIDIA and Intel over the scalability of their respective accelerator memory models. It’s one thing to debate who has better pragmas (OpenMP vs. OpenACC), but syntax is inconsequential without corresponding scalability when it matters (including nasty algorithms like graph partitioning).

    Perhaps it’s worth discussing how the CUDA memory model compares in scalability (within threaded programming models) to cache coherence with transactional memory (which is real in IBM Blue Gene/Q and stated by Intel to be available in future x86 CPUs).  I apologize if this topic has been addressed already, but I don’t read anyone’s blog religiously.

  • Michael Simmons

    If Intel is putting 64MB of on-package ultrawide memory on Haswell, then to me it’s probable that they will do the same with MIC.
    see http://semiaccurate.com/2012/04/02/haswells-gpu-prowess-is-due-to-crystalwell/

    Is Nvidia considering this for future GPUs?

  • Rahul

    I do agree with you that the future will see hybrid architectures. As a person on the software and tool development side, my primary concern is whether the major vendors will all agree upon and support some standards. I am primarily investing in OpenCL development. While OpenCL codes do require tuning going from one platform to another, at least that’s not a complete rewrite.

    OpenCL is still maturing, and I hope the OpenCL standard develops a little faster. The full capabilities of architectures like Fermi are not really exposed through OpenCL. I am also hoping that a lower-level interface like PTX or CAL is standardized to be an easier target for compilers (particularly JIT compilers for dynamic languages).

    I hope that Intel MIC (and for that matter Ivy Bridge and Haswell GPU) will also support OpenCL instead of coming up with yet another API. I have never seen OpenCL mentioned in any Intel presentations about MIC, Ivy Bridge or Haswell.

  • Hoang Vu Nguyen

    Isn’t it pathetic that Nvidia has to resort to propaganda to stop customers from waiting for an unreleased Intel product? It’s just because their Kepler for HPC is still nowhere to be seen.

  • Tiziano Diamanti

    OpenCL is OK for small-to-medium projects. For 200,000+ lines of code, it is not. Managers will shoot you if you mention the word “rewrite”. Directives are the only solution in those cases.

  • Alan Dang

    One of the challenges is that only a subset of HPC users have the technical resources and time to write custom code. Those are groups like Oak Ridge or large engineering departments at major universities. The “desktop supercomputer” market, where budgets are <$50k, is where MIC seems to be most attractive. We are big users of LS-DYNA, but we don’t have the time/resources to port software over to new platforms. All we really care about is what will give us the fastest performance for our FE simulations. If MIC-in-a-box is faster than a cluster of Xeons simply by running OpenMP (even if it’s very inefficient), we’d still go with it. For groups like mine that use tools like LS-DYNA, MIC isn’t competing with Tesla. It just has to compete with a cluster of Xeons, and Tesla never gets a chance for the bid…

  • Steve Scott

    Right, I agree that most HPC users run ISV codes. Just a few years back, GPU computing was mostly roll-your-own software, but the momentum for commercial application availability has been quite strong lately. Since 2009, several major CAE ISVs have announced product availability on NVIDIA’s Fermi architecture, and there will be existing and new product updates around Kepler this year. The list includes the market leaders, including ANSYS, SIMULIA, MSC Software, Altair, Autodesk and LSTC, as well as an emerging class of new ISVs who’ve developed their software around GPU technology from the beginning, such as IMPETUS, Prometech, FluiDyna, Vratis, and others. Every relevant ISV in the CAE market has GPU technology undergoing some stage of product evaluation, and the motivation is simple: GPUs have become a proven strategy for them to deliver compelling performance gains on their production solutions. If it’s been a while, it’s worth taking another look at the ISV solutions available today that have been GPU-accelerated, and keep an eye out for new ISV announcements around Kepler later this year.

  • Paul

    This image from semiaccurate.com suggests OpenCL support for Ivy Bridge GPU.

  • Дмитрий Вожегов

    Sorry for my English!
    Setting aside the deep past (1960-1980), you must know that in the 1990s Intel and AMD concluded that, in addition to a scalar unit, a processor must have vector units. An Intel or AMD processor is a hybrid computer system, because it contains scalar and vector units. For twenty years, everyone who uses or designs these processors has understood that the best computer system is a hybrid (for example, scalar unit + SSE). Even before the creation of NVIDIA, Intel and AMD had built hybrid computing systems, and their experience in using and designing them is much greater than NVIDIA’s. Their compilers generate not only code for the scalar unit but also SSE code, i.e., they compile hybrid code (scalar and vector).
    Their computer systems were hybrids before NVIDIA appeared, and their compilers have been able to generate hybrid code for many years, yet you are trying to show that a hybrid system is “x86 + GPU” (though if you remove the GPU, the system would still be a hybrid). The hybrid system is not the future; it is already the past (1970-1980). Do you want to present the past achievements of others as your own?
    Much of what you describe is “forgotten past” that Intel has already been through several times.
    Intel has much more experience in developing hybrid computer systems and compilers for them than NVIDIA. Problems appear only when you are solving problems; there are no problems only for those who do nothing. Let us not underestimate Intel. I think they will cope with this task.

  • Steve Scott

    You make a good point that vector architectures are another form of hybrid.  I spent many years as chief architect at Cray, so I know vector architectures well.

    The growing power constraints that we’re facing in the future, however, demand a different sort of hybrid. We need most cores to be optimized for extreme power efficiency. Vector architectures can fill this role, but only with very simple scalar cores.  This is what Intel is doing with MIC. 

    You still need a complex, aggressive out-of-order core to do the serial work as fast as possible, but you can’t couple one of those to every vector unit or you will compromise their power efficiency. You need lots of power-efficient cores, and only a few serial-optimized cores.

    This is the new type of hybrid processor, and goes beyond the scalar/vector form of past hybrid architectures.  NVIDIA, Intel, AMD, and to some extent IBM, have all articulated this as the future of HPC.

  • Дмитрий Вожегов

    Thanks for the answer! I just wanted to say that you counterpose “x86” and “x86 + GPU”, saying that the second is a hybrid system while the first is just a complex core for serial work. But “x86” = SISD + SIMD (and has been for 20 years), and “x86 + GPU” = SISD + SIMD. Both of these are hybrid systems. And great x86 compilers have been compiling great hybrid code for such systems for 20 years. Intel understands very well what such a hybrid system is, how it works, and how to make it even better and more practical.
    It was precisely for greater power efficiency and productivity that SIMD units were added to x86, and all this time these units have been improved and have grown, because everyone understood their significance. NVIDIA has made it possible, for a little money, to get a SIMD unit many times more powerful than those in an x86 core (now it is even more correct to say that a GPU is a “many-SIMD” or MIMD system). But all this time, Intel has worked and accumulated experience with SISD, SIMD, MISD and MIMD systems. And as practice shows, their “magic compiler” compiles good code for all these systems, and I think MIC will be no exception.
    It is only my opinion, but I think Intel has enough experience to solve the problems you mention, and they understand them well too.

  • RMC Software Inc

    It does seem like hybrid architecture parallel computing is the only way forward.

  • jkflipflop98

    Never count out Intel. 

  • Kapil Mathur

    The post reflects the key area of concern: using accelerators at their optimum performance when scaling applications. NVIDIA has proven its stake to a good extent; I hope for the same from Intel MIC.

  • Brayan

    Hey, I know this may not be the best forum for this, but this is a recent one, so here goes: could you please add support for Samsung’s UE46ES8000? I have raised ticket #120501-000105 – thanks :)

  • Stephen Thomas

    For MPI, it is doubtful that all the processes running on the MIC will communicate through the NIC.
    Rather, because of locality and the surface-to-volume effect of data decomposition, only a reasonable subset of the cores will go through the NIC, and the majority of the messages will end up being memory-to-memory copies. For example, if we have a rectangular domain and an 8×8 grid of cores is assigned a rectangular sub-domain, or even if each of the 8×8 = 64 cores is assigned its own sub-domain (MPI rank), then the worst-case scenario is that 32 of the 64 cores communicate one boundary through the NIC. There are far more internal exchanges in this scenario.

    The best approach is hybrid, with an MPI/OpenMP or MPI/pthreads combination – perhaps 4 or 8 MPI tasks in combination with 16 or 8 large-grain pthreads. This will decrease both the number of hits on the NIC (fewer subdomains) and the number of memory-to-memory copies. The burden then shifts to thread start-up and scheduling.

    Sure, this is not an out-of-the-box, compile-and-go solution, but one can easily imagine the hybrid scenario.

  • Andreas Schäfer

    Thanks for stressing that “there is no free lunch”! When I teach parallel programming (no matter whether it’s multi-cores, MPI or GPUs), I always include cases in which it’s not trivial to achieve a speedup at all, and challenge the students to explain what’s happening. Too many think that parallelizing code is as easy as “#pragma omp parallel for”…