“GPUS ARE ONLY UP TO 14 TIMES FASTER THAN CPUS” SAYS INTEL

It’s a rare day in the world of technology when a company you compete with stands up at an important conference and declares that your technology is *only* up to 14 times faster than theirs. In fact, in the 26 years I’ve been in this industry, I can’t recall another time I’ve seen a company promote competitive benchmarks in which its own products come out an order of magnitude slower.

The landmark event took place a few hours ago at the International Symposium on Computer Architecture (ISCA) in Saint-Malo, France. Interestingly enough, it is the same event where our Chief Scientist Bill Dally is receiving the prestigious 2010 Eckert-Mauchly Award for his pioneering work in architecture for parallel computing.

At this event, Intel presented a technical paper showing that application kernels run up to 14 times faster on an NVIDIA GeForce GTX 280 than on an Intel Core i7 960. As many of you will know, this is our previous-generation GPU, and we believe the codes run on the GTX 280 were run right out of the box, without any optimization. In fact, it’s unclear from the technical paper exactly what codes were run and how they were compared between the GPU and the CPU. It wouldn’t be the first time the industry has seen Intel make these kinds of benchmark claims.

The paper is called “Debunking the 100x GPU vs CPU Myth”, and it is indeed true that not *all* applications can see this kind of speedup; some just have to make do with an order-of-magnitude performance increase. But 100x speedups, and beyond, have been seen by hundreds of developers. Below are just a few examples, found on CUDA Zone, of developers who have achieved speedups of more than 100x in their applications.

Developer | Speed Up | Reference
Massachusetts General Hospital | 300x | http://www.opticsinfobase.org/oe/abstract.cfm?uri=oe-17-22-20178
University of Rochester | 160x | http://cyberaide.googlecode.com/svn/trunk/papers/08-cuda-biostat/vonLaszewski-08-cuda-biostat.pdf
University of Amsterdam | 150x | http://arxiv.org/PS_cache/arxiv/pdf/0709/0709.3225v1.pdf
Harvard University | 130x | http://www.springerlink.com/content/u1704254764133t5/?p=c5eead9af73340e58a313d95581cfd40&pi=49
University of Pennsylvania | 130x | http://ic.ese.upenn.edu/abstracts/spice_fpl2009.html
Nanyang Tech, Singapore | 130x | http://www.opticsinfobase.org/abstract.cfm?URI=oe-17-25-23147
University of Illinois | 125x | http://www.nvidia.com/object/cuda_apps_flash_new.html#state=detailsOpen;aid=c24dcc0f-c60c-45f9-8d57-588e9460a58f
Boise State | 100x | http://coen.boisestate.edu/senocak/files/BSU_CUDA_Res_v5.pdf
Florida Atlantic University | 100x | http://portal.acm.org/citation.cfm?id=1730836.1730839&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264
Cambridge University | 100x | http://www.wbic.cam.ac.uk/~rea1/research/AIRWC.pdf

The real myth here is that multi-core CPUs are easy for any developer to use and to get performance improvements from. Undergraduate students learning parallel programming at M.I.T. disputed this when they compared the performance increase they could get from different processor types with the amount of time they needed to spend rewriting their code. According to them, for the same investment of time as coding for a CPU, they could get more than 35x the performance from a GPU. Despite substantial investments in parallel computing tools and libraries, efficient multi-core optimization remains in the realm of experts like those Intel recruited for its analysis. In contrast, the CUDA parallel computing architecture from NVIDIA is a little over three years old, and already hundreds of consumer, professional and scientific applications are seeing speedups ranging from 10x to 100x using NVIDIA GPUs.
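
For readers wondering what such a port actually involves, here is a minimal, illustrative CUDA sketch (not drawn from any of the applications cited above) that moves a simple element-wise SAXPY-style loop from a CPU function onto the GPU. The function names, array size and launch configuration are assumptions made purely for this example.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* CPU version: a single thread walks the whole array. */
    void saxpy_cpu(int n, float a, const float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    /* GPU version: each CUDA thread handles one element. */
    __global__ void saxpy_gpu(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        const int n = 1 << 20;                 /* 1M elements, arbitrary for the sketch */
        size_t bytes = n * sizeof(float);

        float *hx = (float *)malloc(bytes), *hy = (float *)malloc(bytes);
        float *ref = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; ref[i] = 2.0f; }

        saxpy_cpu(n, 2.0f, hx, ref);           /* CPU reference result */

        float *dx, *dy;
        cudaMalloc((void **)&dx, bytes);
        cudaMalloc((void **)&dy, bytes);
        cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

        saxpy_gpu<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);
        cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);

        printf("gpu y[0] = %f, cpu y[0] = %f\n", hy[0], ref[0]);  /* both should print 4.0 */

        cudaFree(dx); cudaFree(dy);
        free(hx); free(hy); free(ref);
        return 0;
    }

The point is not the arithmetic itself but the shape of the port: the inner loop becomes a kernel, each CUDA thread handles one element, and the data is copied to and from device memory around the launch.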

At the end of the day, the key thing that matters is what the industry experts and the development community are saying and, overwhelmingly, these developers are voting by porting their applications to GPUs.

Comments

  • http://gpgpu.org/2010/05/30/ibm-rc24982 Edison

    http://gpgpu.org/2010/05/30/ibm-rc24982
    “Believe it or Not! Multi-core CPUs Can Match GPU Performance for FLOP-intensive Application!”

  • James

    Sounds like AMD FUSION will be a winning platform. NVIDIA can use the CUDA developer platform to provide tools and expertise to developers.

  • http://nexiwave.com Ben Jiang

    The GPU is a very different architecture. It’s designed for applications that can benefit from massive multi-threading. It really depends on the task.

  • http://www.programmerfish.com Salman Ul Haq

    It definitely is lame trying to undermine a performance improvement of 14X. It is an epic display of Intel’s desperation.
    On another note, that abstract link quoted in the first comment is worth a read!

  • Pete

    Enjoy the carefully-picked cherries. Try not to bite the pits.

  • Andrew Humber

    Hi Edison – thanks for the link, it was an interesting read.
    It’s worth noting that the dataset used here fits in cache, which eliminates a key weakness of the CPU.
    Once the data spills out of the CPU cache – which is common in real applications – CPU performance degrades very quickly.
    This is a very specific application type – as admitted by the authors.
    Andrew Humber
    NVIDIA PR

  • Andrew Humber

    Hi James – thanks for the comment.
    These are our thoughts on Fusion. Fundamentally, putting a CPU and GPU on the same chip is a low-end processor strategy.
    For high performance computing (HPC), Intel and AMD maximize CPU cores to get the best performance. Similarly, NVIDIA maximizes the number of GPU cores and clock frequencies in Tesla to get the best performance.
    To put a CPU and a GPU on the same chip, one has to pair either a small CPU with a medium-size GPU, or a medium-size CPU with a small GPU. Either way, you get lower performance from both the CPU and the GPU.
    This is an excellent strategy for the handheld and tablet market, where the system power is most important. This is why NVIDIA builds the Tegra SoC with ARM CPUs and NVIDIA GPUs.
    For the HPC market, the most important things are performance, power and programmability. Supercomputing and HPC users – in fact, anyone who needs to solve a problem as fast as possible within a reasonable budget – aren’t looking for merely adequate performance; they are looking for the highest performance possible within a reasonable power envelope.
    I’m afraid I am not quite sure what you mean about NVIDIA providing a development platform. ATI has only released OpenCL – and while it is relatively incomplete for HPC at the moment – their direction appears to be focused solely on that language.
    Andrew Humber
    NVIDIA PR

  • Andrew Humber

    Hi Ben,
    You are of course right, which is why we are careful to point out that not all applications will deliver 100x or more performance improvements with GPUs – as you say, GPUs excel at accelerating apps that lend themselves to parallel execution. While this used to mean “embarrassingly parallel”, modern GPUs can now efficiently execute algorithms that were previously only possible on CPUs.
    Ray tracing is a great example. With its irregular data access and sparse matrix operations, the common wisdom from CPU vendors was that the GPU would not run ray tracing efficiently. With both our previous-generation and current Fermi GPUs, thinking through how to make an application operate in parallel can yield big performance speedups.
    In the case of Intel’s paper, we believe the codes used for this paper were tasks that were hand-picked and optimized for CPUs, and not for GPUs, which is why even a 2x performance improvement using GPUs is significant – this number could be much bigger with optimization.
    I suspect Intel didn’t approach their research by pulling out their best GPU optimization experts to see just how fast the GPU could be pushed 🙂
    Thanks for the comment and for following nTersect.

  • Aydin

    I don’t understand what an application kernel is. Is it the OS kernel? Does it mean that next-generation operating systems could be based on a GPU? I have had this question for a long time… Is it possible to write or compile an OS kernel for a GPU? I couldn’t find a good answer about it.

  • http://nexiwave.com Ben Jiang

    That’s possible. Tuning is the key. During our GPU implementation, there was a huge coding paradigm shift: algorithms were rewritten many times for performance. Not sure if Intel bothered doing that ;).
    What we noticed was that most CPU code, in our previous system at least, was written without much consideration for parallel processing. There can be a debate about whether this is because the CPU has traditionally been single-task focused, or whether it is just the natural way humans think. In any case, those undergraduate-level courses are much needed for this kind of shift to happen naturally. Solid math skills are also critical.
    As far as performance goes, the current GPU (CUDA 3.0, at least) wins hands down. No doubt about it…

  • muckmoe

    Not like NVIDIA hand-picks their benchmarks.

  • Serge

    While it is interesting that Intel is noting this, we must remember that a CPU works on one thing at a time, finishing it as fast as possible before starting the next task. A GPU does everything at the same time. Obviously a GPU is much faster, if the app being run has been properly coded for that environment.
    Another thing: Andrew Humber said, “These are our thoughts on Fusion. Fundamentally, putting a CPU and GPU on the same chip is a low-end processor strategy.” And I somewhat agree, but we must take into consideration that what is “low-end” here is “mid-to-high level” for the mainstream market.
    For example, let’s take Fusion for a moment: putting a dual-core CPU and a mid-range GPU (say, something with 1GB of GDDR3 and a 128-bit bus) in the same envelope gives high performance while keeping power requirements lower than single-chip solutions. This is perfect for the mobile environment, most noticeably thin-and-light laptops, where power and heat constraints are monumental problems for engineering teams to tackle. The same goes for Intel’s Arrandale and Ivy Bridge architectures. Combining the CPU and GPU on the same die saves power and heat while increasing performance (compared to single-chip solutions). In the mainstream market, most programs still run on only one or two threads, so in most cases even a quad-core sees little to no performance edge over a dual-core CPU. Here, the full power of the GPU, whether NVIDIA or ATI, cannot be exploited, or hasn’t been exploited yet.
    My point is that for the mainstream market, even the “single-tasked” CPU performs better than the GPU in most cases, as few apps run on CUDA or OpenCL compared with how many run on x86 and “classical” code.
    Now, for the ultra-high-end market that HPC is, the CPU does come out slower, as that software is better coded and can take advantage of the highly parallel GPU architecture, in contrast to the “more single-tasked” CPU. In this scenario the GPU, even Intel’s IGP, shows a performance edge over the CPU due to the special nature of the code being run. It is obvious that we are moving towards more parallelized computing in the future, in both the mainstream market and HPC; that is why OpenCL was launched, to make it easier to push, just as Flash 10.1 accelerates performance via the GPU, and these are only the early days.
    Andrew Humber said: “This is an excellent strategy for the handheld and tablet market, where the system power is most important. This is why NVIDIA builds the Tegra SoC with ARM CPUs and NVIDIA GPUs.”
    There I agree that power constraints are much more difficult to beat; that is why the SoC was born, to help reduce the requirements, very much like what CPU manufacturers are trying to do to help mobile devices, such as notebooks, achieve better battery life. I am quite excited to see a Tegra 2-powered smartphone in mass production this year. That dual Cortex-A9 is very appealing, and performance seems to be excellent.

  • Andrew Humber

    Hi Aydin
    Thanks for the comment. We linked to a good wiki page in the post about kernels and their exact definition – I am sure the combined brilliance of the contributors to that page can do a better job of giving an *exact* definition of a kernel than I can 🙂
    But to answer your other questions, it’s probably not likely that an OS would be solely based on a GPU. At NVIDIA, we believe in co-processing or hybrid computing if you will, the use of GPUs and CPUs together, each doing what it does best.
    CPUs excel at sequential tasks like running an OS where each interaction could be different from the last, and GPUs excel at parallel tasks, like crunching through large, computationally intensive data sets that require the same calculation to be done over and over again.
    By using the two processors together, you enable the GPU to do the heavy lifting – the kind of computational problems that leave you staring at an hourglass for hours – and you free up the CPU so that it can respond quickly to user interactions/requests in the OS. There are a few areas where GPUs can help accelerate applications that are integral to the OS, such as video encoding (native to some OSes now), Aero (Windows) acceleration, Flash and Silverlight acceleration, etc.
    An oversimplified response, I know, but I hope it makes sense.
    Thanks again for the comment and following us on nTersect
    Andrew Humber
    NVIDIA PR

  • Bogdan

    I respect NVIDIA for making great graphics cards and I’m a loyal customer, but I can’t see the point of this media fight with Intel. First, buy the license and make your own CPUs. AMD did that and came out with something better: x86-64.
    Second, GPUs are clearly more powerful than CPUs, but there is no software to use them (the industry is still struggling with the 64-bit transition). Badaboom is a $29.99 application (for Windows only, I might add); how many customers do you have for that? In my opinion you should encourage developers to implement CUDA in their applications and leave Intel to fight Moore’s law.
    All the best !

  • AndrewL

    “In the case of Intel’s paper, we believe the codes used for this paper were tasks that were hand-picked and optimized for CPUs, and not for GPUs, which is why even a 2x performance improvement using GPUs is significant – this number could be much bigger with optimization.”
    Stop perpetuating this ridiculous claim. The guys who did this work are some of the best in the world (in both CPU and GPU work), and they indeed *began* by improving significantly on the state-of-the-art GPU code.
    This isn’t some random marketing presentation; it is a published, peer-reviewed paper (http://portal.acm.org/citation.cfm?id=1816021&coll=GUIDE&dl=GUIDE&CFID=94608761&CFTOKEN=50783980&ret=1), so if you claim that you can write faster GPU code, please go ahead and do it and publish your own paper. Furthermore, criticizing them for using pre-Fermi hardware is silly, since that hardware was not even available when the paper was published, let alone when it was written!
    Taking cheap shots and making unsubstantiated claims about people doing actual work does not endear you to the developer and research communities. Please, show a little more professionalism and do your research before taking shots at other people’s work.

  • Craig

    Here’s a great paper that helps debunk this myth. It comes down to how optimized your base-case is!
    http://www.usenix.org/event/hotpar10/tech/full_papers/Vuduc.pdf

  • Mohamed

    Successful companies focus on the users and not the competition.

  • Igor

    On the one hand, any acceleration over 10x achieved by one GPU compared with one modern multi-core CPU is great.
    On the other hand, numbers like the “300x speedup” (first row in the table above) ARE obviously BOGUS – they often compare a single CPU thread, or a very outdated two-core CPU, against a fairly modern graphics card.

  • http://www.heise.de/newsticker/meldung/ISC-10-Intel-Larrabee-ist-da-1012723.html Many-Core

    When will you shift PhysX from CUDA to OpenCL? OpenCL is an open standard and can be used by other GPUs; it would be great for customers…
    By the way, Intel already has its own HPC chip, so you will get competition there, too:
    http://www.heise.de/newsticker/meldung/ISC-10-Intel-Larrabee-ist-da-1012723.html

  • http://mcx.sf.net fangq

    Look, Igor, the code for the “300x speedup” paper is freely available at http://mcx.sourceforge.net/cgi-bin/index.cgi?Download#Anonymous_SVN_Access , and the CPU counterpart is here: https://orbit.nmr.mgh.harvard.edu/wiki/index.cgi?tMCimg/README
    Why not try it for yourself and make a scientific judgment?

  • http://mcx.sf.net fangq

    I want to add that the processors compared in this paper were an Intel Xeon 5120 (64-bit @ 1.86GHz) vs. an 8800GT, so it was fair, as both are now outdated.
    With an Intel Xeon E5530 @ 2.4GHz, the CPU code runs 2x faster than on the 5120, but with a GTX 470/480 the GPU code is 4x faster than on the 8800GT, which means 600x~700x faster than the E5530 @ 2.4GHz.
    I agree with you that some publications have biases in their comparisons, but on the other hand, you have to admit that for certain specific applications, such as Monte Carlo, the GPU can be orders of magnitude faster.
    I would be glad to answer any questions about this work, but please don’t call it “bogus” if you have not finished reading it (well, you may see that I am one of the authors).

  • Anonymous

    I wonder if you guys have read the paper thoroughly or are you just claiming/judging based on what this article has to say.

  • LarryJ

    Pot, meet kettle? You are the one providing “unsubstantiated claims” when you say that the guys at Intel wrote expert, hand-tuned code for the GPU used in this comparison.
    By the way, NV_Humber was not “criticizing” Intel for using pre-Fermi hardware, merely pointing out that this newly published article uses a prior-generation card from NVIDIA. So if the GTX 280 has “only a 2.5x” performance advantage over Intel (with no real information from Intel about what they specifically did to “optimize” the GPU code), then the performance advantage is all the wider now for NVIDIA with the GTX 480 on the market.
    On a final note, AndrewL is currently employed by Intel. Next time, have the grace and courtesy to identify yourself and your employer when posting on a competitor’s website.

  • AndrewL

    Have you read the paper? It is well explained and referenced where the implementations come from, and as noted there do not appear to be any better numbers published to date. If you dispute that claim, simply find better published numbers or publish your own paper. The academic world thankfully does not operate on idle speculation… the claims in the paper were sufficient to pass peer review, which puts the burden of proof on the people disputing them. That’s how the system works, so I encourage you to go through the proper channels if you believe it to be in error. A blog post with blind criticism of published work is simply unprofessional.
    I used my real name because I do have the grace and courtesy to be open. That said, this has nothing to do with Intel; I had no part in the work and do not work with the authors. I would ridicule any company for making similar PR-like attacks on published academic work. Leave the marketing to those folks and let’s stick to the established academic process when it comes to papers.

  • AndrewL

    Certainly the GPU can be (much!) faster in some cases, but this is actually precisely the sort of comparison that the paper is trying to “debunk”… the GPU implementation of the MC simulation is not doing the same thing as the CPU implementation. By my reading, the MT atomic case (which produces a more modest speedup) is the most directly comparable and even then it does not produce identical results. Furthermore, according to the paper the CPU implementation is running dated research code (unclear how optimized it is, if at all) on a single core, using double-precision (and I imagine not using SSE at all), while the highly-tuned GPU implementation is using a different RNG, lots of parallelism/SIMD and single-precision. This is not to devalue the work in any way; it is obviously very useful! That said, it does not represent a fair comparison between the two hardware architectures (and doesn’t really aim to), which is the point that the originally referenced paper is making.

  • Igor

    Probably BOGUS wasn’t the best word to use – however this article clearly states on pg 10 “we compiled the CPU implementation, tMCimg, using the “-O3” option with the gcc compiler and double-precision computation on a single core of an Intel 64bit Xeon processor of 1.86GHz.”
    So, the way I understand it, a single-threaded, non-SIMD, double-precision version on the CPU was compared with a single-precision, fully threaded, carefully tuned GPU version. I know that the G92 doesn’t have DP hardware support – but it’s still not a reasonable comparison.
    Don’t get me wrong – I understand that the GPU will still be MUCH faster (50 times?) than a fully parallelized, SIMD-enabled, carefully tuned, same-precision CPU version – but this article just didn’t compare apples to apples, IMHO.
    And thanks for the source code links (and the BSD license); I will look at it if/when I have time (unfortunately I don’t work in research 🙁 ).

  • http://mcx.sf.net fangq

    I am glad that you guys have read the paper 🙂 and, as you saw, there is more than one number concerning the speed-up; each has its benefits and trade-offs. I want to follow up with some of my thoughts.
    First of all, the definition of “apples” really varies from person to person, and I believe it also differs significantly between an engineering perspective and a computer-science one (my paper was not written for the latter). The criterion here is not “bit-for-bit identical”; rather, it is whether the computation suits its purpose. Comparing two methods that do not give exactly the same answer does not make the comparison meaningless; analogously, the fact that a computer program is unable to compute an infinite series expansion does not mean we cannot use such programs for comparison. The point of the paper is that, when your goal is to get fluence more than a few mm away from the source, non-atomic, single-precision, massively parallel GPU kernels can give you a meaningful solution and are much faster than a (widely used) single-threaded CPU code. All of these techniques (or approximations) are legitimate “truncations”, just as you would make when using a series-expansion solution.
    Secondly, the main point of the paper is in fact not GPU-specific, although the peak speed-up does draw a lot of attention to the GPU. The massively parallel algorithm can be implemented for Larrabee as well (and of course, the atomic vs. non-atomic issue also exists on the CPU). It is massively parallel vs. single-threaded that the paper was trying to compare.
    Thirdly, if you look at the code itself, you will see that the core computation [1,2] is so simple (~40 lines of code without boundary reflection) that there isn’t much room for GPU-specific optimization. In fact, in order to get it working on the GPU, I had to make some sacrifices. For example, the logistic-lattice RNG is almost twice as slow as the multiply-with-carry RNG used in the CPU code, and to enable reflection (which the CPU code does not support) I had to consume ~10 registers per thread, which limits my maximum thread-block size. Although I spent an equal amount of time profiling and optimizing the CPU code, I am pretty sure many people could do a better job; however, I will be very surprised if anyone can get the CPU code 2*#Core times faster than it is at present (I don’t want this to sound like a challenge – treat it as an invitation for open-source collaboration; of course, you can also help me optimize the GPU code [2] to compete 🙂 ).
    Lastly, I wrote an OpenCL version of my code and would be glad to share it if you consider this a fair way to compare. At this point the CPU back-end only works on ATI platforms, but my findings are similar to those in my paper. Please email me at fangq nmr.mgh.harvard.edu if you want to try the OpenCL version yourself.
    [1] CPU code: http://is.gd/d88hg
    [2] GPU code: http://is.gd/d88jw

  • AndrewL

    Yup, I completely agree with what you’re saying, and that’s why I said your paper is good in its own right – i.e., it practically solves the problem you were trying to solve, and in that light the results are great! The point the paper here is trying to make concerns the inference that people wrongly draw from results such as yours: that a GPU is 300x faster than a CPU. That’s basically impossible given similarly optimized code for each platform, since in no single respect is the GPU that much more powerful. 10-50x is certainly possible, though, depending on the workload.
    Regarding optimizing the CPU code, you need to be using single precision, SSE and multi-threading to even begin to compare. Given that you’ve done the legwork to parallelize the code for the GPU, it’s safe to assume that you can apply similar strategies to get good data-parallel scaling on the CPU as well. Thus, on a quad-core processor with SSE you’re looking at more than a 16x improvement in math throughput from these changes alone (more if you use a hyper-threaded processor). Switching to a faster (and better!) random number generator than rand() and better trig functions than the C standard library may help too, depending on where your bottleneck is at this point (have you profiled your code to find the hot spots?).
    I’d certainly expect a good OpenCL implementation to take advantage of a lot of this as well, so I’m surprised if you literally converted your CUDA version to OpenCL and didn’t get any speedup beyond your C code… of course the OpenCL implementations are pretty young at this point so there may still be compiler problems holding it back.
    In any case I’ll follow up over e-mail as you’ve captured my curiosity 🙂

  • http://mcx.sf.net fangq

    OK, now you have my OpenCL code and I hope you have some fun with it. Just to reiterate, the 300x speed-up is NOT just about GPU vs. CPU; it is about massively parallel vs. single-threaded. The reason I did not begin with an SSE/multi-threaded CPU code is that such code did not exist when I wrote the paper.
    FYI, the bottleneck for the single-threaded CPU code is the math functions, and the bottleneck for the GPU code is the register count.

  • Nutti

    ” But to answer your other questions, it’s probably not likely that an OS would be solely based on a GPU. At NVIDIA, we believe in co-processing or hybrid computing if you will, the use of GPUs and CPUs together, each doing what it does best.
    CPUs excel at sequential tasks like running an OS where each interaction could be different from the last, and GPUs excel at parallel tasks, like crunching through large, computationally intensive data sets that require the same calculation to be done over and over again. ”
    So if CPUs and GPUs do different things, as we all know, why compare them at all? It’s like comparing dishwashers and washing machines by putting the dishes in the washing machine and the clothes in the dishwasher.

  • Aqeel Mahesri

    I find this blog post kind of embarrassing. Will NVIDIA PR please hire some grown-ups?
    I question the decision to let the marketing side even write the response to this paper. The paper itself is a solid piece of peer-reviewed research and is done quite fairly. It is certainly much more fair than the 300X speedups that Andy Keane is touting.
    The attack on Intel for using un-optimized code on GPUs suggests that Andy Keane didn’t even read the paper!
    The paper is hardly saying that GPUs perform poorly. It says that year-old GPUs manage to outperform Intel’s latest chips by 2.5x-14x even after a lot of optimization work. That’s a pretty good result.

  • sea.bird

    Let’s do a simple computation to estimate the speed of the CPU counterpart of Fang’s program.
    The 8800GT has 112 cores and runs at 600MHz. The Tesla C1060 has 240 cores and runs at 1300MHz with 933 GFLOPS of single-precision performance. Hence, 200 GFLOPS is a reasonable estimate for the 8800GT.
    Assume Fang’s GPU program fully utilizes the performance of the 8800GT. If the GPU program is 300 times faster than the CPU program, then the CPU program runs at only 0.67 GFLOPS.
    As we know, the Xeon 5120 runs at 1.86GHz, and its SSE unit is 128 bits wide (4 single-precision floats). Hence, the top speed of one Xeon 5120 core is about 1.86*4 = 7.4 GFLOPS. So Fang’s CPU program utilized only about 10% of one CPU core (5% of the dual-core Xeon 5120).

  • sea.bird

    I briefly studied Fang’s CPU code. The interesting part is the result saving (lines 437 to 473 of the code), which saves a result for each photon. For a simulation with 10^9 photons, the result file could easily reach hundreds of GB, which may take hours to write to disk.

  • cpu-gpu

    A very important thing in the research community is how you present the data. The Debunking paper is a good example; the 300X paper, on the other hand, is a “not so good” example.
    Let’s use the 300X paper as an example.
    The author said “The point of the paper is, when your goal is to get fluence over a few mm away from our source, non-atomic/single-precision/massively-parallel GPU kernels can give you meaningful solution and is much faster than a (widely used) single-threaded CPU code.”
    Then my question is: did you ever optimize your “widely used” CPU code? If a researcher cannot write high-performance CPU code (maybe he doesn’t have time for his “widely used” CPU code), how can we believe he/she can write high-performance GPU code? Also, don’t compare only with your own old code; compare with code from others. For this paper, why not compare with MCML? The author could easily set up an MCML-compatible geometry (there is a time-resolved MCML version: “Time-resolved photon emission from layered turbid media”) to do the comparison.

  • http://mcx.sf.net fangq

    Hi sea.bird, I am afraid that assumption is incorrect. All photon weights are first summed into a 3D volume (float *gfield), and only the final fluence volume is saved after the calculation. For a 60x60x60 grid with single-precision fluence, the resulting file is only 0.8MB (if you simulate multiple time-gates, you also need to multiply by the number of time-gates, but generally this is fewer than 50).
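    To make that concrete, here is a minimal, illustrative CUDA sketch of this kind of accumulation. It is not the actual MCX kernel – the kernel name, the toy photon data and the launch configuration are simplified assumptions for the example – but it shows the idea of summing weights into the 3D fluence volume on the device, with only the final ~0.8MB grid copied back to the host:

        #include <stdio.h>
        #include <cuda_runtime.h>

        #define NX 60
        #define NY 60
        #define NZ 60
        #define NSTEPS 1024   /* toy number of photon deposits for the sketch */

        /* Each thread deposits the weight of one photon step into the shared 3D
           fluence volume. atomicAdd avoids lost updates when several threads hit
           the same voxel (float atomicAdd needs a compute-capability-2.0 GPU;
           non-atomic variants are what older hardware such as the 8800GT used). */
        __global__ void deposit(const int *vx, const int *vy, const int *vz,
                                const float *w, int nsteps, float *gfield)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < nsteps) {
                int idx = (vz[i] * NY + vy[i]) * NX + vx[i];
                atomicAdd(&gfield[idx], w[i]);
            }
        }

        int main(void)
        {
            int hx[NSTEPS], hy[NSTEPS], hz[NSTEPS];
            float hw[NSTEPS];
            for (int i = 0; i < NSTEPS; ++i) {   /* made-up photon positions/weights */
                hx[i] = i % NX; hy[i] = (i / NX) % NY; hz[i] = 0;
                hw[i] = 0.001f;
            }

            int *dx, *dy, *dz; float *dw, *dfield;
            cudaMalloc((void **)&dx, sizeof(hx));
            cudaMalloc((void **)&dy, sizeof(hy));
            cudaMalloc((void **)&dz, sizeof(hz));
            cudaMalloc((void **)&dw, sizeof(hw));
            cudaMalloc((void **)&dfield, NX * NY * NZ * sizeof(float));
            cudaMemset(dfield, 0, NX * NY * NZ * sizeof(float));
            cudaMemcpy(dx, hx, sizeof(hx), cudaMemcpyHostToDevice);
            cudaMemcpy(dy, hy, sizeof(hy), cudaMemcpyHostToDevice);
            cudaMemcpy(dz, hz, sizeof(hz), cudaMemcpyHostToDevice);
            cudaMemcpy(dw, hw, sizeof(hw), cudaMemcpyHostToDevice);

            deposit<<<(NSTEPS + 255) / 256, 256>>>(dx, dy, dz, dw, NSTEPS, dfield);

            /* only the final 60x60x60 volume (~0.8MB) comes back and is written out */
            static float gfield[NX * NY * NZ];
            cudaMemcpy(gfield, dfield, sizeof(gfield), cudaMemcpyDeviceToHost);
            printf("fluence at voxel 0: %f\n", gfield[0]);
            return 0;
        }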

  • http://mcx.sf.net fangq

    Well, the CPU code was originally written by an established researcher in the field over 10 years ago, so I don’t feel terribly bad if the CPU code is not reaching its full theoretical throughput. Moreover, it is not realistic to expect real-world computational code to have a throughput anywhere near the theoretical value, as it needs to do lots of math, lots of random memory I/O, flow control, etc. I seriously doubt you can improve the code to reach 50% of the theoretical throughput (single core), but nothing prevents you from trying. You can find the download link for the CPU code in my previous posts. Since I have my own OpenCL code, and SSE was enabled when running the OpenCL code on a CPU back-end, I can tell you I could only get about 2x faster than the legacy CPU code, and that’s about it. Please do let me know if you can beat this 🙂

  • http://mcx.sf.net fangq

    quote cpu-gpu “did you ever optimize your “widely used” CPU code?”
    Why do you assume I did not? The CPU code was profiled and optimized, and the speed reported in the paper was what I got (the code is open-source anyway). As I responded to sea.bird a moment ago, with the OpenCL code I wrote I could get a 2x speed-up with SSE enabled on a single core of the 5120, but that doesn’t change my conclusion. In fact, I encourage anyone who doubts these speeds to really spend time optimizing the codes, so you can both prove your point and benefit the community.
    Since you mentioned MCML, I am sure you also know that MCML can only model layered media with axial symmetry. That’s a big modeling difference from tMCimg and MCX, which can model arbitrary 3D random media in a voxelated space.

  • http://mcx.sf.net fangq

    quote cpu-gpu “If a researcher could not write a high performance CPU code (maybe he don’t have time on his “widely used” CPU code), how can we believe he/she can write a high performance GPU code?”
    I don’t want to irritate anyone here, but does this sound like a scientific discussion to you?
    I am not taking sides on this and I don’t work for either NVIDIA or Intel. I just want to give people a balanced view of the publications involved in this argument: yes, these >100x GPU papers are peer-reviewed papers too! They were published for their own reasons, perhaps very legitimate ones. You have to read, understand, and even play with the codes in order to make a judgment. Speculation does not help at all!

  • sea.bird

    “Moreover, it is not realistic to expect real-world computational code to have a throughput anywhere near the theoretical value, as it needs to do lots of math, lots of random memory I/O, flow control, etc.”
    In the Top500 list, Jaguar (a huge cluster) can reach 75% of the peak performance on Linpack. It is realistic to expect a high-performance real-world program to reach >50% peak performance.
    I just want to point out one ground truth:
    The ratio between the raw computational power of an 8800GT card and one Xeon 5120 CPU core is no more than 30 to 40 times. Now, assume both the GPU and CPU programs achieve the same performance ratio (performance / peak raw power); then your new GPU program will be 30 to 40 times faster than the CPU program. Given that it is harder for a multi-threaded program to reach the same performance ratio as a single-threaded program, the gain should be even smaller.
    But your reported gain is 300X!!!! Just think about this number: the only way to get it is if your old CPU code utilizes less than 10% (maybe 5%) of the raw computational power of one Xeon 5120 core. If you can reach >50% of theoretical GPU power, why not spend one or two weeks on your CPU code to also reach 50% of theoretical CPU power?
    Also, based on your claim that “it is not realistic to expect real-world computational code to have a throughput anywhere near the theoretical value”, your GPU code also cannot be anywhere near the theoretical value; let’s say it reaches 25% of the raw power of the 8800GT GPU. Then, based on your 300X figure, your CPU code utilizes <3% of one Xeon 5120 core. Think about it: a program that utilizes only 3% of one CPU core!
    By the way, I have no interest in improving your code; it is your own business. But remember, the number “300X” is in your own paper; it is there and will stay there for a very long time. There is no way to erase it. If you want to put a wow-factor number in your paper next time, do your best to make it solid.