by Ian Buck

Hey there, my name is Ian Buck and I’m the Software Director of GPU Computing here at NVIDIA. I helped start the CUDA team six years ago and have had the pleasure of watching it grow and change the world of high-performance computing. Before my time at NVIDIA, I was the development lead on Brook, a forerunner of general-purpose computing on GPUs.

I wanted to introduce myself because I’ll be the guest speaker in a live text chat here on the blog on Tuesday, July 13, at 11am PDT. I’ll be dishing out my thoughts on GPU computing and answering your questions, such as how we came up with the name “CUDA.” You can set an e-mail reminder for yourself in the widget below, or you can bookmark this post, which will be updated with the chat window on Tuesday.

This is actually the first Live Chat in a blog series we’re hosting prior to the GPU Technology Conference (GTC) in September. If you haven’t heard about GTC, here’s a little info, and here’s a link to all GTC-related blog posts. I’ll be speaking at GTC this year about the GPU’s evolution as a general computing processor; last year I gave a talk called From Brook to CUDA.

The live chat is a cool opportunity prior to GTC for speakers like me to engage with you, and maybe even get some ideas about what we should focus on. I’m always interested in hearing about new and interesting applications for GPUs.

If you have any questions for me, but won’t be able to attend the live chat, please drop them in the comments below. We’ll pick a few, and of course give you credit.

Looking forward to chatting…

Comments

  • oscarbg

    Hi Ian,
    I have a lot of technical, but I hope interesting, questions to ask you:
    *Can you provide a CUDA roadmap through the end of the year, similar to the AMD Stream SDK roadmap I just found?
    *About the remaining Fermi features to be exposed in CUDA:
    Fermi seems to have 3D grid support, since DirectCompute 5.0 needs it, but CUDA is limited to 2D grids of thread blocks. Is that limitation going away in CUDA/OpenCL soon?
    *Now that CUDA 3.1 has function pointer and recursion support, what about stack allocation support (the calloc C call) and malloc inside CUDA kernels? (I have read the XMalloc paper you coauthored, so it seems work is in progress.)
    Can you provide a timeframe for the availability of these features? Also, can you speak to situations where Fermi’s support for host function calls makes sense, and say when to expect CUDA to support it?
    And finally, from reading Hwu’s CUDA book it seems Fermi supports kernel cancellation, so are you adding it to CUDA soon? Will it be a host function, or callable inside a kernel? For example, in massive search problems where each thread searches through a portion of the search space, a thread could cancel the kernel when it finds an item.
    *Reading the OpenGL 4.0 ARB draw_indirect extension, it seems a similar CUDA trick would be the ability to launch kernels with grid and block dimensions read from GPU memory, without a GPU-to-host transfer. That would be useful, for example, for a kernel that compacts data for a later kernel operating on that data: the grid and block dimensions should be set by the compacting kernel and be related to the number of compacted items, which lives in GPU memory. In this case, avoiding the GPU-to-host transfer of the compacted data size could allow a 2x speedup, assuming kernel launch time is similar to the host-to-GPU latency of sending a few bytes.
    *It seems AMD has some hardware features missing on NVIDIA GPUs. What do you think about:
    *Shared registers per SM, where every thread on the same SM can access some per-SM registers. In fact, every thread in a warp can access a different set of shared registers, but this allows blocks running on the same SM to communicate. This seems useful for faster reductions, for example. Right now in CUDA, the register space assigned to one thread is not visible to other threads.
    *Global Data Share: could it be useful to implement in hardware and expose in CUDA? Can it be emulated on Fermi already, with threads sharing data via the L2 cache?
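For readers unfamiliar with the in-kernel allocation oscarbg asks about, here is a sketch of what device-side malloc/free might look like. The feature was not yet exposed in CUDA when this was written, so the semantics shown here (per-thread heap allocations that persist until explicitly freed, with the standard C names) are assumptions for illustration, not a confirmed API:

```cuda
// Hypothetical sketch of in-kernel dynamic allocation on Fermi.
__global__ void build_list(int **out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        // Each thread allocates its own scratch buffer from a device heap.
        int *buf = (int *)malloc(4 * sizeof(int));
        if (buf != NULL) {
            for (int i = 0; i < 4; ++i)
                buf[i] = tid * 4 + i;
            out[tid] = buf;   // publish the pointer for a later kernel
        }
    }
}

__global__ void free_list(int **out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n && out[tid] != NULL)
        free(out[tid]);       // allocations persist until explicitly freed
}
```

The interesting design questions oscarbg's citation of the XMalloc paper hints at are how such an allocator avoids serializing thousands of concurrent threads on a shared heap, and how large that heap is allowed to grow.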

  • avidday

    I would be interested to hear Ian’s thoughts on the “brand-new” compute 2.1 devices from a CUDA/software perspective, especially the new out-of-order style instruction issue features of the 48 core SM.

  • Ahmad Lashgar

    I have 2 questions:
    1- How does the G80 SM unit execute 32 threads (a single warp) at the same time? The SM only has 8 scalar SP units. Erik Lindholm et al. mention it is done in 2 groups of 16 threads over 4 cycles, but how? What happens in every single one of the 4 cycles?
    2- Is it right that each SFU is a vector unit (4-wide SIMD)?

  • Doug

    I too would be interested in hearing about the challenges of creating a thread scheduling and compilation system which is capable of supporting 32 and 48 core SMs (GF100/GF104) with out-of-order scheduling required on the 48 core SM for maximum instruction throughput.

  • Doug

    Also, what are your thoughts concerning the recent ISCA’10 article, “Debunking the 100x GPU vs. CPU Myth …”?
    Specifically, are there any plans to change or improve the compilation and/or scheduling system based on the analysis in this article?
    I think the article points out the need for a throughput-oriented SPEC benchmark to assess throughput-oriented architectures such as GPUs. In the article, the authors have provided a “first cut” of what a possible SPEC throughput benchmark suite might look like.
    What are your thoughts about this and do you believe NVIDIA is willing to push for the addition of such a benchmark suite to the SPEC Consortium?

  • Bystander

    1) Can we get, with acceptable performance, a single shared address space between the CPU and GPU? It would not be surprising if your page-table hardware were already almost capable of dealing with this.
    2) Going to the next level, how about a multi-GPU address space? A single address space, but obviously with non-uniform memory access. Presumably you’d need to add physical address bits and some inter-GPU glue logic, but it would simplify multi-GPU programming a lot, IMHO.
    3) Can you leverage flash to back the graphics memory and let us swap TB-sized data sets? This would let us bypass the arguments about disk I/O and overpriced SAN setups.
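On Bystander's first question: the closest thing the CUDA runtime already exposes is mapped (zero-copy) pinned memory, where a single host allocation is visible to both CPU and GPU. A minimal sketch using the existing runtime API (error checking omitted; this is not a full shared address space, since GPU accesses traverse PCIe):

```cuda
#include <cuda_runtime.h>

int main()
{
    // Allow pinned host memory to be mapped into the device address space.
    // Must be set before the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *h_ptr, *d_ptr;
    // Pinned host allocation that the GPU can access directly.
    cudaHostAlloc(&h_ptr, 1024 * sizeof(float), cudaHostAllocMapped);
    // Obtain the device-side alias for the same physical memory.
    cudaHostGetDevicePointer(&d_ptr, h_ptr, 0);

    // A kernel launched with d_ptr reads and writes the host buffer in
    // place, so CPU and GPU share one region without explicit cudaMemcpy.

    cudaFreeHost(h_ptr);
    return 0;
}
```

What Bystander is asking for goes further: one pointer value valid on both sides, backed by ordinary pageable memory, rather than a special mapped allocation with a separate device alias.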

  • Bystander

    Also, in terms of product segmentation, it’s unfortunate that the $500 cards have crippled double-precision throughput relative to the high-end compute nodes.
    Are ECC support and perhaps very large memory capacity not sufficient to segment the market without dropping the number of double-precision flops on the lower-end cards? It weakens the CPU vs. GPU case for us (we can use the high-end cards in our servers, but we would like clients to be able to run on their desktops without dedicated compute cards in place of GPUs).

  • Manfred

    I bought a used MSI GeForce 7600 GS 512 MB AGP (passively cooled) card. Whenever I install a newer driver (e.g. 257.21 WHQL), I get a “CPLINSTALLFAIL_DRIVERUNINSTALL” message and the installation does a rollback. Only 185.85 from the MSI website works. What’s wrong?