by Ian Buck

Tuesday’s live chat was a real success. Thanks to all of you who joined us and participated. If you missed the chat, you can watch the replay here. We were pleasantly surprised to receive so many great questions. While we couldn’t answer all of them during the chat itself, we’ve taken the chance to review and answer many of them here. Take a read and let us know what you think, or if you have additional questions for me.

Reminder: The next chat takes place Thursday, July 22, at 11 a.m. PDT with David Kirk. Stay tuned to the blog for more details.

Question from blackpawn:
I'm curious if there are any future plans or thoughts on exposing the hardware rasterizer to CUDA so we can say goodbye to OpenGL and DirectX. 🙂

Right now we don't have any plans for that. Graphics APIs are designed and tuned to be great at graphics. We have focused on good interoperability between CUDA C/C++ and the graphics APIs so you can easily switch back and forth without copying data, using the best tool for the job.

Question from Steve W:
Fermi supports __syncthreads_count(), letting you do a very simple kind of parallel sum without having to code the reduction itself. Is this implemented by microcode, library code, or is it an actual hardware op? Reduction (and prefix sum) are so useful and common. Could future GPUs be extended to do block-wide reductions or prefix-sum computations as a single op (maybe over a few clocks), or is the classic manual way pretty much as efficient as can be expected?

The atomic operations to shared memory should help you out in these cases. Check out the CUDA Programming Guide for Fermi.
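To make the two approaches concrete, here is a minimal sketch that counts the positive elements in each block twice: once with __syncthreads_count() and once with atomicAdd() on a shared-memory counter. The kernel name `countPositive`, the buffer names, and the test data are hypothetical, chosen only for illustration. The sketch assumes a Fermi-class or newer GPU (compute capability 2.0 or higher) and says nothing about how either operation is implemented in hardware.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: per-block count of positive inputs, computed two ways.
__global__ void countPositive(const int *in, int n, int *viaSync, int *viaAtomic)
{
    __shared__ int counter;              // one counter per block
    if (threadIdx.x == 0) counter = 0;
    __syncthreads();                     // counter is zeroed before anyone adds

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int pred = (i < n) && (in[i] > 0);   // out-of-range threads contribute 0

    // Way 1: barrier that also returns how many threads had a non-zero pred.
    // Every thread in the block must reach this call.
    int total = __syncthreads_count(pred);

    // Way 2: atomic increment of the shared-memory counter.
    if (pred) atomicAdd(&counter, 1);
    __syncthreads();                     // all increments done before reading

    if (threadIdx.x == 0) {
        viaSync[blockIdx.x] = total;
        viaAtomic[blockIdx.x] = counter;
    }
}

int main()
{
    const int n = 1000, threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Illustrative data: every third element is positive.
    std::vector<int> h_in(n);
    int expected = 0;
    for (int i = 0; i < n; ++i) {
        h_in[i] = (i % 3 == 0) ? 1 : -1;
        if (h_in[i] > 0) ++expected;
    }

    int *d_in, *d_sync, *d_atomic;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_sync, blocks * sizeof(int));
    cudaMalloc(&d_atomic, blocks * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    countPositive<<<blocks, threads>>>(d_in, n, d_sync, d_atomic);
    assert(cudaDeviceSynchronize() == cudaSuccess);

    std::vector<int> h_sync(blocks), h_atomic(blocks);
    cudaMemcpy(h_sync.data(), d_sync, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_atomic.data(), d_atomic, blocks * sizeof(int), cudaMemcpyDeviceToHost);

    // Sum the per-block counts on the host and compare with the reference.
    int sumSync = 0, sumAtomic = 0;
    for (int b = 0; b < blocks; ++b) {
        sumSync += h_sync[b];
        sumAtomic += h_atomic[b];
    }
    assert(sumSync == expected);
    assert(sumAtomic == expected);
    printf("both methods agree with the host count\n");

    cudaFree(d_in);
    cudaFree(d_sync);
    cudaFree(d_atomic);
    return 0;
}
```

Note that __syncthreads_count() only counts a per-thread predicate, so summing arbitrary values per block still needs either the shared-memory atomic shown here or a classic tree reduction.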

Question from VizWorld:
Does NVIDIA believe it can sustain its amazing performance boosts simply by adding more cores to the die? We've yet to see the 512-core Fermis; is the next 'step' a 1024-core?

Certainly for HPC, the problem sizes are very large. While increasing the core count can limit you on small problems, we haven't gotten close to the limits of typical problem sizes.

Question from VizWorld:
Currently the Tesla & GeForce/Quadro lines are only differentiated by the Tesla's lack of monitor connectivity. Do you envision a future where the Tesla system grows to something vastly different from the GeForce (e.g., multiple GPUs in high-order connectivity structures, radically different memory architectures, etc.)?

Our Tesla products are very specialized for HPC today, with support for double precision, larger memory sizes, ECC support, cluster management features, and more. Quadro products are similarly focused on the needs of professional customers. GeForce products are designed to delight gamers with award-winning performance in DirectX 11, 10, and 9 games; awesome visuals, courtesy of superior anti-aliasing and multi-sampling algorithms; real-time PhysX effects accelerated by the GPU for ultra-realistic gaming environments; and support for 3D Vision technology, delivering the industry’s only consumer stereoscopic solution ideal for 3D gaming, 3D Internet streaming, and Blu-ray 3D.

Question from Ahmed Helal:
How costly (in terms of cycles) is context switching in CUDA (switching from a thread to another)?

Pretty quick: on the order of microseconds. The GPU will complete all the active work before switching to the next context.

Question from kiun:
Can we expect a way to debug kernels without the need for two computers any time soon?

On Linux, this is already supported via the cuda-gdb, Allinea DDT, and TotalView debuggers. On Windows, it’s a bit trickier: the window manager uses the GPU for graphics compositing, so hitting a breakpoint in your kernel means the window manager stops updating.

Question from Guest:
Is there any chance of getting cuda-gdb on the Mac?

We're looking into that.

Question from devarde:
Are there any plans to make a tool like Nexus that works on an OS other than Windows, such as Linux or Mac OS X?

On Windows, where the vast majority of application developers are using Visual Studio, it makes sense for us to invest the significant engineering effort to develop a solution that integrates with the IDE. On Linux, where there are so many different IDE-type solutions (and different versions of each), we have a different strategy. Instead of picking one Linux IDE that only a small subset of Linux developers use, we are defining low-level debugging and performance analysis APIs that tools ISVs can use to incorporate CUDA debugging and performance analysis into their existing solutions. Some examples include debuggers like Allinea DDT and TotalView, and performance analysis tools like TAU and Vampir.

Also, where there are simple things we can do, like ensuring cuda-gdb works well with Emacs and DDD, the support is already there.