• http://profiles.google.com/rtfss1 rtfss none

    First, congratulations on sharing such early experiences with GK110 with everyone.

    I hope you have time to answer some questions and suggestions (and I hope they aren’t too stupid after all):

    *To me, the code is not the same as the version without dynamic parallelism, since you could also use 2 different streams and then get concurrent kernel execution (which should work with Hyper-Q avoiding false dependencies). With those optimizations, should there still be a 2x speedup?

    *Does this sample scale over GK104? I mean, does the code without dynamic parallelism scale over GK104 nearly proportionally to #SMs GK110 / #SMs GK104, assuming the same SM clocks?

    *This “simple quicksort code” does not seem to be the most efficient; I mean, it would not achieve the peak speed of Duane Merrill’s sort code integrated into Thrust. Do you expect that using dynamic parallelism in the highest-performance codes, such as Duane Merrill’s or MSORT, would also bring such high speedups (2x) over the current code on GK110? In other words, can we expect very high sorting speedups on GK110 hardware thanks to that?

    *Currently the best sorting rates for “small” arrays seem to be achieved, by a wide margin, by HotSort from PixelIO. I hope Nvidia and PixelIO collaborate to bring even faster sorting codes using GK110 features.
    Which brings me to the next question (sorry, this is not the correct place to post it; I hope you can redirect me or tell me where to post while the CUDA forums are offline):

    *I think NV should start something like the Intel ManyCore Testing Lab, where they provide very expensive ($$$) hardware (such as 40 cores on one host, and soon, they said, Xeon Phis) for academics to test code remotely for free. I hope you can put some Tesla K20s in and start a similar service for students who can’t access  

    *I hope you put this sample in the SDK as you say.
    As for suggestions for SDK samples using dynamic parallelism, a “simple” multigrid code, say for solving the Poisson equation, would be good.

  • http://twitter.com/heg531 @heg53

    I have a new algorithm of my own. It is based on the divide-and-conquer principle and I have programmed it in C. I think it could be computed on this architecture. Could you program it for the K20 if I send the code?
    regards

  • nvjones

     Thanks for the detailed response! You’ve asked a lot of questions, so I’ll work through them one at a time (I’m numbering based on your * comments above).

    1. The “without dynamic parallelism” code does use separate streams for each launch of a stage (see the last 3 lines of the code sample), so this should be a fair comparison. The limiter is not concurrency between kernels of a given stage, but rather that each stage must finish before the CPU can launch the next one.

    2. The graph shows both runs performed on a K20. I have not compared the host-launch algorithm performance between K10 and K20, although I would expect to see equivalent behaviour because the limiting factor is the inter-stage synchronisation overhead.

    3. Quicksort is a comparison sort, which allows sorting of arbitrary data, so it cannot be compared with Radix Sort, which can only perform bitfield sorts (for example, you could not radix-sort complex numbers based on modulus). The time complexity of the two algorithms is different because they perform different tasks. There is no need to use dynamic parallelism for bitfield sorts such as radix sort, because there is no intrinsic data dependence and the GPU already performs very well at these.

    4. I’m afraid I can’t comment on HotSort’s approach, although it appears that they also do a bitfield sort and so again it is not directly comparable.

    5. Thank you for the suggestion.

    6. The SDK for CUDA 5.0 will indeed have both a simple and an advanced Quicksort sample. These are designed to illustrate the programming model so we’ve kept them as simple as possible, but you’ll be able to see how a basic partitioning function would work.

    Thanks again for good feedback!
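
To make point 3 concrete, here is a minimal host-side C++ sketch (not taken from the post, and not GPU code) of why a comparison sort handles the complex-modulus case: all it needs is a less-than predicate. The names `by_modulus` and `sort_by_modulus` are hypothetical, chosen only for illustration, and `std::sort` stands in for quicksort as a generic comparison sort.

```cpp
#include <algorithm>
#include <complex>
#include <vector>

// Hypothetical comparator: orders complex numbers by modulus.
// std::norm returns |z|^2, which preserves the ordering of |z|
// and avoids computing a square root for every comparison.
bool by_modulus(const std::complex<double>& a, const std::complex<double>& b) {
    return std::norm(a) < std::norm(b);
}

// Hypothetical helper: returns a copy of the input sorted by increasing modulus.
// A comparison sort never inspects the bits of the key; it only calls the
// predicate, so any data with a well-defined "less than" can be sorted.
std::vector<std::complex<double>> sort_by_modulus(std::vector<std::complex<double>> values) {
    std::sort(values.begin(), values.end(), by_modulus);
    return values;
}
```

Calling `sort_by_modulus` on 3+4i, 1, -2i and -1+i would order them by moduli 1, sqrt(2), 2 and 5. A radix sort, by contrast, works digit by digit on the key itself, which is why it applies to bitfield keys but not to an ordering defined only through a predicate like this one.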

  • nvjones

    We’re always interested to hear about algorithms where dynamic parallelism might apply – could you be more specific about what you are working on? I’m afraid I won’t personally be able to help with porting your code, but I’ve often found the people on http://www.stackoverflow.com to offer good advice and help.

  • http://profiles.google.com/rtfss1 rtfss none

     Many thanks for your time and detailed response. It shows I have more excitement than knowledge about fast GPU sorters :-)

    I can’t promise, but I think this is the last mega post here, so I won’t take up more of your time. I have some suggestions for the 5.0 SDK that I have been thinking about lately. Sorry for posting them here but, as I said, using the Nvidia forums is not an option right now.

    Before that, I just want to say I’m happy NV is receiving the suggestion of starting something like the Intel Many Core Testing Lab well. I hope it materializes soon after the Tesla K20 release.

    Here are the suggestions for the CUDA 5.0 SDK:
    (I’m mostly interested in graphics/CUDA interop, for reasons that are clearer at the bottom)

    *CUDA 5.0 RC seems to ship with a CUDA BLAS device library for GK110, although there is currently no documentation and perhaps not even support in the CUBLAS headers for using it. I hope this gets fixed and that a simple sample using CUBLAS on the device appears.

    *It would be good to ship a simple Hyper-Q concurrent-kernel example, perhaps based on the concurrentkernel example in the SDK, that shows a case where Fermi concurrent kernel execution can’t be exploited because of false serialization in the single hardware queue, wouldn’t it?

    *The new CUDA 5.0 texture object seems to be for exposing Kepler bindless textures, but whether those objects can be created from “regular” OpenGL textures or even OGL bindless textures (via NV_bindless_texture) is a good question. So it would be good to have:
    ->a sample showing creation of CUDA texture objects from regular OGL textures and the new OGL bindless textures.

    *The new CUDA 5.0 texture object (for using bindless textures and surfaces) seems to allow compressed texture formats (even the new BPTC format) via the cudaResourceViewDesc option. It’s a shame that texture objects are a Kepler-only feature and I don’t have a Kepler right now, but I hope that works. So here are some suggestions on compressed texture support in CUDA:

    ->ship a sample in the SDK showing the use of compressed textures via texture objects
    ->in the same sample or a new one, show creation of texture objects from OpenGL/D3D compressed textures. I don’t know if that is possible right now; if not, then for future CUDA versions.

    Now for future CUDA versions:

    ->allow compressed textures with “standard” textures. This would allow working with them on Fermi and older GPUs.

    *In this blog NV has also shown good scaling of MPI codes via Hyper-Q, and that seems to exploit multiple host processes sharing a single CUDA context using the nv-proxy-control nv-proxy-server tech. It would make sense to provide SDK documentation and/or an example of exploiting that feature for general processes not related to MPI codes, i.e. basically how to exploit the Hyper-Q and concurrent kernels features to execute multiple GPU processes concurrently via one CUDA context.

    *Can we expect to see a sample in the final SDK exploiting the new H.264 hardware encoder in Kepler GPUs? It would be good to have a sample that shows how to directly encode the visualization of GPU simulation results into a video stream. For example, extending the Nbody sample so that it compresses an H.264 video of what is shown on screen would be a good demonstration that the H.264 hardware encoder is independent of the GPU SMs and doesn’t affect the performance of the simulation.

    Now I’m finished.. mostly..

    I’m interested in CudaRaster (OpenGL-like rendering via CUDA), OptiX (ray tracing language via CUDA) and VoxelPipe (3D rasterization, i.e. voxelization via CUDA with shaders), so I think that for extended programmable pipelines to be a success, one great step is to expose all the functionality of graphics shading languages to CUDA. For that reason I’m expecting/suggesting that future CUDA versions expose the following (I think these are not supported now):
    *Multisample textures with new CUDA functions (tex2DMSAA() and the like)
    *Support for creating textures from the OpenGL/D3D depth buffer (hybrid ray tracing currently requires a depth-to-color copy in the graphics APIs)
    *Compressed textures for Fermi and below, not only via texture objects
    *I think the new OGL 4/D3D 11 gather4 instructions aren’t in CUDA yet either
    Please see http://anteru.net/2011/12/06/1815/ for another person interested in this (although he is interested in OpenCL support).

    Many thanks..

  • http://twitter.com/heg531 @heg53

    Thank you very much for taking the time to answer. I would be glad to send an article and the C source code so you can see whether it can run on this architecture. I hope it can be useful and that we can use it as an example of numerical efficiency. The complexity of this algorithm is exponential and, if all goes well, it may be reduced to polynomial complexity. If everything turns out well, I would hope that we could publish something and you could advertise the algorithm as an example. Could you give me an email address where I can send this safely?

    regards

  • Jackson Beatty

    One or more code samples using Thrust in the context of dynamic parallelism would be very helpful.

  • Cristobal Navarro

    One small question: do you first call the kernel with one block of 1 thread? I’m not sure about that.

  • Sagar Rawal

    Congratulations on a fantastic demonstration of the hardware prowess of Tesla K20!

    The anticipation builds in everyone for the release of such a groundbreaking product!

  • david macpherson

    Stephen,

    Thanks for the quicksort example. Perhaps including a link to the partition function & a makefile would be helpful.

  • Biao Wang

    I have the same question: what is the grid configuration that is replaced by the three dots in each kernel call?

  • Jack Jones

    It would be nice to have the partition source code…

  • Jonh Rain

    Hello. I just got a GTX660 thinking that I could use dynamic parallelism, but it seems only the GTX TITAN can do it? Is that correct?

  • tpofofnt

    Stephen,

    In a recent talk, you made the comment that the overhead for launching a kernel on the device is precisely the same as the overhead for launching a kernel on the host. You went on to say that if device kernel launches were batched, say in a batch of 250 launches, the overhead for each kernel launch would be 1/250 the overhead of a host kernel launch.

    What is this batch kernel launch from the device you speak of? Are you referring to the situation where every thread from a parent kernel launches the same child kernel?

    Thanks in advance!

    Mitch Horton