An Almost Pointless Exercise in GPU Optimization
My experience converting a multi-threaded C++ application to run faster on a GPU, and how to interpret Nsight Compute recommendations to improve an algorithm on the GPU.
Early on, I decided to get comfortable with the Nsight Compute tool so I could analyse the GPU portion of my code in spectacular detail, rather than relying on intuition and the high-level utilization figures from nvidia-smi. (This introduces the only inter-thread synchronization mechanism needed so far: a CUDA atomic operation that reads and increments the next_deal_to_play index. It barely affects speed, but it removes the race condition.) Thankfully for my program, using this involves only a modest code change: the inner game loop simply runs for fewer iterations before the local backlog is exhausted and has to be replenished.
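To make the atomic read-and-increment idea concrete, here is a minimal host-side sketch in standard C++. It is not the article's GPU code. It uses std::atomic's fetch_add as a stand-in for CUDA's atomicAdd, which likewise returns the value held before the increment. Only the name next_deal_to_play comes from the text; claim_next_deal, play_all_deals, and the worker structure are hypothetical and exist purely for illustration.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Shared work index. In the GPU version described above this would live in
// device memory and be advanced with a CUDA atomic; here it is a host atomic.
std::atomic<int> next_deal_to_play{0};

// Hypothetical helper: claim the next unplayed deal.
// fetch_add returns the value held *before* the increment, so the read and
// the increment are one indivisible step and no two threads can receive the
// same index. Returns -1 once every deal has been handed out.
int claim_next_deal(int total_deals) {
    int deal = next_deal_to_play.fetch_add(1, std::memory_order_relaxed);
    return deal < total_deals ? deal : -1;
}

// Hypothetical driver: start num_threads workers that keep claiming deals
// until none remain, then report how many times each deal was claimed.
std::vector<int> play_all_deals(int total_deals, int num_threads) {
    next_deal_to_play.store(0);

    // Each worker records its claims in its own list, so the only shared
    // mutable state is the atomic index itself.
    std::vector<std::vector<int>> claimed(num_threads);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&claimed, t, total_deals] {
            for (int deal = claim_next_deal(total_deals); deal >= 0;
                 deal = claim_next_deal(total_deals)) {
                claimed[t].push_back(deal);  // "play" the deal
            }
        });
    }
    for (auto& w : workers) w.join();

    // Tally the claims after all workers have finished.
    std::vector<int> times_claimed(total_deals, 0);
    for (const auto& list : claimed) {
        for (int deal : list) ++times_claimed[deal];
    }
    return times_claimed;
}
```

Calling play_all_deals(1000, 8) should report each deal claimed exactly once. With a plain int and a separate read followed by a write, two threads could observe the same index and play the same deal twice, which is the race condition the atomic removes.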