Faster sorting with SIMD CUDA intrinsics (2024)
Full code on GitHub: https://github.com/wiwa/blog-code/

Hi

Recently, I finished a batch at the Recurse Center… is what I would have said if this post had been written when I intended to write it (i.e., three months ago). My project there focused on a questionable application of CUDA (mostly irrelevant to this post), but it got me thinking about other GPU-friendly algorithms. Instead of covering my Recurse project (which I hope to write about in a later post), I want to simply begin writing about technical things I've played around with.
I'll go over the context behind the algorithm, a few basics of SIMD programming, a CUDA implementation, and how a small optimization grants it a +30% performance uplift. Although SIMD is a parallel programming model, the term is also used to refer to "vector extensions" of CPU ISAs, such as AVX (x86) and NEON (ARM). Vector instructions let us easily hardware-accelerate data-parallel algorithms like sorting networks, where each element is "small", i.e., about the size of a machine word (64 bits).