
Were RNNs all we needed? A GPU programming perspective


A CUDA implementation of parallelizable GRUs and LSTMs, written for CS179.

The paper's core claim is that minor simplifications to LSTMs and GRUs let their recurrence be expressed in a form amenable to the parallel scan algorithm, reducing the O(T) sequential dependency to O(log T) parallel depth. My goal was to verify this claim by building both simplified models (minGRU and minLSTM) and a custom CUDA implementation of the parallel scan to see how much of a speedup was actually achievable. My first major optimization was to fuse the gate computations for all time steps into a single large kernel (min_gru_extract_scan_params_kernel) that uses shared-memory tiling to manage weights and inputs efficiently.
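Concretely, dropping the gates' dependence on h_{t-1} turns the minGRU update h_t = (1 - z_t) * h_{t-1} + z_t * h~_t into a first-order linear recurrence h_t = a_t * h_{t-1} + b_t with a_t = 1 - z_t and b_t = z_t * h~_t. Composing two such steps gives (a2*a1, a2*b1 + b2), which is associative, and that is exactly what a parallel scan needs to evaluate all T steps in O(log T) depth. The sketch below is illustrative rather than the project's actual code (AffinePair, scan_op, and min_gru_scan_kernel are names I made up): a minimal single-block Hillis-Steele scan over precomputed coefficients, for one batch element.

    #include <cuda_runtime.h>

    struct AffinePair {
        float a;   // multiplicative coefficient of the step, 1 - z_t
        float b;   // additive term of the step, z_t * h~_t
    };

    // Composing two recurrence steps: apply p1 first, then p2.
    __device__ AffinePair scan_op(AffinePair p1, AffinePair p2) {
        AffinePair out;
        out.a = p2.a * p1.a;
        out.b = p2.a * p1.b + p2.b;
        return out;
    }

    // Inclusive Hillis-Steele scan over T steps, one block per hidden unit.
    // Assumes T <= blockDim.x <= 1024; launch with T * sizeof(AffinePair)
    // bytes of dynamic shared memory. a and b are [T, H], time-major.
    __global__ void min_gru_scan_kernel(const float* a, const float* b,
                                        float* h_out, int T, int H) {
        extern __shared__ AffinePair buf[];
        int t   = threadIdx.x;
        int hid = blockIdx.x;

        if (t < T) {
            buf[t].a = a[t * H + hid];
            buf[t].b = b[t * H + hid];
        }
        __syncthreads();

        // After the round with stride `offset`, buf[t] holds the composition
        // of up to 2*offset consecutive steps ending at t.
        for (int offset = 1; offset < T; offset <<= 1) {
            AffinePair p;
            bool active = (t < T) && (t >= offset);
            if (active) p = scan_op(buf[t - offset], buf[t]);
            __syncthreads();
            if (active) buf[t] = p;
            __syncthreads();
        }

        // h_t = prefix_a * h_0 + prefix_b; with h_0 = 0 only the b term survives.
        if (t < T) h_out[t * H + hid] = buf[t].b;
    }

A production version would handle T > 1024 by scanning within blocks and then doing a second pass over per-block carries; the single-block form is just enough to show why the recurrence parallelizes at all.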
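The fused gate kernel exploits the same simplification from the other side: because z_t and h~_t depend only on x_t, the gates for every time step amount to one large matrix multiply that can be tiled through shared memory. Here is a rough sketch of what such a kernel might look like; fused_gate_kernel is a hypothetical stand-in for the post's min_gru_extract_scan_params_kernel, handles a single batch element, and omits bias terms for brevity. It computes both projections in one pass and emits the scan coefficients directly.

    #include <cuda_runtime.h>

    #define TILE 16

    // X is [T, D] (inputs); Wz and Wh are [H, D], row-major.
    // Outputs a and b are [T, H], matching the scan kernel's layout.
    // Launch with grid ((H+TILE-1)/TILE, (T+TILE-1)/TILE), block (TILE, TILE).
    __global__ void fused_gate_kernel(const float* Wz, const float* Wh,
                                      const float* X,
                                      float* a, float* b,
                                      int T, int D, int H) {
        __shared__ float x_tile[TILE][TILE];
        __shared__ float wz_tile[TILE][TILE];
        __shared__ float wh_tile[TILE][TILE];

        int t = blockIdx.y * TILE + threadIdx.y;   // time step this thread owns
        int h = blockIdx.x * TILE + threadIdx.x;   // hidden unit this thread owns

        float acc_z = 0.0f, acc_h = 0.0f;
        for (int k0 = 0; k0 < D; k0 += TILE) {
            // Stage one strip of the input and of each weight matrix;
            // the X tile is loaded once and reused for both projections.
            int k    = k0 + threadIdx.x;
            int hrow = blockIdx.x * TILE + threadIdx.y;
            x_tile[threadIdx.y][threadIdx.x]  = (t < T && k < D)    ? X[t * D + k]     : 0.0f;
            wz_tile[threadIdx.y][threadIdx.x] = (hrow < H && k < D) ? Wz[hrow * D + k] : 0.0f;
            wh_tile[threadIdx.y][threadIdx.x] = (hrow < H && k < D) ? Wh[hrow * D + k] : 0.0f;
            __syncthreads();

            for (int kk = 0; kk < TILE; ++kk) {
                acc_z += x_tile[threadIdx.y][kk] * wz_tile[threadIdx.x][kk];
                acc_h += x_tile[threadIdx.y][kk] * wh_tile[threadIdx.x][kk];
            }
            __syncthreads();
        }

        if (t < T && h < H) {
            float z = 1.0f / (1.0f + expf(-acc_z));  // gate z_t = sigmoid(Wz x_t)
            a[t * H + h] = 1.0f - z;                 // scan coefficient a_t
            b[t * H + h] = z * acc_h;                // scan term b_t = z_t * h~_t
        }
    }

Reusing the staged X tile against both weight matrices is what the fusion buys: the same input traffic serves two matrix products, and the sigmoid plus the a/b transform happen in registers instead of in separate elementwise kernels.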
