Were RNNs all we needed? A GPU programming perspective
A CUDA implementation of parallelizable GRUs and LSTMs, built for CS179.
The paper's core claim is that a minor simplification to LSTMs and GRUs — making each gate depend only on the current input rather than on the previous hidden state — turns their recurrence into a linear recurrence, which is exactly the form the parallel scan algorithm can solve. My goal was to verify this claim by building both simplified models (minGRU and minLSTM) along with a custom CUDA implementation of the parallel scan, and measuring how much speedup was actually achievable.

My first major optimization was to fuse the gate computations for all time steps into a single large kernel (min_gru_extract_scan_params_kernel) that uses shared-memory tiling to manage weights and inputs efficiently.
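To make the reformulation concrete, here is a minimal NumPy sketch (my own illustration, not the repo's CUDA code; weight names like W_z and W_h are assumptions) showing why the minGRU recurrence h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t admits a scan: since z_t and h̃_t depend only on x_t, the gates for every time step can be precomputed at once, leaving a linear recurrence h_t = a_t · h_{t−1} + b_t whose coefficients compose associatively.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def min_gru_sequential(x, W_z, W_h, h0):
    """Reference: the minGRU recurrence, one time step at a time."""
    h, hs = h0, []
    for t in range(x.shape[0]):
        z = sigmoid(x[t] @ W_z)        # update gate: depends on x_t only
        h_tilde = x[t] @ W_h           # candidate state: depends on x_t only
        h = (1 - z) * h + z * h_tilde  # h_t = (1 - z_t) * h_{t-1} + z_t * h~_t
        hs.append(h)
    return np.stack(hs)

def min_gru_scan(x, W_z, W_h, h0):
    """Same recurrence, rewritten as a scan over (a_t, b_t) pairs."""
    # The "fused gates" idea: no gate reads h_{t-1}, so all gates for all
    # time steps reduce to two batched matmuls over the whole sequence.
    z = sigmoid(x @ W_z)               # shape (T, H)
    h_tilde = x @ W_h                  # shape (T, H)
    a, b = 1 - z, z * h_tilde          # h_t = a_t * h_{t-1} + b_t
    # Combine rule: applying (a1, b1) then (a2, b2) maps
    # h -> a2*(a1*h + b1) + b2 = (a1*a2)*h + (a2*b1 + b2), which is
    # associative -- the property the parallel scan exploits. The loop
    # below is the sequential stand-in for what the GPU does with a
    # work-efficient (Blelchoch-style) scan.
    A, B = np.empty_like(a), np.empty_like(b)
    A[0], B[0] = a[0], b[0]
    for t in range(1, a.shape[0]):
        A[t] = A[t - 1] * a[t]
        B[t] = B[t - 1] * a[t] + b[t]
    return A * h0 + B                  # h_t = A_t * h_0 + B_t
```

Both functions compute identical outputs; the payoff of the second form is that the per-step work becomes an associative combine, so a GPU can evaluate all T steps in O(log T) parallel depth instead of T sequential ones.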