Were RNNs all we needed? A GPU programming perspective
A CUDA implementation of parallelizable GRUs and LSTMs, built for CS179.
The paper's core claim is that a minor simplification to LSTMs and GRUs — making each gate depend only on the current input rather than on the previous hidden state — turns their recurrence into a linear recurrence, which is exactly the form the parallel scan algorithm can solve. My goal was to verify this claim by building both simplified models (minGRU and minLSTM) along with a custom CUDA implementation of the parallel scan, and measuring how much speedup was actually achievable.

My first major optimization was to fuse the gate computations for all time steps into a single large kernel (min_gru_extract_scan_params_kernel) that uses shared-memory tiling to manage weights and inputs efficiently.
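To make the reformulation concrete, here is a minimal NumPy sketch (my own illustration, not the repo's CUDA code; weight names like W_z and W_h are assumptions) showing why the minGRU recurrence h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t admits a scan: since z_t and h̃_t depend only on x_t, the gates for every time step can be precomputed at once, leaving a linear recurrence h_t = a_t · h_{t−1} + b_t whose coefficients compose associatively.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def min_gru_sequential(x, W_z, W_h, h0):
    """Reference: the minGRU recurrence, one time step at a time."""
    h, hs = h0, []
    for t in range(x.shape[0]):
        z = sigmoid(x[t] @ W_z)        # update gate: depends on x_t only
        h_tilde = x[t] @ W_h           # candidate state: depends on x_t only
        h = (1 - z) * h + z * h_tilde  # h_t = (1 - z_t) * h_{t-1} + z_t * h~_t
        hs.append(h)
    return np.stack(hs)

def min_gru_scan(x, W_z, W_h, h0):
    """Same recurrence, rewritten as a scan over (a_t, b_t) pairs."""
    # The "fused gates" idea: no gate reads h_{t-1}, so all gates for all
    # time steps reduce to two batched matmuls over the whole sequence.
    z = sigmoid(x @ W_z)               # shape (T, H)
    h_tilde = x @ W_h                  # shape (T, H)
    a, b = 1 - z, z * h_tilde          # h_t = a_t * h_{t-1} + b_t
    # Combine rule: applying (a1, b1) then (a2, b2) maps
    # h -> a2*(a1*h + b1) + b2 = (a1*a2)*h + (a2*b1 + b2), which is
    # associative -- the property the parallel scan exploits. The loop
    # below is the sequential stand-in for what the GPU does with a
    # work-efficient (Blelchoch-style) scan.
    A, B = np.empty_like(a), np.empty_like(b)
    A[0], B[0] = a[0], b[0]
    for t in range(1, a.shape[0]):
        A[t] = A[t - 1] * a[t]
        B[t] = B[t - 1] * a[t] + b[t]
    return A * h0 + B                  # h_t = A_t * h_0 + B_t
```

Both functions compute identical outputs; the payoff of the second form is that the per-step work becomes an associative combine, so a GPU can evaluate all T steps in O(log T) parallel depth instead of T sequential ones.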