Symmetric Power Transformers
A linear transformer that learns like a regular transformer with a state that fits on a GPU.
This allowed us to validate the learning ability of the architecture without writing custom CUDA kernels (which an efficient implementation of chunked symmetric power transformers requires); a sketch of this kernel-free setup appears at the end of this section.

We would like to thank Warfa Jibril, Jono Ridgway, Saurabh Kumar, Justin Dieter, Fabrice Normandin, and Imanol Schlag for their feedback on an earlier draft of this post, and Txus Bach for correcting the state size calculations.
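For concreteness, here is a minimal sketch of the kind of kernel-free validation described above: power attention with an even exponent p, computed in its quadratic "attention form" using only stock PyTorch operations. The function name, shapes, and normalization details are illustrative assumptions, not the authors' implementation.

```python
import torch

def power_attention(q, k, v, p=2):
    """Quadratic-time "attention form" of power attention (a sketch).

    q, k, v: tensors of shape (batch, seq, dim). With an even exponent p,
    the scores (q_i . k_j)^p are nonnegative, so they can be normalized
    like attention weights with no custom CUDA kernels. Hypothetical
    illustration, not the authors' implementation.
    """
    n = q.size(1)
    scores = torch.einsum("bid,bjd->bij", q, k) ** p           # (q_i . k_j)^p
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~causal, 0.0)                  # causal masking
    norm = scores.sum(dim=-1, keepdim=True).clamp_min(1e-6)    # row-normalize
    return (scores / norm) @ v                                 # weighted sum of values
```

Because (q . k)^p equals an inner product of fixed feature maps when p is even, the same model can also be run as a linear transformer with a fixed-size recurrent state; the quadratic form above is only a convenient way to validate learning behavior at modest sequence lengths.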