Show HN: Mutable.ai Codebase chat that uses a Wiki for RAG
Dive into high-quality, AI-generated documentation for your repositories, powered by Mutable.ai Auto Wiki.
The Flash Attention implementation leverages techniques such as split-KV, a paged KV cache, and rotary positional embeddings to achieve high efficiency on modern CUDA-enabled GPUs, particularly the Ampere and Hopper architectures.

The flash-attention library also provides custom implementations of several activation functions, including the Gaussian Error Linear Unit (GELU), ReLU, and SwiGLU, along with their corresponding backward passes.

Finally, its utility functions focus on remapping state dictionaries and converting configurations between different pre-trained language model formats, such as BERT, GPT, GPT-NeoX, GPT-J, LLaMA, OPT, and Falcon. The sketches below illustrate each of these areas in turn.
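As a quick orientation to the attention entry point itself, here is a minimal sketch of calling flash_attn_func from the library; the tensor shapes and sizes are illustrative only, and the split-KV and paged-KV-cache paths are exposed through the separate decoding interface flash_attn_with_kvcache rather than this function.

```python
import torch
from flash_attn import flash_attn_func

# Illustrative shapes: (batch, seqlen, nheads, headdim). The kernels require
# fp16/bf16 tensors on a CUDA device (Ampere or newer for best performance).
q = torch.randn(2, 1024, 16, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 1024, 16, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 1024, 16, 64, device="cuda", dtype=torch.float16)

# Causal self-attention computed without materializing the full attention matrix.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # torch.Size([2, 1024, 16, 64])
```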
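For the activation functions, the following plain-PyTorch reference shows the tanh-approximate GELU and a SwiGLU forward together with a hand-derived SwiGLU backward; this is a sketch of the math the fused kernels compute, not the library's CUDA implementation.

```python
import torch
import torch.nn.functional as F

def gelu_tanh(x: torch.Tensor) -> torch.Tensor:
    # Tanh approximation of GELU; the library fuses this and its backward into kernels.
    return 0.5 * x * (1.0 + torch.tanh(0.7978845608028654 * (x + 0.044715 * x.pow(3))))

def swiglu(x: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
    # SwiGLU: SiLU on the gate branch, elementwise product with the value branch.
    return F.silu(gate) * x

def swiglu_backward(dout: torch.Tensor, x: torch.Tensor, gate: torch.Tensor):
    # Hand-derived gradients of silu(gate) * x with respect to both inputs.
    sig = torch.sigmoid(gate)
    dx = dout * gate * sig                                 # d/dx    = silu(gate)
    dgate = dout * x * sig * (1.0 + gate * (1.0 - sig))    # d/dgate = silu'(gate) * x
    return dx, dgate
```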
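The state-dictionary remapping utilities boil down to renaming checkpoint keys from one model family's naming scheme to another. The sketch below shows the general idea with regular-expression key patterns; the specific key names here are hypothetical, since the real converters in the library use model-specific patterns for each architecture.

```python
import re
from collections import OrderedDict

# Hypothetical key patterns for illustration only; real converters map keys
# per architecture (BERT, GPT-NeoX, LLaMA, Falcon, etc.).
KEY_PATTERNS = [
    (r"^transformer\.h\.(\d+)\.attn\.c_attn\.(weight|bias)$",
     r"transformer.layers.\1.mixer.Wqkv.\2"),
    (r"^transformer\.h\.(\d+)\.mlp\.c_fc\.(weight|bias)$",
     r"transformer.layers.\1.mlp.fc1.\2"),
]

def remap_state_dict(state_dict):
    """Rename checkpoint keys from a source naming scheme to a target scheme."""
    remapped = OrderedDict()
    for key, tensor in state_dict.items():
        new_key = key
        for pattern, replacement in KEY_PATTERNS:
            new_key = re.sub(pattern, replacement, new_key)
        remapped[new_key] = tensor
    return remapped
```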
Or read this on Hacker News