Fast LLM Inference From Scratch (using CUDA)
Pushing single-GPU inference throughput to the edge without libraries.

This post is about building an LLM inference engine using C++ and CUDA from scratch, without libraries. Why? In doing so, we can learn about the full stack of LLM inference - which is becoming increasingly important.

That's what we'll focus on: building a program that can load the weights of common open models and do single-batch inference on them on a single CPU + GPU server, and iteratively improving the token throughput until it surpasses llama.cpp.
Almost all widely used open models today are transformers. (There are also state-space models like Mamba, which purport to be more efficient and scalable to long sequences than transformers, but they don't appear to have found much success outside of niches like low-power ML and non-discrete data domains like audio/video.)

For prompt completion and use cases like generating essays, the "decode phase" takes up the majority of execution and involves computing attention between the past context and just a single token (or query timestep).
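To make that decode-phase workload concrete, here is a minimal CPU-side sketch of attention for a single query timestep against a cached context. It is an illustrative assumption, not code from the post: the function name `attend_one_token`, the flat `[n_ctx][head_dim]` cache layout, and the parameter names are all hypothetical.

```cpp
// Sketch of single-query ("decode phase") attention over a KV cache.
// All names (attend_one_token, k_cache, head_dim, ...) are illustrative.
#include <algorithm>
#include <cmath>
#include <vector>

// Attends one query vector `q` (length head_dim) to `n_ctx` cached key/value
// vectors. k_cache and v_cache are row-major [n_ctx][head_dim]. Assumes n_ctx >= 1.
std::vector<float> attend_one_token(const std::vector<float>& q,
                                    const std::vector<float>& k_cache,
                                    const std::vector<float>& v_cache,
                                    int n_ctx, int head_dim) {
  // 1. Score each cached key against the single query, scaled by 1/sqrt(head_dim).
  std::vector<float> scores(n_ctx);
  for (int t = 0; t < n_ctx; ++t) {
    float dot = 0.0f;
    for (int i = 0; i < head_dim; ++i) {
      dot += q[i] * k_cache[t * head_dim + i];
    }
    scores[t] = dot / std::sqrt(static_cast<float>(head_dim));
  }
  // 2. Numerically stable softmax over the context dimension.
  float max_score = scores[0];
  for (int t = 1; t < n_ctx; ++t) max_score = std::max(max_score, scores[t]);
  float sum = 0.0f;
  for (int t = 0; t < n_ctx; ++t) {
    scores[t] = std::exp(scores[t] - max_score);
    sum += scores[t];
  }
  // 3. The output for this timestep is the attention-weighted sum of cached values.
  std::vector<float> out(head_dim, 0.0f);
  for (int t = 0; t < n_ctx; ++t) {
    float w = scores[t] / sum;
    for (int i = 0; i < head_dim; ++i) {
      out[i] += w * v_cache[t * head_dim + i];
    }
  }
  return out;
}
```

Note the shape of the work: every generated token reads the entire cached context but performs only a handful of multiply-adds per byte read, which is why single-batch decode tends to be bound by memory bandwidth rather than compute.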