Fast LLM Inference From Scratch (using CUDA)


Pushing single-GPU inference throughput to the edge without libraries.

This post is about building an LLM inference engine using C++ and CUDA from scratch, without libraries. Why? In doing so, we can learn about the full stack of LLM inference, which is becoming increasingly important. That's what we'll focus on: building a program that can load the weights of common open models and do single-batch inference on them on a single CPU + GPU server, and iteratively improving the token throughput until it surpasses llama.cpp.
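
To make that goal concrete, here is a minimal sketch of what a single-batch generation loop looks like: run the prompt tokens through the model to fill the KV cache (prefill), then generate one token per forward pass (decode) while timing tokens per second. Model, forward, and sample are hypothetical placeholders for illustration, not the engine's actual API; all of the interesting work lives inside forward().

// Minimal sketch of the single-batch generation loop (hypothetical names,
// not the post's actual API). Prefill feeds the prompt through the model to
// populate the KV cache; decode then generates one token per forward pass.
#include <chrono>
#include <cstdio>
#include <vector>

struct Model { /* weights, KV cache, ... (mmap'd from a checkpoint in a real engine) */ };

// One transformer forward pass for `token` at position `pos`; returns logits
// over the vocabulary. Dummy body here -- this is where all the work (and the
// CUDA kernels) would go.
std::vector<float> forward(Model&, int token, int pos) {
    (void)token; (void)pos;
    return std::vector<float>(32000, 0.0f);
}

// Picks the next token from the logits (greedy or temperature sampling in a
// real engine; trivially returns token 0 here).
int sample(const std::vector<float>& logits) { (void)logits; return 0; }

int main() {
    Model model;
    std::vector<int> prompt = {1, 15043, 29892};  // placeholder token ids
    int max_new_tokens = 64;

    // Prefill: process prompt tokens one at a time, filling the KV cache.
    int pos = 0;
    int token = prompt[0];
    for (size_t i = 1; i < prompt.size(); ++i) {
        forward(model, token, pos++);
        token = prompt[i];
    }

    // Decode: one forward pass per generated token; this loop dominates runtime.
    auto t0 = std::chrono::steady_clock::now();
    int generated = 0;
    for (; generated < max_new_tokens; ++generated) {
        std::vector<float> logits = forward(model, token, pos++);
        token = sample(logits);
    }
    auto t1 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%.2f tok/s\n", generated / secs);
}

Throughput here is just generated tokens divided by decode time; that is the number we will try to push past llama.cpp.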

Almost all popular open LLMs today are transformer-based. (There are also state-space models like Mamba which purport to be more efficient and scalable to long sequences than transformers, but they don't appear to have found much success outside of niches like low-power ML and non-discrete data domains like audio/video.) For prompt completion and use cases like generating essays, the "decode phase" takes up the majority of execution and involves computing attention between the past context and just a single token (or query timestep).
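
Concretely, single-token (decode-phase) attention for one head is just a query vector scoring against every cached key, a softmax, and a weighted sum of the cached values. Below is a CPU reference sketch with assumed names and a flat row-major KV-cache layout; the scaled dot-product formula is standard, and a loop of this shape is what would later be ported to CUDA kernels.

// Decode-phase attention for one head: one query vector attends over all
// cached keys/values from previous positions. Names and layout are
// assumptions for illustration.
#include <cmath>
#include <cstdio>
#include <vector>

// q:       [head_dim]             query for the current token
// k_cache: [n_ctx][head_dim]      cached keys, flattened row-major
// v_cache: [n_ctx][head_dim]      cached values, flattened row-major
// out:     [head_dim]             attention output for this head
void attend_one_query(const float* q, const float* k_cache, const float* v_cache,
                      float* out, int n_ctx, int head_dim) {
    std::vector<float> scores(n_ctx);
    float scale = 1.0f / std::sqrt((float)head_dim);

    // Scores: dot(q, k_t) / sqrt(head_dim) for every cached timestep t.
    float max_score = -INFINITY;
    for (int t = 0; t < n_ctx; ++t) {
        float s = 0.0f;
        for (int d = 0; d < head_dim; ++d) s += q[d] * k_cache[t * head_dim + d];
        scores[t] = s * scale;
        if (scores[t] > max_score) max_score = scores[t];
    }

    // Softmax over the scores (subtract the max for numerical stability).
    float sum = 0.0f;
    for (int t = 0; t < n_ctx; ++t) {
        scores[t] = std::exp(scores[t] - max_score);
        sum += scores[t];
    }

    // Output: weighted sum of the cached values.
    for (int d = 0; d < head_dim; ++d) out[d] = 0.0f;
    for (int t = 0; t < n_ctx; ++t) {
        float w = scores[t] / sum;
        for (int d = 0; d < head_dim; ++d) out[d] += w * v_cache[t * head_dim + d];
    }
}

int main() {
    const int head_dim = 64, n_ctx = 8;
    std::vector<float> q(head_dim, 0.01f), k(n_ctx * head_dim, 0.02f),
                       v(n_ctx * head_dim, 0.03f), out(head_dim);
    attend_one_query(q.data(), k.data(), v.data(), out.data(), n_ctx, head_dim);
    std::printf("out[0] = %f\n", out[0]);
}

Note that the per-token work grows linearly with the context length, which is part of why the decode phase dominates execution for long prompts and generations.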
