Show HN: Run Qwen3-Next-80B on 8GB GPU at 1tok/2s throughput
GitHub: Mega4alik/ollm
oLLM is a lightweight Python library for large-context LLM inference, built on top of Hugging Face Transformers and PyTorch. It keeps GPU memory use low by:

- Loading layer weights from SSD directly to the GPU, one layer at a time (see the sketch after this list)
- Offloading the KV cache to SSD and loading it back directly to the GPU, with no quantization or PagedAttention
- Offloading layer weights to CPU if needed
- Using FlashAttention-2 with online softmax

Typical use cases:

- Analyze contracts, regulations, and compliance reports in one pass
- Summarize or extract insights from massive patient histories or medical literature
- Process very large log files or threat reports locally
- Analyze historical chats to extract the most common issues/questions users have
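As a rough illustration of the layer-streaming idea, the sketch below keeps only one transformer layer's weights resident on the GPU at a time, pulling each layer's state dict from SSD just before it runs and freeing it afterwards. This is not oLLM's actual code or API; the per-layer file layout and the `make_layer` helper are hypothetical, assumed for the example.

```python
# Minimal sketch of layer-by-layer weight streaming; NOT oLLM's implementation.
# Assumes each layer's weights were saved ahead of time, e.g. with
# torch.save(layer.state_dict(), f"layer_{i}.pt")  (hypothetical layout).
import torch


@torch.no_grad()
def forward_streamed(make_layer, layer_files, hidden, device="cuda"):
    """Run a stack of transformer layers with only one layer's weights on GPU.

    make_layer(i) -> nn.Module  builds layer i's structure (hypothetical helper)
    layer_files[i]              path to layer i's saved state_dict on SSD
    hidden                      activations tensor already on `device`
    """
    for i, path in enumerate(layer_files):
        # Build the layer structure without allocating real weight storage.
        with torch.device("meta"):
            layer = make_layer(i)
        layer = layer.to_empty(device=device)          # uninitialized params on GPU
        state = torch.load(path, map_location=device)  # read this layer's weights from SSD
        layer.load_state_dict(state)
        hidden = layer(hidden)                         # compute with only this layer resident
        del layer, state                               # drop the weights before the next layer
        torch.cuda.empty_cache()
    return hidden
```

oLLM wires this kind of streaming, together with SSD-backed KV cache offload, into its own inference loop; see the GitHub repo above for the actual API.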