Show HN: Run Qwen3-Next-80B on 8GB GPU at 1tok/2s throughput
GitHub: Mega4alik/ollm
oLLM is a lightweight Python library for large-context LLM inference, built on top of Hugging Face Transformers and PyTorch. It keeps GPU memory use low by:

- Loading layer weights from SSD directly to the GPU, one layer at a time (see the sketch after this list)
- Offloading the KV cache to SSD and loading it back directly to the GPU, with no quantization or PagedAttention
- Offloading layer weights to CPU if needed
- Using FlashAttention-2 with online softmax

Typical use cases:

- Analyze contracts, regulations, and compliance reports in one pass
- Summarize or extract insights from massive patient histories or medical literature
- Process very large log files or threat reports locally
- Analyze historical chats to extract the most common issues/questions users have
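As a rough illustration of the layer-streaming idea, the sketch below keeps only one transformer layer's weights resident on the GPU at a time, pulling each layer's state dict from SSD just before it runs and freeing it afterwards. This is not oLLM's actual code or API; the per-layer file layout and the `make_layer` helper are hypothetical, assumed for the example.

```python
# Minimal sketch of layer-by-layer weight streaming; NOT oLLM's implementation.
# Assumes each layer's weights were saved ahead of time, e.g. with
# torch.save(layer.state_dict(), f"layer_{i}.pt")  (hypothetical layout).
import torch


@torch.no_grad()
def forward_streamed(make_layer, layer_files, hidden, device="cuda"):
    """Run a stack of transformer layers with only one layer's weights on GPU.

    make_layer(i) -> nn.Module  builds layer i's structure (hypothetical helper)
    layer_files[i]              path to layer i's saved state_dict on SSD
    hidden                      activations tensor already on `device`
    """
    for i, path in enumerate(layer_files):
        # Build the layer structure without allocating real weight storage.
        with torch.device("meta"):
            layer = make_layer(i)
        layer = layer.to_empty(device=device)          # uninitialized params on GPU
        state = torch.load(path, map_location=device)  # read this layer's weights from SSD
        layer.load_state_dict(state)
        hidden = layer(hidden)                         # compute with only this layer resident
        del layer, state                               # drop the weights before the next layer
        torch.cuda.empty_cache()
    return hidden
```

oLLM wires this kind of streaming, together with SSD-backed KV cache offload, into its own inference loop; see the GitHub repo above for the actual API.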