Show HN: Run Qwen3-Next-80B on 8GB GPU at 1tok/2s throughput


GitHub repository: Mega4alik/ollm

oLLM is a lightweight Python library for large-context LLM inference, built on top of Hugging Face Transformers and PyTorch. It works by:

- Loading layer weights from SSD directly to the GPU, one layer at a time (a minimal sketch of the idea follows below)
- Offloading the KV cache to SSD and loading it back directly to the GPU, with no quantization or PagedAttention
- Offloading layer weights to the CPU if needed
- Using FlashAttention-2 with online softmax

Typical use cases:

- Analyze contracts, regulations, and compliance reports in one pass
- Summarize or extract insights from massive patient histories or medical literature
- Process very large log files or threat reports locally
- Analyze historical chats to extract the most common issues/questions users have
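To make the layer-streaming idea concrete, here is a minimal sketch in plain PyTorch, not oLLM's actual code or API: per-layer weight files (stand-ins for model shards on SSD) are loaded onto the compute device one at a time, used, and freed, so GPU memory holds only the activations plus a single layer's weights at any moment. The file names, layer shapes, and block structure are invented for illustration.

```python
# Illustrative sketch of layer-by-layer weight streaming (not oLLM's code).
import torch
import torch.nn.functional as F


def save_toy_layers(n_layers: int, hidden: int) -> list[str]:
    """Write per-layer weight files to disk (stand-in for model shards on SSD)."""
    paths = []
    for i in range(n_layers):
        path = f"layer_{i}.pt"
        torch.save(
            {
                "w_up": torch.randn(4 * hidden, hidden) * 0.02,
                "w_down": torch.randn(hidden, 4 * hidden) * 0.02,
            },
            path,
        )
        paths.append(path)
    return paths


@torch.no_grad()
def streamed_forward(x: torch.Tensor, layer_paths: list[str]) -> torch.Tensor:
    """Run a stack of feed-forward blocks, loading each layer's weights from
    disk directly onto the compute device and freeing them afterwards."""
    for path in layer_paths:
        weights = torch.load(path, map_location=x.device)  # SSD -> GPU
        h = F.gelu(F.linear(x, weights["w_up"]))
        x = x + F.linear(h, weights["w_down"])              # residual block
        del weights                                         # free this layer's weights
        if x.device.type == "cuda":
            torch.cuda.empty_cache()
    return x


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    hidden, n_layers = 256, 4
    paths = save_toy_layers(n_layers, hidden)
    x = torch.randn(1, 8, hidden, device=device)
    out = streamed_forward(x, paths)
    print(out.shape)  # torch.Size([1, 8, 256])
```

The trade-off this sketch illustrates is the one behind the "1 tok/2 s" figure in the title: SSD read bandwidth, not GPU compute, becomes the bottleneck when weights are streamed from disk on every forward pass.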

