
Show HN: KVSplit – Run 2-3x longer contexts on Apple Silicon


Run larger LLMs with longer contexts on Apple Silicon by using differentiated precision for KV cache quantization. KVSplit enables 8-bit keys & 4-bit values, reducing memory by 59% with <1% ...

Run larger context windows and heavier LLMs on your Mac by applying different quantization precision to keys and values in the attention mechanism's KV cache.

| Flag | Description | Recommendation |
|------|-------------|----------------|
| `-t 8` | Number of threads | 8 is optimal for most Apple Silicon chips |
| `--flash-attn` | Enables optimized attention | Recommended for Apple Silicon |
| `--kvq N` | Sets both key and value bits to N | Use `--kvq 8` for the K8V4 configuration |
| `--kvq-key N` | Sets key bits only | Key precision has a major quality impact |
| `--kvq-val N` | Sets value bits only | Value precision has a minor quality impact |
| `-c N` | Context size in tokens | Longer contexts benefit more from KVSplit |
| `-n N` | Number of tokens to generate | Adjust based on your needs |
| `-f FILE` | Input file | For processing documents |
| `-m MODEL` | Model path | Path to your `.gguf` model file |

For comprehensive performance analysis, use the full benchmark suite. Results are saved in CSV/JSON formats with automatic summary statistics, and the visualization script generates publication-quality plots of the key insights.
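The ~59% memory-reduction figure can be sanity-checked with a back-of-the-envelope model. This is my own sketch, not KVSplit's actual accounting: the per-block FP16 scale overhead for quantized tensors and the 7B-class model shape (32 layers, 32 KV heads, head dimension 128) are assumptions.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim,
                   key_bits, val_bits, block=32):
    """Approximate KV cache size in bytes for given key/value bit widths."""
    # One key element and one value element per token, layer, head, and dim.
    elems = n_tokens * n_layers * n_kv_heads * head_dim

    def tensor_bytes(bits):
        data = elems * bits // 8
        # Assumed layout: quantized tensors carry one FP16 scale per
        # `block` elements; FP16 tensors need no scales.
        scales = 0 if bits == 16 else (elems // block) * 2
        return data + scales

    return tensor_bytes(key_bits) + tensor_bytes(val_bits)

# 7B-class shape (assumed), 8192-token context
fp16 = kv_cache_bytes(8192, 32, 32, 128, 16, 16)
k8v4 = kv_cache_bytes(8192, 32, 32, 128, 8, 4)
print(f"FP16: {fp16 / 2**30:.2f} GiB")
print(f"K8V4: {k8v4 / 2**30:.2f} GiB")
print(f"Reduction: {1 - k8v4 / fp16:.0%}")
```

Under these assumptions, K8V4 comes out to about 1.6 GiB versus 4 GiB for FP16 at an 8K context, roughly a 59% reduction, which is consistent with the headline number.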


