Show HN: KVSplit – Run 2-3x longer contexts on Apple Silicon
Run larger LLMs with longer contexts on Apple Silicon by using differentiated precision for KV cache quantization. KVSplit enables 8-bit keys & 4-bit values, reducing memory by 59% with <1% ...
Run larger context windows and heavier LLMs on your Mac by applying different quantization precision to keys vs values in the attention mechanism's KV cache.

| Flag | Description | Recommendation |
|---|---|---|
| -t 8 | Number of threads | 8 is optimal for most Apple Silicon chips |
| --flash-attn | Enables optimized attention | Recommended for Apple Silicon |
| --kvq N | Sets both key and value bits to N | Use --kvq 8 for K8V4 configuration |
| --kvq-key N | Sets key bits only | Key precision has major quality impact |
| --kvq-val N | Sets value bits only | Value precision has minor quality impact |
| -c N | Context size in tokens | Longer contexts benefit more from KVSplit |
| -n N | Number of tokens to generate | Adjust based on your needs |
| -f FILE | Input file | For processing documents |
| -m MODEL | Model path | Path to your .gguf model file |

For comprehensive performance analysis, use our full benchmark suite. Results are saved in CSV/JSON formats with automatic summary statistics, and the visualization script generates publication-quality plots showing key insights.
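To see where a figure like the 59% memory reduction can come from, here is a rough back-of-the-envelope sketch in Python. It is not taken from KVSplit itself. It assumes the 8-bit and 4-bit caches use block formats like llama.cpp's q8_0 and q4_0, which store one FP16 scale per 32 elements (about 8.5 and 4.5 bits per element), and that the baseline is an FP16 cache. The model dimensions are hypothetical and chosen only for illustration.

```python
# Effective bits per cached element (assumption: llama.cpp-style block formats,
# 32 elements per block plus one FP16 scale per block).
FP16_BITS = 16.0
Q8_0_BITS = 8.5   # (32 * 8 + 16) / 32
Q4_0_BITS = 4.5   # (32 * 4 + 16) / 32


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, key_bits, val_bits):
    """Estimate KV cache size in bytes for one sequence."""
    # One key vector and one value vector per layer, per KV head, per token.
    elems = n_layers * n_kv_heads * head_dim * n_ctx
    return elems * (key_bits + val_bits) / 8


def savings_vs_fp16(key_bits, val_bits):
    """Fraction of memory saved relative to an FP16 key/value cache."""
    return 1.0 - (key_bits + val_bits) / (2 * FP16_BITS)


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    dims = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=8192)
    fp16 = kv_cache_bytes(**dims, key_bits=FP16_BITS, val_bits=FP16_BITS)
    k8v4 = kv_cache_bytes(**dims, key_bits=Q8_0_BITS, val_bits=Q4_0_BITS)
    print(f"FP16 cache: {fp16 / 2**20:.1f} MiB")   # 1024.0 MiB
    print(f"K8V4 cache: {k8v4 / 2**20:.1f} MiB")   # 416.0 MiB
    print(f"Savings:    {savings_vs_fp16(Q8_0_BITS, Q4_0_BITS):.1%}")  # 59.4%
```

Under these assumptions the ratio depends only on the bit widths, not on the model shape, so the same percentage applies at any context length. The absolute number of bytes saved grows linearly with the context size.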