Bringing K/V context quantisation to Ollama
K/V context cache quantisation has been added to Ollama. This significantly reduces VRAM usage, allowing users to run expanded context sizes, or larger models at their existing context sizes.
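As a concrete way to try it, the sketch below launches an Ollama server with the cache quantised to Q8_0 via the OLLAMA_FLASH_ATTENTION and OLLAMA_KV_CACHE_TYPE environment variables. In practice you would normally export these in your shell or service manager rather than from Go, and the exact variables accepted can vary between Ollama releases, so treat this as illustrative rather than definitive.

```go
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	// Launch `ollama serve` with flash attention enabled and the K/V cache quantised to Q8_0.
	// Roughly equivalent to: OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
	cmd := exec.Command("ollama", "serve")
	cmd.Env = append(os.Environ(),
		"OLLAMA_FLASH_ATTENTION=1",  // K/V cache quantisation builds on flash attention
		"OLLAMA_KV_CACHE_TYPE=q8_0", // accepted values: f16 (default), q8_0, q4_0
	)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}
```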
My PR integrated that functionality into Ollama, which involved not just supporting the required configuration, but also implementing memory estimations for layer placement, handling errors and edge conditions, ensuring compatibility with the existing codebase, and a lot of testing. Larger context sizes allow for more nuanced and relevant output.

A few terms worth defining:

Quantisation: A technique for reducing the precision of numerical values, resulting in smaller data sizes.

Q8_0 & Q4_0: Different levels of quantisation. Q8_0 halves the VRAM usage of the context, and Q4_0 reduces it to roughly one third, compared to F16 (unquantised); a worked estimate follows below.

llama.cpp: The primary underlying inference engine used by Ollama.

Flash Attention: An efficient, exact attention algorithm that reduces the memory overhead of attention by avoiding materialising the full attention matrix.

ROCm: The AMD Radeon Open Compute platform, an open-source platform for GPU computing.

CUDA: A parallel computing platform and application programming interface model created by Nvidia.

Metal: A low-level, low-overhead hardware-accelerated graphics and compute application programming interface developed by Apple.

Originally the PR included several features that did not make it into the final version, as Ollama wanted to minimise the configuration exposed to users and avoid introducing API changes.
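To make the Q8_0 and Q4_0 figures concrete, here is a rough back-of-the-envelope estimate. It assumes llama.cpp's block storage sizes (F16: 2 bytes per element; Q8_0: 34 bytes per 32-element block; Q4_0: 18 bytes per block) and an entirely hypothetical 32-layer model with 8 K/V heads of dimension 128 at a 32K context; it is not Ollama's actual estimation code, just a sketch of the arithmetic.

```go
package main

import "fmt"

// kvCacheBytes estimates the K/V cache size in bytes, given the number of
// bytes llama.cpp uses to store each 32-element block for the chosen cache type.
func kvCacheBytes(layers, kvHeads, headDim, ctx int, bytesPerBlock float64) float64 {
	// One K tensor and one V tensor per layer, each kvHeads*headDim values per token.
	elems := 2.0 * float64(layers) * float64(kvHeads) * float64(headDim) * float64(ctx)
	return elems / 32 * bytesPerBlock // values are stored in blocks of 32 elements
}

func main() {
	// Hypothetical model shape and context length, chosen only for illustration.
	const layers, kvHeads, headDim, ctx = 32, 8, 128, 32768

	types := []struct {
		name          string
		bytesPerBlock float64
	}{
		{"f16", 64},  // 32 elements x 2 bytes, unquantised
		{"q8_0", 34}, // 32 x int8 + one fp16 scale per block
		{"q4_0", 18}, // 32 x 4-bit + one fp16 scale per block
	}

	for _, t := range types {
		gib := kvCacheBytes(layers, kvHeads, headDim, ctx, t.bytesPerBlock) / (1 << 30)
		fmt.Printf("%-5s ~%.2f GiB\n", t.name, gib)
	}
}
```

For this configuration the cache drops from about 4 GiB at F16 to roughly 2.1 GiB at Q8_0 and about 1.1 GiB at Q4_0, in line with the halving and roughly-one-third figures above.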