Making my local LLM voice assistant faster and more scalable with RAG
If you read my previous blog post, you probably already know that I like my smart home open-source and very local, and that certainly includes any voice assistant I may have. If you watched the video demo, you have probably also found out that it’s… slow. Trust me, I did too. Prefix caching helps, but it feels like cheating. Sure, it’ll look amazing in a demo, but as soon as I start using my LLM for other things (which I do, quite often), that cache is going to get evicted and that first prompt is still going to be slow.
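For anyone unfamiliar: prefix caching keeps the KV cache for a shared prompt prefix (in my case, the long system prompt describing the house) so it doesn't get recomputed on every request. As a minimal sketch of what that looks like, assuming a vLLM backend purely for illustration (the model name is also just an example):

```python
# Minimal sketch: prefix caching with vLLM (an assumed backend, chosen
# for illustration). With enable_prefix_caching=True, vLLM reuses the
# KV cache for a shared prompt prefix, so a long system prompt is only
# computed on the first request... until something evicts it.
from vllm import LLM, SamplingParams

# Model name is illustrative, not necessarily what I run.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", enable_prefix_caching=True)

system_prompt = "You are a smart home assistant. Devices: ..."  # long, shared prefix
params = SamplingParams(max_tokens=128)

# The first call pays for the full prefix; later calls sharing it start
# fast, but unrelated traffic in between can evict the cached prefix.
out = llm.generate([system_prompt + "\nUser: turn off the kitchen lights"], params)
print(out[0].outputs[0].text)
```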
After some more calculations in front of my breaker, I decided that if I use a specific outlet in the kitchen and set a low power limit (260W), I can safely run dual RTX 3090s.

But hardware only gets you so far. When's the last time you asked your voice assistant to summarize your entire house, or to take action on every single device across multiple rooms? Almost never, which means there's no reason to stuff every entity into the prompt: I use RAG to retrieve only the devices relevant to the current request. I also dynamically generate examples for in-context learning where necessary, especially in places where I found LLMs tend to confuse service names.
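Here's a rough sketch of what that retrieval step can look like. I'm assuming sentence-transformers and a brute-force cosine-similarity search; the embedding model, the entity list, and top_k are all illustrative, not necessarily the exact code I run:

```python
# Hypothetical sketch: retrieve only the Home Assistant entities relevant
# to a request, instead of putting every device in the prompt.
# Model choice and entity list are illustrative.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

entities = [
    "light.kitchen_ceiling",
    "light.bedroom_lamp",
    "climate.living_room_thermostat",
    "media_player.office_speaker",
]

# Embed the entity names once; re-embed only when devices change.
entity_vecs = model.encode(entities, normalize_embeddings=True)

def relevant_entities(request: str, top_k: int = 2) -> list[str]:
    """Return the top_k entities most similar to the spoken request."""
    query_vec = model.encode([request], normalize_embeddings=True)[0]
    scores = entity_vecs @ query_vec  # cosine similarity (vectors are normalized)
    best = np.argsort(scores)[::-1][:top_k]
    return [entities[i] for i in best]

print(relevant_entities("turn off the kitchen lights"))
# -> ['light.kitchen_ceiling', ...] and only these go into the prompt
```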
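The dynamic example generation can be as simple as a lookup table keyed by entity domain: for each retrieved entity, emit a worked example that pins the correct service name to its domain, since that's exactly where models slip. The templates below use real Home Assistant services but are a simplified illustration, not my actual prompt:

```python
# Hypothetical sketch: generate in-context examples per retrieved entity,
# pairing each domain with its correct service name so the LLM doesn't
# guess (e.g. calling light.turn_on on a climate entity).
EXAMPLE_TEMPLATES = {
    "light": ("turn on the {name}",
              '{{"service": "light.turn_on", "entity_id": "{entity}"}}'),
    "climate": ("set the {name} to 21 degrees",
                '{{"service": "climate.set_temperature", "entity_id": "{entity}", "temperature": 21}}'),
    "media_player": ("pause the {name}",
                     '{{"service": "media_player.media_pause", "entity_id": "{entity}"}}'),
}

def icl_examples(entity_ids: list[str]) -> str:
    """Build few-shot examples only for the entity domains actually in play."""
    lines = []
    for entity in entity_ids:
        domain, object_id = entity.split(".", 1)
        name = object_id.replace("_", " ")
        if domain in EXAMPLE_TEMPLATES:
            user, call = EXAMPLE_TEMPLATES[domain]
            lines.append(f"User: {user.format(name=name)}")
            lines.append(f"Assistant: {call.format(entity=entity)}")
    return "\n".join(lines)

print(icl_examples(["light.kitchen_ceiling", "climate.living_room_thermostat"]))
```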