The Sweet Spot: Maximizing Llama Energy Efficiency
tl;dr: limit your GPUs to about 2/3rds of their maximum power draw for the fewest Joules consumed per token generated, without a speed penalty.

Why run an LLM yourself in the first place?

The llama.cpp software suite is a very impressive piece of work.
The ‘sweet spot’ is pretty clear. The graph starts off on the left at about 19 Joules per token, because at that end the baseline cost of the machine simply being on dominates the calculation. If you live in a cold climate you may think of this as co-generation ;) In the other extreme case you might run into thermal throttling, which, because of the way llama.cpp handles the interaction between the CPU and the GPU, is just going to lead to a lot of wasted power. But for me this method helped run some pretty big jobs on an affordable budget, and it just so happens that the optimum efficiency also coincides very closely with peak performance in terms of tokens generated per unit of time.
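If you want to apply the tl;dr on your own machine: on NVIDIA hardware the cap can be set through NVML. Below is a minimal sketch using the pynvml Python bindings (pip install nvidia-ml-py), assuming a single NVIDIA card at index 0. Setting the limit requires root, and `nvidia-smi -pl <watts>` does the same job from the shell.

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU; adjust for your setup

# Query the board's supported power-limit range (values are in milliwatts).
min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)

# Cap the card at roughly 2/3 of its maximum draw, clamped to the valid range.
target_mw = max(min_mw, int(max_mw * 2 / 3))
pynvml.nvmlDeviceSetPowerManagementLimit(handle, target_mw)

print(f"power limit set to {target_mw / 1000:.0f} W (board max {max_mw / 1000:.0f} W)")
pynvml.nvmlShutdown()
```

And rather than taking my 2/3rds figure on faith, you can look for your own sweet spot by sampling the card's power draw while a generation job runs and dividing the energy by the number of tokens produced. In the sketch below, run_generation is a hypothetical stand-in for however you drive llama.cpp (its CLI, its server, a binding), not a real API. Note also that this only counts the GPU's draw; the 19 Joules per token above includes the rest of the machine, which is why a wall-socket meter paints a fuller picture.

```python
import threading
import time

import pynvml

def measure_joules_per_token(run_generation, n_tokens, handle, interval=0.1):
    """Average GPU power over a generation run, converted to Joules per token."""
    samples = []
    done = threading.Event()

    def sampler():
        while not done.is_set():
            # nvmlDeviceGetPowerUsage reports the current draw in milliwatts.
            samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)
            time.sleep(interval)

    thread = threading.Thread(target=sampler)
    start = time.monotonic()
    thread.start()
    run_generation(n_tokens)   # hypothetical: blocks until n_tokens are produced
    done.set()
    thread.join()
    elapsed = time.monotonic() - start

    avg_watts = sum(samples) / max(len(samples), 1)
    return (avg_watts * elapsed) / n_tokens   # W * s = J, divided per token
```

Sweep the power limit from the card's minimum to its maximum in a handful of steps, measure Joules per token at each, and you should be able to reproduce a curve like the one in the graph, with the knee somewhere around 2/3rds of maximum draw.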