Run Llama locally with only PyTorch on CPU
Run and explore Llama models locally with minimal dependencies on CPU - anordin95/run-llama-locally
I was a bit surprised Meta didn't publish an example of how to invoke one of these LLMs with only torch (or some other minimal set of dependencies), though I am of course grateful for and pleased with their contribution of the public weights! Using the CPU, I can pretty comfortably run the 1B model on my M1 MacBook Air with 16 GB of RAM, averaging about 1 token per second. I suspect that the relatively higher memory load on the GPU (the cause of which is unclear to me), combined with a growing sequence length, starts to swamp my system's available memory enough to affect computation speed.
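To make the "only torch" idea concrete, here is a minimal sketch of a CPU-only greedy decoding loop. The `model` and `tokenizer` objects are hypothetical stand-ins for whatever the repository actually builds from the published weights (a module returning per-token logits and a tokenizer with `encode`/`decode`); this illustrates the general approach, not the repository's exact code.

```python
import torch

@torch.inference_mode()  # disable autograd bookkeeping for inference
def generate(model, tokenizer, prompt: str, max_new_tokens: int = 64) -> str:
    """Greedy decoding on CPU with nothing but torch.

    Assumes `model(token_ids)` returns logits of shape
    (batch, seq_len, vocab_size) and that `tokenizer` exposes
    encode()/decode() -- both are assumptions about the repo's API.
    """
    model.eval()
    token_ids = torch.tensor([tokenizer.encode(prompt)], dtype=torch.long)

    for _ in range(max_new_tokens):
        logits = model(token_ids)                                  # (1, T, V)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)    # pick most likely next token
        token_ids = torch.cat([token_ids, next_id], dim=1)         # append and re-run on the longer sequence

    return tokenizer.decode(token_ids[0].tolist())
```

Timing this loop on CPU versus after a `model.to("mps")` would be the natural way to compare against the GPU memory behaviour described above.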