SEQUOIA: Exact Llama2-70B on an RTX4090 with half-second per-token latency
Serving Llama2-70B on an RTX4090 with Sequoia
Apart from offloading, Sequoia provides a hardware-aware solution that adjusts the size and depth of its speculation trees to fit different hardware platforms. By leveraging a large speculation budget, anyone can use an RTX 4090 or another low-cost consumer GPU (e.g., an AMD RX 7900) with Sequoia to host very strong LLMs, such as 70B models, without approximation, broadening the reach of AI-generated content applications. Moreover, as a speculative decoding framework that mitigates the gap in the memory hierarchy, Sequoia adapts to any draft/target model pair and any AI accelerator.
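To make the draft/target mechanism concrete, below is a minimal sketch of plain chain-style speculative decoding with greedy acceptance, the basic building block that Sequoia generalizes to hardware-sized speculation trees. This is an illustrative sketch, not Sequoia's implementation: the checkpoint names, the fixed proposal length `gamma`, and the match-based acceptance rule are assumptions for the example, and tree speculation, offloading, KV caching, and hardware-aware tree sizing are all omitted for clarity.

```python
# Minimal chain speculative decoding sketch (assumptions: greedy decoding,
# placeholder checkpoints, no KV cache). NOT Sequoia's tree-based algorithm.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DRAFT = "meta-llama/Llama-2-7b-hf"    # assumption: any small draft model
TARGET = "meta-llama/Llama-2-70b-hf"  # assumption: the large target model

tok = AutoTokenizer.from_pretrained(DRAFT)
draft = AutoModelForCausalLM.from_pretrained(
    DRAFT, torch_dtype=torch.float16, device_map="auto")
target = AutoModelForCausalLM.from_pretrained(
    TARGET, torch_dtype=torch.float16, device_map="auto")

@torch.no_grad()
def speculative_step(ids: torch.Tensor, gamma: int = 5) -> torch.Tensor:
    """Propose `gamma` greedy draft tokens, verify all of them with one
    target forward pass, keep the matching prefix plus one target token."""
    proposal = ids
    for _ in range(gamma):  # gamma cheap draft forward passes
        logits = draft(proposal).logits[:, -1, :]
        proposal = torch.cat(
            [proposal, logits.argmax(-1, keepdim=True)], dim=-1)
    # One expensive target pass scores every proposed position at once.
    tgt = target(proposal).logits.argmax(-1)  # target's greedy token per position
    n = ids.shape[1]
    accepted = 0
    for i in range(gamma):  # accept drafts while they match the target's choice
        if proposal[0, n + i] == tgt[0, n + i - 1]:
            accepted += 1
        else:
            break
    # Keep the accepted drafts, then append the target's own next token,
    # so the output is identical to what the target alone would produce.
    out = proposal[:, : n + accepted]
    return torch.cat([out, tgt[:, n + accepted - 1 : n + accepted]], dim=-1)

# Example decode loop (8 verification steps):
prompt = tok("The capital of France is", return_tensors="pt").input_ids
prompt = prompt.to(draft.device)
for _ in range(8):
    prompt = speculative_step(prompt)
print(tok.decode(prompt[0]))
```

Because acceptance only keeps draft tokens that match the target model's own greedy choice, the output is exactly what the 70B target alone would generate; this is the sense in which speculative decoding is "exact". Sequoia's contribution is to replace the single draft chain above with a tree of candidate continuations whose size and depth are chosen to match the hardware, so a large speculation budget can hide the cost of offloaded target weights.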