SEQUOIA: Exact Llama2-70B on an RTX4090 with half-second per-token latency
Serving Llama2-70B on an RTX4090 with Sequoia
Apart from offloading, Sequoia provides a hardware-aware solution that adjusts the size and depth of its speculation trees to fit different hardware platforms. By leveraging a large speculation budget, anyone can use an RTX 4090 or another low-cost consumer GPU (e.g., an AMD RX 7900) with Sequoia to host very strong LLMs, such as 70B models, without approximation, broadening the reach of AI-generated content applications. Moreover, as a speculative decoding framework that mitigates the gap in the memory hierarchy, Sequoia adapts to any draft/target model pair and any AI accelerator.
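To make the draft/target mechanism concrete, below is a minimal sketch of plain chain-style speculative decoding with greedy acceptance, the basic building block that Sequoia generalizes to hardware-sized speculation trees. This is an illustrative sketch, not Sequoia's implementation: the checkpoint names, the fixed proposal length `gamma`, and the match-based acceptance rule are assumptions for the example, and tree speculation, offloading, KV caching, and hardware-aware tree sizing are all omitted for clarity.

```python
# Minimal chain speculative decoding sketch (assumptions: greedy decoding,
# placeholder checkpoints, no KV cache). NOT Sequoia's tree-based algorithm.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DRAFT = "meta-llama/Llama-2-7b-hf"    # assumption: any small draft model
TARGET = "meta-llama/Llama-2-70b-hf"  # assumption: the large target model

tok = AutoTokenizer.from_pretrained(DRAFT)
draft = AutoModelForCausalLM.from_pretrained(
    DRAFT, torch_dtype=torch.float16, device_map="auto")
target = AutoModelForCausalLM.from_pretrained(
    TARGET, torch_dtype=torch.float16, device_map="auto")

@torch.no_grad()
def speculative_step(ids: torch.Tensor, gamma: int = 5) -> torch.Tensor:
    """Propose `gamma` greedy draft tokens, verify all of them with one
    target forward pass, keep the matching prefix plus one target token."""
    proposal = ids
    for _ in range(gamma):  # gamma cheap draft forward passes
        logits = draft(proposal).logits[:, -1, :]
        proposal = torch.cat(
            [proposal, logits.argmax(-1, keepdim=True)], dim=-1)
    # One expensive target pass scores every proposed position at once.
    tgt = target(proposal).logits.argmax(-1)  # target's greedy token per position
    n = ids.shape[1]
    accepted = 0
    for i in range(gamma):  # accept drafts while they match the target's choice
        if proposal[0, n + i] == tgt[0, n + i - 1]:
            accepted += 1
        else:
            break
    # Keep the accepted drafts, then append the target's own next token,
    # so the output is identical to what the target alone would produce.
    out = proposal[:, : n + accepted]
    return torch.cat([out, tgt[:, n + accepted - 1 : n + accepted]], dim=-1)

# Example decode loop (8 verification steps):
prompt = tok("The capital of France is", return_tensors="pt").input_ids
prompt = prompt.to(draft.device)
for _ in range(8):
    prompt = speculative_step(prompt)
print(tok.decode(prompt[0]))
```

Because acceptance only keeps draft tokens that match the target model's own greedy choice, the output is exactly what the 70B target alone would generate; this is the sense in which speculative decoding is "exact". Sequoia's contribution is to replace the single draft chain above with a tree of candidate continuations whose size and depth are chosen to match the hardware, so a large speculation budget can hide the cost of offloaded target weights.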