
SEQUOIA: Exact Llama2-70B on an RTX4090 with half-second per-token latency


Serving Llama2-70B on an RTX4090 with Sequoia

Apart from offloading, Sequoia provides a hardware-aware method for adjusting the size and depth of its speculation trees to suit different hardware platforms. By leveraging a large speculation budget, anyone can use an RTX 4090 or another low-cost consumer GPU, such as the AMD RX 7900, to host very strong LLMs like 70B models without approximation, broadening the reach of AI-generated content. Moreover, as a speculative decoding framework that mitigates the gap in the memory hierarchy, Sequoia adapts to any draft/target model pair and any AI accelerator.
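The hardware-aware idea can be sketched without reproducing Sequoia's actual optimizer: model how the target model's verification latency grows with tree size on the given hardware (with offloading, a large fixed weight-transfer cost dominates), model how the expected number of accepted tokens grows with tree depth and width, and search for the tree shape that maximizes tokens per second. The Python toy below does exactly that search; the latency constants, the geometric acceptance model, and names like verify_latency and best_tree are illustrative assumptions, not Sequoia's implementation.

```python
import itertools

# Assumed draft-model cost per speculated tree node (s); would be
# measured on the actual hardware in a real system.
DRAFT_LATENCY = 0.004

def verify_latency(tree_size: int) -> float:
    # Assumed cost model for one target-model forward pass with weights
    # offloaded to CPU RAM: a large fixed PCIe-transfer term plus a
    # small per-token compute term, so verifying many tokens at once
    # amortizes the transfer.
    return 0.45 + 0.002 * tree_size

def expected_accepted(depth: int, branch: int, alpha: float = 0.75) -> float:
    # Toy acceptance model: alpha is the chance a single draft token is
    # accepted; with `branch` siblings per level, the chance that some
    # child survives is 1 - (1 - alpha)**branch. Expected accepted
    # tokens is the expected surviving path length, plus the one bonus
    # token standard speculative decoding always emits.
    p = 1 - (1 - alpha) ** branch
    return sum(p ** d for d in range(1, depth + 1)) + 1

def best_tree(max_size: int = 128):
    # Brute-force search over tree shapes for the highest expected
    # throughput (accepted tokens per second) under the cost model.
    best = None
    for depth, branch in itertools.product(range(1, 17), range(1, 9)):
        size = sum(branch ** d for d in range(1, depth + 1))
        if size > max_size:
            continue
        step_time = size * DRAFT_LATENCY + verify_latency(size)
        throughput = expected_accepted(depth, branch) / step_time
        if best is None or throughput > best[0]:
            best = (throughput, depth, branch, size)
    return best

if __name__ == "__main__":
    throughput, depth, branch, size = best_tree()
    print(f"depth={depth} branch={branch} size={size} "
          f"-> {throughput:.1f} tokens/s (toy model)")
```

With these made-up constants the search favors a large tree, reflecting the point in the paragraph above: when offloading makes each target pass expensive, a big speculation budget pays for itself by accepting many tokens per pass. On real hardware the constants, and therefore the optimal tree, would differ.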
