Looking Back at Speculative Decoding
December 6, 2024

Yaniv Leviathan, Distinguished Engineer, Matan Kalman, Software Engineer, and Yossi Matias, Vice President & Head, Google Research

Speculative decoding has proven to be an effective technique for faster and cheaper inference from LLMs without compromising quality. It has also proven to be an effective paradigm for a range of optimization techniques.

The approach is inspired by speculative execution, an optimization technique whereby a task is performed before or in parallel with the process of verifying whether it is actually needed, resulting in increased concurrency. Note that in the special case of greedy decoding, where we always sample the single most probable token, speculative execution can be applied effectively to LLM inference, as was shown in a precursor to our work.
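To make the greedy special case concrete, here is a minimal sketch of one speculative step: a small draft model guesses a few tokens cheaply, and the large target model verifies them. The names `draft_next`, `target_next`, `gamma`, and `speculative_greedy_step` are hypothetical, chosen for illustration rather than taken from any published API, and the loop of per-position target calls stands in for what a real system would run as a single batched forward pass.

```python
# A minimal sketch of speculative decoding with greedy verification.
# `draft_next` and `target_next` are hypothetical stand-ins for a small
# drafter and the large target model: each maps a token prefix to that
# model's single most probable next token (greedy decoding).

from typing import Callable, List

Token = int
GreedyModel = Callable[[List[Token]], Token]

def speculative_greedy_step(
    prefix: List[Token],
    draft_next: GreedyModel,
    target_next: GreedyModel,
    gamma: int,
) -> List[Token]:
    """Run one speculative step: draft `gamma` tokens cheaply, then verify.

    Returns the tokens accepted this step (always at least one, so
    decoding makes progress). In a real system the target-model calls
    below would be one batched forward pass over all drafted positions;
    that parallelism is where the speedup comes from.
    """
    # 1) The small model guesses gamma tokens autoregressively.
    drafted: List[Token] = []
    for _ in range(gamma):
        drafted.append(draft_next(prefix + drafted))

    # 2) The target model scores every drafted position (in parallel in
    #    practice). Under greedy decoding, a drafted token is correct
    #    iff it equals the target's own greedy choice at that position.
    accepted: List[Token] = []
    for i, tok in enumerate(drafted):
        target_tok = target_next(prefix + drafted[:i])
        if tok == target_tok:
            accepted.append(tok)  # guess verified: kept essentially for free
        else:
            accepted.append(target_tok)  # first mismatch: take target's token
            return accepted

    # 3) All guesses matched, so we also get the target's next token.
    accepted.append(target_next(prefix + drafted))
    return accepted
```

Because a drafted token is kept only when it matches the target model's own greedy choice, the output is identical to what the target model would have produced decoding on its own; the savings come from verifying several positions in parallel rather than generating them one at a time.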
Acknowledgements

We'd like to extend a huge thank you for reviews, help, insightful discussions, valuable feedback, and support to YaGuang Li, Blake Hechtman, Tao Wang, Toby Boyd, Nathan Lintz, Phil Chen, Nir Shabat, Jayant Madhavan, Aliaksei Severyn, Jakub Adamek, Jonathan Mallinson, Zhifeng Chen, Yoel Drori, Mariano Schain, Charlie Chen, Noam Velan, Nitish Kulkarni, Sidharth Mudgal, Sasha Goldshtein, Nadav Sherman, Pilar Manchon, Fernando Pereira, Eyal Segalis, Eyal Molad, Dani Valevski, Daniel Lumen, Valerie Nygaard, Steve Baker, Srinivasan (Cheenu) Venkatachary, Hema Budaraju, Ziteng Sun, Ananda Theertha Suresh, Elizabeth Hamon Reid, Jeff Dean, Prabhakar Raghavan, James Manyika, and teams in Google Research, Google DeepMind, and Google Search.