High-Fidelity Simultaneous Speech-to-Speech Translation


We introduce Hibiki, a decoder-only model for simultaneous speech translation. Hibiki leverages a multistream language model to synchronously process source and target speech, and jointly produces text and audio tokens to perform speech-to-text and speech-to-speech translation. We furthermore address the fundamental challenge of simultaneous interpretation. Unlike its consecutive counterpart, where one waits for the end of the source utterance before starting to translate, simultaneous interpretation adapts its flow to accumulate just enough context to produce a correct translation in real time, chunk by chunk. To do so, we introduce a weakly-supervised method that leverages the perplexity of an off-the-shelf text translation system to identify optimal delays on a per-word basis and create aligned synthetic data. After supervised training, Hibiki performs adaptive, simultaneous speech translation with vanilla temperature sampling. On a French-English simultaneous speech translation task, Hibiki demonstrates state-of-the-art performance in translation quality, speaker fidelity, and naturalness. Moreover, the simplicity of its inference process makes it compatible with batched translation and even real-time on-device deployment. We provide examples as well as models and inference code.
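To make the per-word delay idea concrete, here is a minimal sketch of one plausible reading of it, not the paper's actual procedure. It assumes a scorer that returns the log-probability of a target word given only a prefix of the source sentence, and it delays each target word until the source word that yields the largest gain in that log-probability has been heard. The names `align_delays` and `toy_log_prob`, the lexicon, and the largest-gain rule are all illustrative assumptions; a real setup would query a text translation model instead of the toy scorer.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical scorer type: log p(target[j] | source_prefix, target[:j]).
LogProbFn = Callable[[Sequence[str], Sequence[str], int], float]


def align_delays(source: Sequence[str], target: Sequence[str],
                 log_prob: LogProbFn) -> List[int]:
    """Return, for each target word, how many source words to wait for.

    Assumed rule: pick the source position whose arrival gives the largest
    gain in the target word's log-probability, then force the delays to be
    non-decreasing so the output order stays causal.
    """
    if not source:
        return [0] * len(target)
    delays: List[int] = []
    for j in range(len(target)):
        # scores[i] is the score of target word j after hearing i source words.
        scores = [log_prob(source[:i], target, j) for i in range(len(source) + 1)]
        gains = [scores[i] - scores[i - 1] for i in range(1, len(source) + 1)]
        best = 1 + max(range(len(gains)), key=gains.__getitem__)
        if delays:
            best = max(best, delays[-1])  # a word cannot precede its predecessor
        delays.append(best)
    return delays


# Toy stand-in for a translation model (assumption): a target word is likely
# only once the source word it depends on is present in the prefix.
LEXICON: Dict[str, str] = {"a": "une", "red": "rouge", "apple": "pomme"}


def toy_log_prob(prefix: Sequence[str], target: Sequence[str], j: int) -> float:
    return -0.1 if LEXICON[target[j]] in prefix else -5.0


source_words = ["une", "pomme", "rouge"]
target_words = ["a", "red", "apple"]
delays = align_delays(source_words, target_words, toy_log_prob)
print(delays)  # [1, 3, 3]
```

In this toy case "red" has to wait for the third French word because of the adjective reordering, and "apple" inherits that delay through the monotonicity constraint. Delays of this kind could then be used to time the target words when building aligned training pairs.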

By Tom Labiausse and 5 other authors.
