
Llama-Mimi: Speech Language Models with Interleaved Semantic and Acoustic Tokens

About

We propose Llama-Mimi, a speech language model that uses a unified tokenizer and a single Transformer decoder to jointly model sequences of interleaved semantic and acoustic tokens. Comprehensive evaluation shows that Llama-Mimi achieves state-of-the-art performance in acoustic consistency and possesses the ability to preserve speaker identity. Our analysis further demonstrates that increasing the number of quantizers improves acoustic fidelity but degrades linguistic performance, highlighting the inherent challenge of maintaining long-term coherence. We additionally introduce an LLM-as-a-Judge-based evaluation to assess the spoken content quality of generated outputs. Our models, code, and speech samples are publicly available.
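The interleaving described above can be sketched as follows. This is a minimal illustration of flattening per-frame semantic and acoustic tokens into a single sequence for a decoder-only Transformer; the exact per-frame layout (one semantic token followed by the acoustic quantizer tokens) is our assumption from the description, not the paper's released code:

```python
# Hedged sketch: build a unified token sequence by interleaving, per audio
# frame, one semantic token with the tokens from each acoustic quantizer.
# Token values and the quantizer count are illustrative only.

def interleave_tokens(semantic, acoustic, num_quantizers):
    """Flatten per-frame tokens into one interleaved sequence.

    semantic: list[int], one semantic token per frame.
    acoustic: list[list[int]], per frame a list of num_quantizers tokens.
    Returns a flat list: [s_0, a_0_1, ..., a_0_Q, s_1, a_1_1, ...].
    """
    seq = []
    for s, frame_acoustic in zip(semantic, acoustic):
        assert len(frame_acoustic) == num_quantizers
        seq.append(s)                 # semantic token leads the frame
        seq.extend(frame_acoustic)    # then the acoustic quantizer tokens
    return seq

# Two frames, two acoustic quantizers:
print(interleave_tokens([1, 2], [[10, 11], [20, 21]], 2))
# -> [1, 10, 11, 2, 20, 21]
```

Note that increasing `num_quantizers` lengthens the sequence linearly per frame, which is consistent with the trade-off noted above: more quantizers improve acoustic fidelity but stretch the context the model must stay coherent over.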

Issa Sugiura, Shuhei Kurita, Yusuke Oda, Ryuichiro Higashinaka • 2025

Related benchmarks

Task                           Dataset              Metric           Result  Rank
Speech Acoustic Understanding  SALMon               Salmon Score     73.6    10
Speech Semantic Understanding  sBLIMP               sBLIMP Score     55.1    10
Speech Semantic Understanding  sWUGGY               sWUGGY Accuracy  68.8    10
Neural Audio Coding            Neural Audio Coding  Frame Rate (Hz)  12.5    7
