Llama-Mimi: Exploring the Limits of Flattened Speech Language Modeling

About

Speech Language Models (SpeechLMs) model tokenized speech to capture both semantic and acoustic information. When neural audio codecs based on Residual Vector Quantization (RVQ) are used as audio tokenizers, they produce multiple discrete tokens per time step, yielding inherently multi-level representations. Prior work typically adopts hierarchical architectures to capture this multi-level structure. In contrast, recent progress in NLP has steadily reduced architectural inductive biases, moving toward simpler, more scalable single-Transformer architectures. In this work, we propose Llama-Mimi, which flattens the multi-level RVQ tokens produced by the Mimi neural audio codec into a single sequence and models them autoregressively with a Transformer decoder. We show that Llama-Mimi outperforms a CSM-based hierarchical model on most tasks and achieves the best performance on acoustic consistency. Our models, code, and speech samples are publicly available.
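
As a rough illustration of the flattening step described above, here is a minimal NumPy sketch that interleaves multi-level RVQ codes into one sequence suitable for a single Transformer decoder. The function name, the frame-major interleaving order, the per-level ID offsets, and the codebook size and level count used in the example are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def flatten_rvq_tokens(codes: np.ndarray, codebook_size: int = 2048) -> np.ndarray:
    """Interleave multi-level RVQ codes into one flat token sequence.

    Args:
        codes: integer array of shape (num_levels, num_frames); one discrete
            token per quantizer level per audio frame.
        codebook_size: vocabulary size of each quantizer level (assumed).

    Returns:
        1-D array of length num_levels * num_frames, ordered frame by frame,
        with each level shifted into its own ID range so one embedding table
        can tell the levels apart.
    """
    num_levels, _ = codes.shape
    # Give level k the ID range [k * codebook_size, (k + 1) * codebook_size).
    offsets = np.arange(num_levels)[:, None] * codebook_size
    shifted = codes + offsets
    # Frame-major interleave: (frame 0, levels 0..L-1), (frame 1, levels 0..L-1), ...
    return shifted.T.reshape(-1)

# Example (hypothetical level count): 8 levels at 12.5 Hz = 100 tokens/sec.
codes = np.random.randint(0, 2048, size=(8, 25))  # 2 seconds of audio
flat = flatten_rvq_tokens(codes)                  # shape (200,)
```

The flat sequence can then be modeled left to right like ordinary text tokens; the trade-off is that sequence length grows linearly with the number of quantizer levels.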

Issa Sugiura, Shuhei Kurita, Yusuke Oda, Ryuichiro Higashinaka • 2025

Related benchmarks

Task                                         | Dataset                  | Metric                          | Result | Rank
---------------------------------------------|--------------------------|---------------------------------|--------|-----
Semantic and linguistic knowledge evaluation | ZeroSpeech               | sBLIMP Score                    | 55.1   | 20
Acoustic and paralinguistic modeling         | SALMon                   | Acoustic Consistency (Sentence) | 79     | 19
Discourse-level coherence evaluation         | Topic Story-Cloze (tSC)  | tSC Score                       | 67.6   | 19
Speech Generation                            | LibriSpeech (test-clean) | Speaker Similarity              | 0.915  | 11
Speech Acoustic Understanding                | SALMon                   | SALMon Score                    | 73.6   | 10
Speech Semantic Understanding                | sBLIMP                   | sBLIMP Score                    | 55.1   | 10
Speech Semantic Understanding                | sWUGGY                   | sWUGGY Accuracy                 | 68.8   | 10
Linguistic Evaluation                        | sWUGGY                   | sWUGGY Score                    | 68.7   | 7
Neural Audio Coding                          | Neural Audio Coding      | Frame Rate (Hz)                 | 12.5   | 7
Linguistic Evaluation                        | sBLIMP                   | sBLIMP Score                    | 54.3   | 7
(Showing 10 of 12 benchmark rows.)
