StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
About
Prevalent semantic speech tokenizers, designed to capture linguistic content, are surprisingly fragile. We find they are not robust to meaning-irrelevant acoustic perturbations; even at high Signal-to-Noise Ratios (SNRs) where speech is perfectly intelligible, their output token sequences can change drastically, increasing the learning burden for downstream LLMs. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal indifferent to intermediate token stability. To address this, we introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism. Its multi-branch architecture processes audio in parallel, and these representations are merged via a powerful bit-wise voting mechanism to form a single, stable token sequence. StableToken sets a new state-of-the-art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly to downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks. Our code and model are publicly available at https://github.com/Tencent/StableToken.
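The consensus mechanism described above can be sketched as a bit-wise majority vote over the codes produced by the parallel branches: for each frame and each bit position, the merged token takes whichever bit value most branches agree on, so a perturbation that flips one branch's code is outvoted. This is a minimal illustrative sketch, not the released implementation; the function name, array shapes, and tie-breaking rule are assumptions.

```python
import numpy as np

def bitwise_majority_vote(branch_codes: np.ndarray, num_bits: int) -> np.ndarray:
    """Merge parallel branch codes into one stable code per frame (sketch).

    branch_codes: int array of shape (num_branches, num_frames); each entry
    is a num_bits-bit quantizer code from one branch. For every bit position,
    the merged code keeps the value that a strict majority of branches emit.
    """
    num_branches, num_frames = branch_codes.shape
    merged = np.zeros(num_frames, dtype=np.int64)
    for bit in range(num_bits):
        # Extract this bit from every branch: shape (num_branches, num_frames).
        bits = (branch_codes >> bit) & 1
        # Strict majority across branches (assumed tie-break: ties go to 0).
        majority = (bits.sum(axis=0) * 2 > num_branches).astype(np.int64)
        merged |= majority << bit
    return merged

# Three branches, two frames; one branch disagrees on some bits but is outvoted.
codes = np.array([[0b101, 0b011],
                  [0b101, 0b010],
                  [0b001, 0b011]])
print(bitwise_majority_vote(codes, num_bits=3))  # → [5 3]
```

With an odd number of branches every bit position has a clear winner, which is one reason a voting merge can yield a single stable token sequence even when individual branches flip under noise.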
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Speech Reconstruction | LibriSpeech clean (test) | WER | 3.84 | 19 |
| Audio Reconstruction | Seed EN | WER | 3.44 | 15 |
| Audio Reconstruction | Seed-ZH | WER | 2.62 | 15 |
| Noise Robustness | FLEURS (test) | Robustness Score (Gaussian Noise) | 12.93 | 15 |
| Audio Reconstruction | LibriSpeech Clean | WER | 3.84 | 11 |
| Audio Reconstruction | LibriSpeech Other | WER | 7.99 | 11 |
| Speech Reconstruction | LibriSpeech other (test) | WER | 7.99 | 4 |