StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
About
Prevalent semantic speech tokenizers, designed to capture linguistic content, are surprisingly fragile. We find they are not robust to meaning-irrelevant acoustic perturbations; even at high Signal-to-Noise Ratios (SNRs) where speech is perfectly intelligible, their output token sequences can change drastically, increasing the learning burden for downstream LLMs. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal indifferent to intermediate token stability. To address this, we introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism. Its multi-branch architecture processes audio in parallel, and these representations are merged via a powerful bit-wise voting mechanism to form a single, stable token sequence. StableToken sets a new state-of-the-art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly to downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks. Our code and model are publicly available at https://github.com/Tencent/StableToken.
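The consensus mechanism described above can be sketched as a bit-wise majority vote over the codes produced by the parallel branches: for each frame and each bit position, the merged token takes whichever bit value most branches agree on, so a perturbation that flips one branch's code is outvoted. This is a minimal illustrative sketch, not the released implementation; the function name, array shapes, and tie-breaking rule are assumptions.

```python
import numpy as np

def bitwise_majority_vote(branch_codes: np.ndarray, num_bits: int) -> np.ndarray:
    """Merge parallel branch codes into one stable code per frame (sketch).

    branch_codes: int array of shape (num_branches, num_frames); each entry
    is a num_bits-bit quantizer code from one branch. For every bit position,
    the merged code keeps the value that a strict majority of branches emit.
    """
    num_branches, num_frames = branch_codes.shape
    merged = np.zeros(num_frames, dtype=np.int64)
    for bit in range(num_bits):
        # Extract this bit from every branch: shape (num_branches, num_frames).
        bits = (branch_codes >> bit) & 1
        # Strict majority across branches (assumed tie-break: ties go to 0).
        majority = (bits.sum(axis=0) * 2 > num_branches).astype(np.int64)
        merged |= majority << bit
    return merged

# Three branches, two frames; one branch disagrees on some bits but is outvoted.
codes = np.array([[0b101, 0b011],
                  [0b101, 0b010],
                  [0b001, 0b011]])
print(bitwise_majority_vote(codes, num_bits=3))  # → [5 3]
```

With an odd number of branches every bit position has a clear winner, which is one reason a voting merge can yield a single stable token sequence even when individual branches flip under noise.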
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Speech Reconstruction | LibriSpeech clean (test) | WER | 3.84 | 19 |
| Audio Reconstruction | Seed EN | WER | 3.44 | 15 |
| Audio Reconstruction | Seed-ZH | WER | 2.62 | 15 |
| Noise Robustness | FLEURS (test) | Robustness Score (Gaussian Noise) | 12.93 | 15 |
| Audio Reconstruction | LibriSpeech Clean | WER | 3.84 | 11 |
| Audio Reconstruction | LibriSpeech Other | WER | 7.99 | 11 |
| Speech Reconstruction | LibriSpeech other (test) | WER | 7.99 | 4 |