HoliTok: A Continuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding
About
Unified speech foundation models require a holistic tokenization space that is both learnable by language models and decodable into high-quality waveforms. Existing speech tokenizers, however, often fail to satisfy these requirements simultaneously, leading to increased architectural complexity and more involved training designs. We propose HoliTok, a continuous Holistic speech Tokenization model designed for unified generation-understanding modeling. HoliTok encodes 48 kHz speech into a compact 25 Hz sequence of 128-dimensional latents. It is trained with a progressive strategy that jointly preserves signal-level fidelity, incorporates semantic information, and maintains strong latent learnability. Based on this tokenization, we build a unified AR+DiT model for speech synthesis and recognition, where the same latent sequence supports both generation-specific and unified generation-understanding tasks. Experiments show that HoliTok achieves competitive reconstruction fidelity, improves generative learnability for high-quality and controllable synthesis, and, among the evaluated representations, is the only one that operates robustly in our unified generation-understanding architecture without additional optimization tricks. These results suggest that HoliTok serves as an effective speech tokenizer and a foundational representation interface for unified spoken language modeling. The code is available at: https://github.com/bovod-sjtu/HoliTok.
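To make the stated rates concrete, the following minimal sketch works out what a 48 kHz input and a 25 Hz, 128-dimensional latent sequence imply for frame size and sequence length. It is not taken from the HoliTok code: the function names are hypothetical, and the frame count assumes simple flooring with no padding, which may differ from the actual encoder.

```python
# Figures stated in the abstract.
SAMPLE_RATE_HZ = 48_000  # input waveform sample rate
FRAME_RATE_HZ = 25       # latent frames per second
LATENT_DIM = 128         # dimensions per latent frame


def samples_per_frame(sample_rate_hz: int = SAMPLE_RATE_HZ,
                      frame_rate_hz: int = FRAME_RATE_HZ) -> int:
    """Number of waveform samples summarized by one latent frame."""
    return sample_rate_hz // frame_rate_hz


def latent_shape(duration_s: float) -> tuple[int, int]:
    """Approximate (frames, dims) of the latent sequence for a clip.

    Assumption: frames = floor(duration * frame rate); the real model's
    padding and boundary handling are not described in the abstract.
    """
    return (int(duration_s * FRAME_RATE_HZ), LATENT_DIM)


def scalar_reduction_factor() -> float:
    """Ratio of waveform samples to latent scalars per second.

    This counts scalar values only; it is not a bitrate comparison.
    """
    return SAMPLE_RATE_HZ / (FRAME_RATE_HZ * LATENT_DIM)


if __name__ == "__main__":
    print(samples_per_frame())        # samples covered by each latent frame
    print(latent_shape(10.0))         # latent shape for a 10-second clip
    print(scalar_reduction_factor())  # samples per latent scalar
```

Under these assumptions, each latent frame covers 1920 waveform samples (40 ms), so a language model sees 25 positions per second of speech rather than 48,000.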
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | WER 12.65 | 1206 |
| Automatic Speech Recognition | AISHELL-1 (test) | -- | 105 |
| Automatic Speech Recognition | LibriSpeech (test-clean) | WER 5.48 | 96 |
| Text-to-Speech | Seed-TTS zh (test) | -- | 87 |
| Text-to-Speech | Seed-TTS-Eval (test) | WER 1.33 | 32 |
| Text-to-Speech | Seed-TTS Seed-EN (test) | WER 0.072 | 32 |
| Text-to-Speech | Seed-TTS hard (test) | WER 7.59 | 7 |
| Text-to-Speech | Seed-TTS-zh Eval (test) | WER 0.98 | 3 |
| Text-to-Speech | Emergent-TTS Emotion (test) | WER (%) 1.34 | 3 |
| Text-to-Speech | Emergent-TTS Paralinguistic (test) | WER (%) 34.47 | 3 |