HoliTok: A Continuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding

About

Unified speech foundation models require a holistic tokenization space that is both learnable by language models and decodable into high-quality waveforms. Existing speech tokenizers, however, often fail to satisfy these requirements simultaneously, leading to increased architectural complexity and more involved training designs. We propose HoliTok, a continuous Holistic speech Tokenization model designed for unified generation-understanding modeling. HoliTok encodes 48 kHz speech into a compact 25 Hz sequence of 128-dimensional latents. It is trained with a progressive strategy that jointly preserves signal-level fidelity, incorporates semantic information, and maintains strong latent learnability. Based on this tokenization, we build a unified AR+DiT model for speech synthesis and recognition, where the same latent sequence supports both generation-specific and unified generation-understanding tasks. Experiments show that HoliTok achieves competitive reconstruction fidelity, improves generative learnability for high-quality and controllable synthesis, and, among the evaluated representations, is the only one that operates robustly in our unified generation-understanding architecture without additional optimization tricks. These results suggest that HoliTok serves as an effective speech tokenizer and a foundational representation interface for unified spoken language modeling. The code is available at: https://github.com/bovod-sjtu/HoliTok.
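
To make the stated rates concrete, the following is a minimal arithmetic sketch using only the three figures given in the abstract (48 kHz input, 25 Hz latent rate, 128 dimensions). The function names are hypothetical and are not taken from the HoliTok repository, and the frame count assumes no padding or edge effects, which the actual encoder may handle differently.

```python
# Illustrative arithmetic only; names are hypothetical, not from the HoliTok codebase.
SAMPLE_RATE_HZ = 48_000   # input audio rate stated in the abstract
FRAME_RATE_HZ = 25        # latent frame rate stated in the abstract
LATENT_DIM = 128          # latent dimensionality stated in the abstract


def samples_per_frame() -> int:
    """Number of waveform samples summarized by one latent frame."""
    return SAMPLE_RATE_HZ // FRAME_RATE_HZ


def latent_shape(duration_s: float) -> tuple[int, int]:
    """Approximate (frames, dims) for a clip, assuming no padding or edge effects."""
    return (int(duration_s * FRAME_RATE_HZ), LATENT_DIM)


print(samples_per_frame())   # 1920
print(latent_shape(10.0))    # (250, 128)
```

Under these assumptions, each latent frame covers 1920 waveform samples (40 ms of audio), and a 10-second clip maps to roughly 250 frames of 128 values each.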

Bohan Li, Shi Lian, Hankun Wang, Yiwei Guo, Yu Xi, Zhihan Li, Da Zheng, Colin Zhang, Kai Yu • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER 12.65 | 1206 |
| Automatic Speech Recognition | AISHELL-1 (test) | -- | 105 |
| Automatic Speech Recognition | LibriSpeech (test-clean) | WER 5.48 | 96 |
| Text-to-Speech | Seed-TTS zh (test) | -- | 87 |
| Text-to-Speech | Seed-TTS-Eval (test) | WER 1.33 | 32 |
| Text-to-Speech | Seed-TTS Seed-EN (test) | WER 0.072 | 32 |
| Text-to-Speech | Seed-TTS hard (test) | WER 7.59 | 7 |
| Text-to-Speech | Seed-TTS-zh Eval (test) | WER 0.98 | 3 |
| Text-to-Speech | Emergent-TTS Emotion (test) | WER (%) 1.34 | 3 |
| Text-to-Speech | Emergent-TTS Paralinguistic (test) | WER (%) 34.47 | 3 |