
Scaling Speech Tokenizers with Diffusion Autoencoders

About

Speech tokenizers are foundational to speech language models, yet existing approaches face two major challenges: (1) balancing the trade-off between encoding semantics for understanding and acoustics for reconstruction, and (2) achieving low bit rates and low token rates. We propose Speech Diffusion Tokenizer (SiTok), a diffusion autoencoder that jointly learns semantically rich representations through supervised learning and enables high-fidelity audio reconstruction with diffusion. We scale SiTok to 1.6B parameters and train it on 2 million hours of speech. Experiments show that SiTok outperforms strong baselines on understanding, reconstruction, and generation tasks, at an extremely low token rate of 12.5 Hz and a bit rate of 200 bits per second.
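The two reported rates pin down the per-token budget. As a quick sanity check (a minimal sketch in Python; the single flat codebook implied below is an inference from the numbers, not something stated above), 200 bits per second at 12.5 tokens per second means each token carries 16 bits:

```python
# Rates reported in the abstract.
token_rate_hz = 12.5   # tokens per second
bit_rate_bps = 200.0   # bits per second

# Bits carried by each discrete token.
bits_per_token = bit_rate_bps / token_rate_hz  # 16.0

# Implied vocabulary size if every token is one code from a single
# flat codebook (an assumption for illustration, not stated above).
implied_codebook_size = 2 ** int(bits_per_token)  # 65536

print(f"bits per token:        {bits_per_token:.1f}")
print(f"implied codebook size: {implied_codebook_size}")

# At these rates, 10 seconds of speech costs 125 tokens / 2000 bits.
seconds = 10.0
print(f"{seconds:.0f}s of speech -> {token_rate_hz * seconds:.0f} tokens, "
      f"{bit_rate_bps * seconds:.0f} bits")
```

The abstract also names the two training signals: a supervised objective that makes the tokens semantically rich, and a diffusion objective that reconstructs audio from them. The skeleton below is a hypothetical PyTorch sketch of that joint setup; every module choice (strided conv encoder, single VQ codebook with a straight-through estimator, additive token conditioning, a linear noise schedule) is an illustrative assumption, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DiffusionTokenizerSketch(nn.Module):
    """Illustrative skeleton only: the module choices below are
    assumptions, not the architecture from the paper."""

    def __init__(self, dim=256, codebook_size=1024, num_classes=100):
        super().__init__()
        # Strided conv encoder: waveform -> low-rate latent frames.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=7, stride=4, padding=3),
            nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=7, stride=4, padding=3),
        )
        self.codebook = nn.Embedding(codebook_size, dim)
        # Supervised head pushing tokens toward semantics
        # (e.g. frame-level phoneme/ASR targets; assumed here).
        self.semantic_head = nn.Linear(dim, num_classes)
        # Denoiser reconstructing audio latents conditioned on tokens.
        self.denoiser = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2,
        )

    def quantize(self, z):
        # z: (B, T, D). Nearest-neighbor lookup into the codebook.
        d = (z.pow(2).sum(-1, keepdim=True)
             - 2 * z @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(-1))
        ids = d.argmin(dim=-1)                       # (B, T) discrete tokens
        q = self.codebook(ids)
        # Straight-through estimator: gradients bypass the argmin.
        return z + (q - z).detach(), ids

    def forward(self, wav, targets, audio_latents):
        # wav: (B, 1, samples); targets: (B, T) class ids per frame;
        # audio_latents: (B, T, D) reconstruction targets (e.g. mel/VAE).
        z = self.encoder(wav).transpose(1, 2)        # (B, T, D)
        q, ids = self.quantize(z)

        # (1) Supervised objective: semantically rich tokens.
        sem_loss = F.cross_entropy(
            self.semantic_head(q).transpose(1, 2), targets)

        # (2) Diffusion-style objective: denoise noisy latents given tokens.
        t = torch.rand(wav.size(0), 1, 1, device=wav.device)
        noise = torch.randn_like(audio_latents)
        noisy = (1 - t) * audio_latents + t * noise  # linear noise schedule
        eps_pred = self.denoiser(noisy + q)          # additive conditioning
        diff_loss = F.mse_loss(eps_pred, noise)

        return sem_loss + diff_loss, ids
```

The point the sketch tries to convey is that a single discrete bottleneck feeds both losses, so the same low-rate tokens must simultaneously serve understanding (the classification head) and reconstruction (the denoiser).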

Yuancheng Wang, Zhenyu Tang, Yun Wang, Arthur Hinsvark, Yingru Liu, Yinghao Li, Kainan Peng, Junyi Ao, Mingbo Ma, Mike Seltzer, Qing He, Xubo Liu • 2026

Related benchmarks

Task                  | Dataset            | Result               | Rank
Text-to-Speech        | Seed-TTS en (test) | WER 2.46             | 90
Speech Reconstruction | Seed-TTS en (test) | WER 0.028            | 18
Speech Understanding  | DASB               | Error Rate (ER) 63.5 | 6
