DSA-Tokenizer: Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion
About
Speech tokenizers are a key building block of fully discrete Speech LLMs. Existing tokenizers either prioritize semantic encoding, fuse semantic content with acoustic style inseparably,or achieve incomplete semantic-acoustic disentanglement. To achieve better disentanglement,we propose DSA-Tokenizer,which explicitly disentangles speech into discrete semantic and acoustic tokens via distinct optimization constraints.Specifically,semantic tokens are supervised by ASR to capture linguistic content,while acoustic tokens focus on mel-spectrograms restoration to encode style.We further introduce a hierarchical Flow Matching decoder and a joint reconstruction-context inpainting training strategy,allowing the model to support both high-fidelity reconstruction and cross-utterance voice clone.To speed up inference,we distill the DiT decoder to reduce sampling steps of inference to 4 and improve synthesis quality with GAN fine-tuning.Experiments demonstrate that DSA-Tokenizer provides strong semantic-acoustic disentanglement,reliable controllable voice cloning,and efficient high-fidelity generation with low WER/CER.Moreover, our results suggest that disentangled tokenization provides a more effective interface for downstream large-model speech generation.Audio samples are avaialble at https://anonymous.4open.science/w/DSA_Tokenizer_demo/.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Speech Reconstruction | Chinese speech | UTMOS2.92 | 19 | |
| Speech Reconstruction | English speech | UTMOS3.67 | 19 | |
| Voice Conversion | LibriTTS (test-clean) | WER23.95 | 11 |