Kanade: A Simple Disentangled Tokenizer for Spoken Language Modeling
About
A good language model starts with a good tokenizer. Tokenization is especially important for speech modeling, which must handle continuous signals that mix linguistic and non-linguistic information. A speech tokenizer should extract phonetics and prosody, suppress linguistically irrelevant information such as speaker identity, and enable high-quality synthesis. We present Kanade, a single-layer disentangled speech tokenizer that realizes this ideal. Kanade factors out utterance-level acoustic constants, such as speaker identity, to produce a single stream of tokens that captures rich phonetics and prosody. It does so without the auxiliary methods that existing disentangled codecs often rely on. Experiments show that Kanade achieves state-of-the-art speaker disentanglement and lexical availability while maintaining excellent reconstruction quality.
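The core idea, factoring a time-invariant "acoustic constant" out of each utterance so the remaining token stream carries only content, can be sketched with a toy scalar example. This is an illustration of the disentanglement principle under simplified assumptions (one-dimensional frames, a per-utterance mean as the constant, a scalar quantizer), not Kanade's actual architecture:

```python
def quantize(x, step=0.5):
    """Toy scalar quantizer: round a residual to the nearest codebook step."""
    return round(x / step)

def tokenize(frames, step=0.5):
    """Split an utterance into an utterance-level constant and a token stream.

    The per-utterance mean stands in for time-invariant attributes
    (e.g. speaker identity); the quantized residuals stand in for
    the phonetic/prosodic content stream.
    """
    constant = sum(frames) / len(frames)
    tokens = [quantize(f - constant, step) for f in frames]
    return constant, tokens

def detokenize(tokens, constant, step=0.5):
    """Reconstruct frames from tokens plus a (possibly different) constant."""
    return [t * step + constant for t in tokens]

# Two toy "speakers" producing the same content at different constant offsets.
content = [0.0, 1.0, -1.0, 0.5, -0.5]
speaker_a = [c + 3.0 for c in content]
speaker_b = [c - 2.0 for c in content]

const_a, tokens_a = tokenize(speaker_a)
const_b, tokens_b = tokenize(speaker_b)

assert tokens_a == tokens_b          # token stream is speaker-independent
assert detokenize(tokens_a, const_b) == speaker_b  # swap constants: voice conversion
```

Because the constant is removed before quantization, both speakers map to the identical token stream, and re-attaching a different constant at decode time performs a crude voice conversion, mirroring how a disentangled codec supports the VCTK voice-conversion setting above.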
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 7.1 | 833 |
| Text-to-Speech | Seed-TTS (eval) | WER | 4 | 39 |
| Voice Conversion | VCTK | WER | 0.7 | 21 |
| Speech Recognition | Switchboard | WER | 18.6 | 18 |
| Speech Reconstruction | Salmon Sentiment Consistency 2025b (emotional, OOD) | WER | 4.4 | 18 |
| Text-to-Speech | LibriTTS clean (test) | WER | 0.042 | 15 |
| Speech Reconstruction | LibriSpeech clean (test) | WER | 2.4 | 15 |
| Audio Encoding and Decoding Efficiency | NVIDIA A6000 Efficiency Benchmark | RTF (Encoding) | 9.00e-4 | 12 |
| Speech Reconstruction | Gigaspeech 2021 (noisy speech, OOD) | WER | 11.3 | 9 |
| Speech Reconstruction | Japanese Versatile Speech 2019 (unseen language, OOD) | WER | 5.6 | 9 |