
Kanade: A Simple Disentangled Tokenizer for Spoken Language Modeling

About

A good language model starts with a good tokenizer. Tokenization is especially important for speech modeling, which must handle continuous signals that mix linguistic and non-linguistic information. A speech tokenizer should extract phonetics and prosody, suppress linguistically irrelevant information like speaker identity, and enable high-quality synthesis. We present Kanade, a single-layer disentangled speech tokenizer that realizes this ideal. Kanade separates out acoustic constants to create a single stream of tokens that captures rich phonetics and prosody. It does so without the need for auxiliary methods that existing disentangled codecs often rely on. Experiments show that Kanade achieves state-of-the-art speaker disentanglement and lexical availability, while maintaining excellent reconstruction quality.
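The idea of separating out acoustic constants can be illustrated with a toy sketch. This is not Kanade's actual implementation: here a crude utterance-level statistic stands in for speaker identity, and it is removed before per-frame vector quantization, so the resulting single token stream carries only the time-varying (phonetic and prosodic) part of the signal. All names and shapes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 8))        # 64 codes, 8-dim features (toy sizes)

def tokenize(features):
    """Quantize speaker-normalized frames into a single token stream."""
    # Remove the utterance-level mean -- a crude stand-in for an
    # "acoustic constant" such as speaker identity.
    global_stats = features.mean(axis=0, keepdims=True)
    content = features - global_stats
    # Nearest-neighbour vector quantization, one token per frame.
    dists = np.linalg.norm(content[:, None, :] - codebook[None, :, :], axis=-1)
    tokens = dists.argmin(axis=1)
    return tokens, global_stats

features = rng.normal(size=(100, 8))       # 100 frames of dummy features
tokens, stats = tokenize(features)
print(tokens.shape)                        # (100,) -- a single stream of tokens
```

A decoder would consume the token stream together with separately supplied global statistics, which is what makes tasks like voice conversion (swap the speaker side, keep the tokens) fall out of the design.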

Zhijie Huang, Stephen McIntosh, Daisuke Saito, Nobuaki Minematsu • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Automatic Speech Recognition | LibriSpeech clean (test) | WER | 7.1 | 833
Text-to-Speech | Seed-TTS (eval) | WER | 4 | 39
Voice Conversion | VCTK | WER | 0.7 | 21
Speech Recognition | Switchboard | WER | 18.6 | 18
Speech Reconstruction | Salmon Sentiment Consistency emotional 2025b (OOD) | WER | 4.4 | 18
Text-to-Speech | LibriTTS clean (test) | WER | 0.042 | 15
Speech Reconstruction | LibriSpeech clean (test) | WER | 2.4 | 15
Audio Encoding and Decoding Efficiency | NVIDIA A6000 Efficiency Benchmark | RTF (Encoding) | 9.00e-4 | 12
Speech Reconstruction | Gigaspeech noisy speech 2021 (OOD) | WER | 11.3 | 9
Speech Reconstruction | Japanese Versatile Speech unseen language speech 2019 (OOD) | WER | 5.6 | 9

Showing 10 of 12 rows.
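The two metric types in the table are both standard. WER is word-level edit distance divided by reference length, and RTF (real-time factor) is processing time divided by audio duration. The sketch below is a textbook implementation, not the evaluation code behind these benchmarks; the example durations are assumptions chosen to match the order of the table's encoding RTF.

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance over words / reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words

# Real-time factor: processing time divided by audio duration.
audio_seconds = 10.0
encode_seconds = 0.009   # hypothetical: 9 ms to encode 10 s of audio
rtf = encode_seconds / audio_seconds   # 9e-4, the order of the table's figure
```

An RTF of 9.00e-4 therefore means encoding runs roughly a thousand times faster than real time on the benchmarked A6000.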
