Kanade: A Simple Disentangled Tokenizer for Spoken Language Modeling
About
A good language model starts with a good tokenizer. Tokenization is especially important for speech modeling, which must handle continuous signals that mix linguistic and non-linguistic information. A speech tokenizer should extract phonetics and prosody, suppress linguistically irrelevant information such as speaker identity, and enable high-quality synthesis. We present Kanade, a single-layer disentangled speech tokenizer that realizes this ideal. Kanade factors out utterance-level acoustic constants, such as speaker identity, to produce a single stream of tokens that captures rich phonetics and prosody. It does so without the auxiliary methods that existing disentangled codecs often rely on. Experiments show that Kanade achieves state-of-the-art speaker disentanglement and lexical availability while maintaining excellent reconstruction quality.
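The core idea, factoring a time-invariant "acoustic constant" out of each utterance so the remaining token stream carries only content, can be sketched with a toy scalar example. This is an illustration of the disentanglement principle under simplified assumptions (one-dimensional frames, a per-utterance mean as the constant, a scalar quantizer), not Kanade's actual architecture:

```python
def quantize(x, step=0.5):
    """Toy scalar quantizer: round a residual to the nearest codebook step."""
    return round(x / step)

def tokenize(frames, step=0.5):
    """Split an utterance into an utterance-level constant and a token stream.

    The per-utterance mean stands in for time-invariant attributes
    (e.g. speaker identity); the quantized residuals stand in for
    the phonetic/prosodic content stream.
    """
    constant = sum(frames) / len(frames)
    tokens = [quantize(f - constant, step) for f in frames]
    return constant, tokens

def detokenize(tokens, constant, step=0.5):
    """Reconstruct frames from tokens plus a (possibly different) constant."""
    return [t * step + constant for t in tokens]

# Two toy "speakers" producing the same content at different constant offsets.
content = [0.0, 1.0, -1.0, 0.5, -0.5]
speaker_a = [c + 3.0 for c in content]
speaker_b = [c - 2.0 for c in content]

const_a, tokens_a = tokenize(speaker_a)
const_b, tokens_b = tokenize(speaker_b)

assert tokens_a == tokens_b          # token stream is speaker-independent
assert detokenize(tokens_a, const_b) == speaker_b  # swap constants: voice conversion
```

Because the constant is removed before quantization, both speakers map to the identical token stream, and re-attaching a different constant at decode time performs a crude voice conversion, mirroring how a disentangled codec supports the VCTK voice-conversion setting above.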
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 7.1 | 833 |
| Text-to-Speech | Seed-TTS (eval) | WER | 4 | 39 |
| Voice Conversion | VCTK | WER | 0.7 | 21 |
| Speech Recognition | Switchboard | WER | 18.6 | 18 |
| Speech Reconstruction | Salmon Sentiment Consistency 2025b (emotional, OOD) | WER | 4.4 | 18 |
| Text-to-Speech | LibriTTS clean (test) | WER | 0.042 | 15 |
| Speech Reconstruction | LibriSpeech clean (test) | WER | 2.4 | 15 |
| Audio Encoding and Decoding Efficiency | NVIDIA A6000 Efficiency Benchmark | RTF (Encoding) | 9.00e-4 | 12 |
| Speech Reconstruction | Gigaspeech 2021 (noisy speech, OOD) | WER | 11.3 | 9 |
| Speech Reconstruction | Japanese Versatile Speech 2019 (unseen language, OOD) | WER | 5.6 | 9 |