
Sylber 2.0: A Universal Syllable Embedding

About

Scaling spoken language modeling requires speech tokens that are both efficient and universal. Recent work has proposed syllables as promising speech tokens at low temporal resolution, but existing models are constrained to English and fail to capture sufficient acoustic detail. To address this gap, we present Sylber 2.0, a self-supervised framework for coding speech at the syllable level that enables efficient temporal compression and high-fidelity reconstruction. Sylber 2.0 achieves a very low token frequency of around 5 Hz while retaining both linguistic and acoustic detail across multiple languages and expressive styles. Experiments show that it performs on par with previous models that operate at much higher token frequencies. Furthermore, Sylber 2.0 enables efficient TTS modeling, generating speech with intelligibility and quality competitive with SOTA models using only 72M parameters. Moreover, the universality of Sylber 2.0 provides more effective features for low-resource ASR than previous speech coding frameworks. In sum, we establish an effective syllable-level abstraction for general spoken language.
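To make the temporal-compression claim concrete, the sketch below compares the number of discrete tokens needed to code an utterance at the ~5 Hz syllable rate reported in the abstract against higher frame rates; the 25 Hz and 50 Hz baselines are assumed values typical of frame-level speech codecs, not figures from the paper.

```python
# Illustrative token-count arithmetic for different token rates.
# Only the ~5 Hz rate comes from the abstract; the baseline rates
# are assumptions for comparison.

def num_tokens(duration_s: float, rate_hz: float) -> int:
    """Number of discrete tokens needed to code duration_s seconds."""
    return round(duration_s * rate_hz)

duration = 10.0  # seconds of speech
for name, rate in [("syllable-level (~5 Hz)", 5.0),
                   ("assumed 25 Hz codec", 25.0),
                   ("assumed 50 Hz codec", 50.0)]:
    print(f"{name}: {num_tokens(duration, rate)} tokens")
```

At 5 Hz, a 10-second utterance needs only 50 tokens versus 500 at a 50 Hz frame rate, a 10x shorter sequence for a downstream language model to process.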

Cheol Jun Cho, Nicholas Lee, Alan W Black, Gopala K. Anumanchipalli • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Speech Reconstruction | LibriTTS (test-other) | UTMOS | 3.54 | 44 |
| Universal Speech Representation Evaluation | SUPERB Benchmark | SID Accuracy | 0.7203 | 27 |
| Text-to-Speech | LibriSpeech PC clean (test) | WER | 2.35 | 12 |
| Text-to-Speech | SeedTTS English (test) | WER | 1.92 | 12 |
| Automatic Speech Recognition | Korean (ko) low-resource | CER | 7.2 | 7 |
| Automatic Speech Recognition | Bemba (bem) low-resource | CER | 0.121 | 7 |
| Automatic Speech Recognition | Quechua (que) low-resource | CER | 30.1 | 7 |
| Speech Resynthesis | FLEURS-R Spanish (test) | WER | 3.18 | 7 |
| Speech Resynthesis | FLEURS-R 20 Languages (test) | WER | 7.57 | 7 |
| Singing Voice Resynthesis | GTSinger (test) | F0-PCC | 0.96 | 7 |
