LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Suite with Generative Speech Models

About

We present the LEMAS-Dataset, which, to our knowledge, is currently the largest open-source multilingual speech corpus with word-level timestamps. Covering over 150,000 hours across 10 major languages, LEMAS-Dataset is constructed via an efficient data processing pipeline that ensures high-quality data and annotations. To validate the effectiveness of LEMAS-Dataset across diverse generative paradigms, we train two benchmark models with distinct architectures and task specializations on this dataset. LEMAS-TTS, built upon a non-autoregressive flow-matching framework, leverages the dataset's massive scale and linguistic diversity to achieve robust zero-shot multilingual synthesis. Our proposed accent-adversarial training and CTC loss mitigate cross-lingual accent issues, enhancing synthesis stability. Complementarily, LEMAS-Edit employs an autoregressive decoder-only architecture that formulates speech editing as a masked token infilling task. By exploiting precise word-level alignments to construct training masks and adopting adaptive decoding strategies, it achieves seamless, smooth-boundary speech editing with natural transitions. Experimental results demonstrate that models trained on LEMAS-Dataset deliver high-quality synthesis and editing performance, confirming the dataset's quality. We envision that this richly timestamp-annotated, fine-grained multilingual corpus will drive future advances in prompt-based speech generation systems.
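To make the idea of building training masks from word-level alignments concrete, the following is a minimal sketch and not the authors' implementation. It assumes each word is given as a hypothetical `(text, start_sec, end_sec)` tuple and that the audio is represented as a token sequence at a fixed, assumed rate of 50 frames per second; the function name `word_span_to_frame_mask` is likewise illustrative.

```python
import math
from typing import List, Tuple

# Hypothetical alignment record: (word, start time in seconds, end time in seconds).
Word = Tuple[str, float, float]


def word_span_to_frame_mask(
    words: List[Word],
    first: int,
    last: int,
    num_frames: int,
    frame_rate: float = 50.0,  # assumed token rate, not taken from the paper
) -> List[int]:
    """Return a 0/1 mask over frames; 1 marks frames covered by words[first..last]."""
    if not (0 <= first <= last < len(words)):
        raise ValueError("first/last must index a non-empty span of words")
    start_sec = words[first][1]
    end_sec = words[last][2]
    # Round outward so the mask fully covers the selected words, then clip.
    start_frame = max(0, math.floor(start_sec * frame_rate))
    end_frame = min(num_frames, math.ceil(end_sec * frame_rate))
    return [1 if start_frame <= i < end_frame else 0 for i in range(num_frames)]


# Example: mask the middle word of a 1.5-second utterance (75 frames at 50 fps).
words = [("the", 0.0, 0.5), ("quick", 0.5, 1.0), ("fox", 1.0, 1.5)]
mask = word_span_to_frame_mask(words, first=1, last=1, num_frames=75)
```

In an infilling setup of this kind, the frames marked 1 would be the ones hidden from the model and predicted from the surrounding context; how the actual system selects spans and handles boundaries is not specified here.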

Zhiyuan Zhao, Lijian Lin, Ye Zhu, Kai Xie, Yunfei Liu, Yu Li • 2026

Related benchmarks

Task	Dataset	Result
Text-to-Speech	X-Voice (test)	WER 3.55	186
Subjective Speech Quality Evaluation	X-Voice (test)	IMOS 4.52	156
Zero-shot Text-to-Speech	Seed-TTS en (test)	WER 1.49	25
Speech Editing (Substitution)	Ming-Freeform-Audio-Edit English (basic)	DNSMOS 3.05	14
Speech Editing (Deletion)	Ming-Freeform-Audio-Edit English (basic)	DNSMOS 3.03	14
Speech Editing (Deletion)	Ming-Freeform-Audio-Edit English (full)	DNSMOS 3.02	14
Speech Editing (Insertion)	Ming-Freeform-Audio-Edit English (basic)	DNSMOS 3.05	14
Speech Editing (Insertion)	Ming-Freeform-Audio-Edit English (full)	DNSMOS 3.04	14
Speech Editing (Substitution)	Ming-Freeform-Audio-Edit English (full)	DNSMOS 3.05	14
Zero-shot Text-to-Speech	Seed-TTS zh (test)	WER 3.34	8

Showing 10 of 36 rows
