MOSS-TTS Technical Report
About
This technical report presents MOSS-TTS, a speech generation foundation model built on a scalable recipe: discrete audio tokens, autoregressive modeling, and large-scale pretraining. On top of MOSS-Audio-Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 fps with variable-bitrate RVQ and a unified semantic-acoustic representation, we release two complementary generators: MOSS-TTS, which emphasizes structural simplicity, scalability, and long-context, control-oriented deployment; and MOSS-TTS-Local-Transformer, which adds a frame-local autoregressive module for higher modeling efficiency, stronger speaker preservation, and a shorter time to first audio. Across multilingual and open-domain settings, MOSS-TTS supports zero-shot voice cloning, token-level duration control, phoneme- and pinyin-level pronunciation control, smooth code-switching, and stable long-form generation. This report summarizes the design, training recipe, and empirical characteristics of the released models.
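The compression implied by the tokenizer's stated figures can be checked with a short back-of-envelope calculation. The sketch below uses only the numbers quoted above (24 kHz input, 12.5 frames per second); the RVQ depths are hypothetical examples, not released configurations.

```python
# Back-of-envelope check of the temporal compression implied by
# MOSS-Audio-Tokenizer's stated figures (24 kHz audio -> 12.5 fps tokens).

SAMPLE_RATE = 24_000   # input audio sample rate in Hz (from the report)
FRAME_RATE = 12.5      # discrete token frames per second (from the report)

# Each token frame summarizes this many raw audio samples:
samples_per_frame = SAMPLE_RATE / FRAME_RATE
print(samples_per_frame)  # 1920.0

# With variable-bitrate RVQ, the token rate scales with how many
# codebooks (the RVQ depth) are actually used per frame.
def tokens_per_second(rvq_depth: int) -> float:
    """Tokens emitted per second of audio at a given (hypothetical) RVQ depth."""
    return FRAME_RATE * rvq_depth

for depth in (1, 4, 8):  # hypothetical depths for illustration
    print(f"depth={depth}: {tokens_per_second(depth):g} tokens/s")
```

At 12.5 fps, even a deep RVQ stack yields a token rate far below the raw sample rate, which is what makes long-context autoregressive modeling over audio tractable.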
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Reconstruction | AudioSet (eval) | Mel Distance | 0.68 | 63 |
| Speech Reconstruction | LibriSpeech English (test-clean) | SIM | 0.97 | 54 |
| Speech Reconstruction | AISHELL-2 Chinese | SIM | 0.93 | 54 |
| Text-to-Speech | Seed-ZH | CER | 1.2 | 23 |
| Text-to-Speech | Seed-EN | WER | 1.85 | 22 |
| Music Reconstruction | MUSDB | Mel Loss | 0.64 | 18 |
| Voice Cloning | CV3-Eval multilingual voice cloning (test) | WER (zh) | 3.68 | 18 |