MOSS-TTS Technical Report
About
This technical report presents MOSS-TTS, a speech generation foundation model built on a scalable recipe: discrete audio tokens, autoregressive modeling, and large-scale pretraining. On top of MOSS-Audio-Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 fps with variable-bitrate RVQ and a unified semantic-acoustic representation, we release two complementary generators: MOSS-TTS, which emphasizes structural simplicity, scalability, and long-context, control-oriented deployment; and MOSS-TTS-Local-Transformer, which adds a frame-local autoregressive module for higher modeling efficiency, stronger speaker preservation, and a shorter time to first audio. Across multilingual and open-domain settings, MOSS-TTS supports zero-shot voice cloning, token-level duration control, phoneme- and pinyin-level pronunciation control, smooth code-switching, and stable long-form generation. This report summarizes the design, training recipe, and empirical characteristics of the released models.
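The compression implied by the tokenizer's stated figures can be checked with a short back-of-envelope calculation. The sketch below uses only the numbers quoted above (24 kHz input, 12.5 frames per second); the RVQ depths are hypothetical examples, not released configurations.

```python
# Back-of-envelope check of the temporal compression implied by
# MOSS-Audio-Tokenizer's stated figures (24 kHz audio -> 12.5 fps tokens).

SAMPLE_RATE = 24_000   # input audio sample rate in Hz (from the report)
FRAME_RATE = 12.5      # discrete token frames per second (from the report)

# Each token frame summarizes this many raw audio samples:
samples_per_frame = SAMPLE_RATE / FRAME_RATE
print(samples_per_frame)  # 1920.0

# With variable-bitrate RVQ, the token rate scales with how many
# codebooks (the RVQ depth) are actually used per frame.
def tokens_per_second(rvq_depth: int) -> float:
    """Tokens emitted per second of audio at a given (hypothetical) RVQ depth."""
    return FRAME_RATE * rvq_depth

for depth in (1, 4, 8):  # hypothetical depths for illustration
    print(f"depth={depth}: {tokens_per_second(depth):g} tokens/s")
```

At 12.5 fps, even a deep RVQ stack yields a token rate far below the raw sample rate, which is what makes long-context autoregressive modeling over audio tractable.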
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Reconstruction | AudioSet (eval) | Mel Distance | 0.68 | 63 |
| Speech Reconstruction | LibriSpeech English (test-clean) | SIM | 0.97 | 54 |
| Speech Reconstruction | AISHELL-2 Chinese | SIM | 0.93 | 54 |
| Text-to-Speech | Seed-ZH | CER | 1.2 | 23 |
| Text-to-Speech | Seed-EN | WER | 1.85 | 22 |
| Music Reconstruction | MUSDB | Mel Loss | 0.64 | 18 |
| Voice Cloning | CV3-Eval multilingual voice cloning (test) | WER (zh) | 3.68 | 18 |