Fish Audio S2 Technical Report
About
We introduce Fish Audio S2, an open-source text-to-speech system that supports multi-speaker, multi-turn generation and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data pipeline covering video captioning, speech captioning, voice-quality assessment, and reward modeling. To push the frontier of open-source TTS, we release our model weights, fine-tuning code, and an SGLang-based inference engine. The inference engine is production-ready for streaming, achieving a real-time factor (RTF) of 0.195 and a time-to-first-audio below 100 ms. Our code and weights are available on GitHub (https://github.com/fishaudio/fish-speech) and Hugging Face (https://huggingface.co/fishaudio/s2-pro). We encourage readers to visit https://fish.audio to try custom voices.
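To put the reported RTF in concrete terms, the sketch below assumes the conventional definition of real-time factor (wall-clock synthesis time divided by the duration of the audio produced). The abstract does not state the hardware, batch size, or measurement protocol, so the helper names and the 10-second clip are purely illustrative.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF under the conventional definition (an assumption here):
    wall-clock synthesis time divided by the duration of generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(rtf: float, audio_seconds: float) -> float:
    """Wall-clock time needed to generate `audio_seconds` of speech at a given RTF."""
    return rtf * audio_seconds


REPORTED_RTF = 0.195  # figure quoted in the abstract

# Illustrative clip length; not a number from the report.
clip_seconds = 10.0
print(f"Synthesis time for {clip_seconds:.0f} s of audio: "
      f"{synthesis_time(REPORTED_RTF, clip_seconds):.2f} s")
print(f"Speed relative to real time: {1.0 / REPORTED_RTF:.2f}x")
```

An RTF below 1 means audio is generated faster than it plays back, which is the precondition for streaming without buffer underruns. Time-to-first-audio is a separate latency measure and is not derived from RTF.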
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-Speech | X-Voice (test) | WER | 1.37 | 186 |
| Subjective Speech Quality Evaluation | X-Voice (test) | IMOS | 4.8 | 156 |
| Zero-shot Text-to-Speech | Seed-TTS en (test) | WER | 1.37 | 25 |
| Text-to-Speech | EmergentTTS (eval) | Overall WER | 8.15 | 25 |
| Text-to-Speech | InstructTTSEval ZH | APS | 29.61 | 24 |
| Voice Cloning | CV3-Eval multilingual voice cloning (test) | WER (zh) | 2.65 | 18 |
| Voice Cloning | Seed-TTS en (test) | WER | 0.99 | 16 |
| Speaker Similarity | 51 speaker prompts Emotion Control evaluation | Speaker Similarity | 0.5731 | 10 |
| Speech Synthesis | Audio Turing Test (ATT) | Mean ATT Score | 51.5 | 8 |
| Voice-cloning intelligibility | Seed-TTS-Eval zh (test) | WER | 0.54 | 8 |