Fish Audio S2 Technical Report
About
We introduce Fish Audio S2, an open-source text-to-speech system featuring multi-speaker, multi-turn generation and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data pipeline covering video captioning and speech captioning, voice-quality assessment, and reward modeling. To push the frontier of open-source TTS, we release our model weights, fine-tuning code, and an SGLang-based inference engine. The inference engine is production-ready for streaming, achieving a real-time factor (RTF) of 0.195 and a time-to-first-audio below 100 ms. Our code and weights are available on GitHub (https://github.com/fishaudio/fish-speech) and Hugging Face (https://huggingface.co/fishaudio/s2-pro). We encourage readers to visit https://fish.audio to try custom voices.
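To make the two streaming figures concrete, the sketch below shows how RTF and time-to-first-audio are commonly measured. It is a minimal illustration under stated assumptions, not the report's benchmarking code: RTF is taken in its usual sense of synthesis time divided by the duration of the audio produced, and `measure_stream` and its chunk iterable are hypothetical stand-ins for a streaming engine that yields audio samples.

```python
import time
from typing import Iterable, Sequence, Tuple


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def measure_stream(
    chunks: Iterable[Sequence[float]], sample_rate: int
) -> Tuple[float, float]:
    """Consume a (hypothetical) stream of audio chunks.

    Returns (time_to_first_audio_seconds, rtf).
    """
    start = time.perf_counter()
    first_chunk_at = None
    total_samples = 0
    for chunk in chunks:
        if first_chunk_at is None:
            # Latency until the first audio chunk arrives.
            first_chunk_at = time.perf_counter() - start
        total_samples += len(chunk)
    elapsed = time.perf_counter() - start
    if first_chunk_at is None:
        raise ValueError("stream produced no audio")
    return first_chunk_at, real_time_factor(elapsed, total_samples / sample_rate)


# At an RTF of 0.195, 10 s of audio takes about 1.95 s to synthesize.
example_rtf = real_time_factor(1.95, 10.0)
```

An RTF below 1 means audio is generated faster than it plays back, which is the precondition for uninterrupted streaming.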
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Speech | EmergentTTS (eval) | Overall WER | 8.15 | 25 |
| Text-to-Speech | InstructTTSEval ZH | APS | 29.61 | 24 |
| Voice Cloning | CV3-Eval multilingual voice cloning (test) | WER (zh) | 2.65 | 18 |
| Voice Cloning | Seed-TTS en (test) | WER | 0.99 | 16 |
| Speech Synthesis | Audio Turing Test (ATT) | Mean ATT Score | 51.5 | 8 |
| Voice-cloning intelligibility | Seed-TTS-Eval zh (test) | WER | 0.54 | 8 |
| Single-utterance Voice Design | Human Evaluation set for single-utterance voice design | Overall Score | 2.07 | 5 |
| Speech Generation | Long-Audio benchmark English | WER | 4.38 | 4 |
| Speech Generation | Long-Audio benchmark Chinese | CER | 5.95 | 4 |
| Voice-cloning intelligibility | Seed-TTS-Eval (zh-hard) | WER | 5.99 | 4 |