Fish Audio S2 Technical Report

About

We introduce Fish Audio S2, an open-source text-to-speech system featuring multi-speaker, multi-turn generation and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data pipeline covering video captioning and speech captioning, voice-quality assessment, and reward modeling. To push the frontier of open-source TTS, we release our model weights, fine-tuning code, and an SGLang-based inference engine. The inference engine is production-ready for streaming, achieving a real-time factor (RTF) of 0.195 and a time-to-first-audio below 100 ms. Our code and weights are available on GitHub (https://github.com/fishaudio/fish-speech) and Hugging Face (https://huggingface.co/fishaudio/s2-pro). We highly encourage readers to visit https://fish.audio to try custom voices.
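To make the reported RTF concrete, here is a minimal sketch of the arithmetic. It assumes the common definition of RTF as processing time divided by the duration of the generated audio; the abstract does not spell out its measurement setup, and the function name and the 10-second clip are illustrative only.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return RTF, assumed here to be compute time / generated audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Reported value from the abstract.
reported_rtf = 0.195

# Hypothetical clip length, chosen only for illustration.
audio_seconds = 10.0

# Compute time implied by the reported RTF for that clip.
implied_compute_seconds = reported_rtf * audio_seconds

# How many times faster than real time generation runs at this RTF.
speedup = 1.0 / reported_rtf

print(f"compute time for {audio_seconds:.0f} s of audio: {implied_compute_seconds:.2f} s")
print(f"speed relative to real time: {speedup:.2f}x")
```

Under this definition, any RTF below 1 means audio is produced faster than it plays back, which is the precondition for uninterrupted streaming.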

Shijia Liao, Yuxuan Wang, Songting Liu, Yifan Cheng, Ruoyi Zhang, Tianyu Li, Shidong Li, Yisheng Zheng, Xingwei Liu, Qingzheng Wang, Zhizhuo Zhou, Jiahua Liu, Xin Chen, Dawei Han • 2026

Related benchmarks

Task | Dataset | Result | Rank
Text-to-Speech | X-Voice (test) | WER 1.37 | 186
Subjective Speech Quality Evaluation | X-Voice (test) | IMOS 4.8 | 156
Zero-shot Text-to-Speech | Seed-TTS en (test) | WER 1.37 | 25
Text-to-Speech | EmergentTTS (eval) | Overall WER 8.15 | 25
Text-to-Speech | InstructTTSEval ZH | APS 29.61 | 24
Voice Cloning | CV3-Eval multilingual voice cloning (test) | WER (zh) 2.65 | 18
Voice Cloning | Seed-TTS en (test) | WER 0.99 | 16
Speaker Similarity | 51 speaker prompts Emotion Control evaluation | Speaker Similarity 0.5731 | 10
Speech Synthesis | Audio Turing Test (ATT) | Mean ATT Score 51.5 | 8
Voice-cloning intelligibility | Seed-TTS-Eval zh (test) | WER 0.54 | 8
Showing 10 of 20 rows
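
Several rows above report WER (word error rate). As a rough illustration of the metric, the sketch below computes word-level edit distance divided by the reference length. It is an assumption-laden toy: the benchmarks' actual ASR models, text normalization, and tokenization (Chinese test sets are often scored at the character level) are not described on this page, and the function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous DP row.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(
                min(
                    previous_row[j] + 1,                      # deletion
                    current_row[j - 1] + 1,                   # insertion
                    previous_row[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous_row = current_row

    return previous_row[-1] / len(ref_words)


# Toy transcripts, invented for illustration (one substitution, one deletion).
reference_text = "the cat sat on the mat"
transcribed_text = "the cat sit on mat"
print(f"WER: {word_error_rate(reference_text, transcribed_text):.3f}")
```

The function returns a fraction; whether the figures in the table are fractions or percentages is not stated on this page.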
