Fish Audio S2 Technical Report

About

We introduce Fish Audio S2, an open-sourced text-to-speech system featuring multi-speaker, multi-turn generation, and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data pipeline covering video captioning and speech captioning, voice-quality assessment, and reward modeling. To push the frontier of open-source TTS, we release our model weights, fine-tuning code, and an SGLang-based inference engine. The inference engine is production-ready for streaming, achieving an RTF of 0.195 and a time-to-first-audio below 100 ms. Our code and weights are available on GitHub (https://github.com/fishaudio/fish-speech) and Hugging Face (https://huggingface.co/fishaudio/s2-pro). We highly encourage readers to visit https://fish.audio to try custom voices.
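To make the reported RTF figure concrete: the real-time factor is conventionally defined as synthesis time divided by the duration of the audio produced, so values below 1 mean faster-than-real-time generation. The following is a minimal sketch of that standard definition, not the report's measurement code; the function name and the example timings (1.95 s of compute for 10 s of audio) are hypothetical and chosen only to reproduce an RTF of 0.195.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return RTF = time spent synthesizing / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Hypothetical timings, chosen only to illustrate an RTF of 0.195.
rtf = real_time_factor(synthesis_seconds=1.95, audio_seconds=10.0)
print(f"RTF = {rtf:.3f}")
print(f"Throughput relative to real time: {1.0 / rtf:.2f}x")
```

Under this definition, an RTF of 0.195 corresponds to generating audio roughly five times faster than it plays back. Time-to-first-audio is a separate latency measure and is not captured by RTF.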

Shijia Liao, Yuxuan Wang, Songting Liu, Yifan Cheng, Ruoyi Zhang, Tianyu Li, Shidong Li, Yisheng Zheng, Xingwei Liu, Qingzheng Wang, Zhizhuo Zhou, Jiahua Liu, Xin Chen, Dawei Han • 2026

Related benchmarks

Task | Dataset | Result | Rank
Text-to-Speech | EmergentTTS (eval) | Overall WER: 8.15 | 25
Text-to-Speech | InstructTTSEval ZH | APS: 29.61 | 24
Voice Cloning | CV3-Eval multilingual voice cloning (test) | WER (zh): 2.65 | 18
Voice Cloning | Seed-TTS en (test) | WER: 0.99 | 16
Speech Synthesis | Audio Turing Test (ATT) | Mean ATT Score: 51.5 | 8
Voice-cloning intelligibility | Seed-TTS-Eval zh (test) | WER: 0.54 | 8
Single-utterance Voice Design | Human Evaluation set for single-utterance voice design | Overall Score: 2.07 | 5
Speech Generation | Long-Audio benchmark English | WER: 4.38 | 4
Speech Generation | Long-Audio benchmark Chinese | CER: 5.95 | 4
Voice-cloning intelligibility | Seed-TTS-Eval (zh-hard) | WER: 5.99 | 4
Showing 10 of 12 rows
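
Several of the results listed above are word error rates (WER) or character error rates (CER). As a reminder of what those metrics measure, here is a minimal sketch of the standard definition: the Levenshtein edit distance between a reference transcript and a recognized transcript, divided by the reference length. It assumes the listed values are percentages, which this page does not state. The ASR model and text normalization used by each benchmark are also not given here, so the helper names and sample strings below are purely illustrative.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, using whitespace tokenization."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, ignoring spaces."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# Illustrative strings only; not drawn from any of the benchmarks.
print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deleted word out of six
print(cer("你好世界", "你好世间"))  # one substituted character out of four
```

In TTS evaluation the hypothesis is typically the output of an ASR system run on the synthesized audio, so lower values indicate more intelligible speech.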
