Fish Audio S2 Technical Report

About

We introduce Fish Audio S2, an open-source text-to-speech system featuring multi-speaker, multi-turn generation and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data pipeline covering video captioning and speech captioning, voice-quality assessment, and reward modeling. To push the frontier of open-source TTS, we release our model weights, fine-tuning code, and an SGLang-based inference engine. The inference engine is production-ready for streaming, achieving a real-time factor (RTF) of 0.195 and a time-to-first-audio below 100 ms. Our code and weights are available on GitHub (https://github.com/fishaudio/fish-speech) and Hugging Face (https://huggingface.co/fishaudio/s2-pro). We encourage readers to visit https://fish.audio to try custom voices.
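To make the reported RTF concrete, here is a minimal sketch that assumes the common definition of real-time factor: synthesis wall-clock time divided by the duration of the generated audio. The abstract does not spell out its measurement setup, so the function names and the timing values below are hypothetical and for illustration only.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF under the common definition: compute time / generated audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Hypothetical timing: 10 s of audio synthesized in 1.95 s of wall-clock time.
rtf = real_time_factor(synthesis_seconds=1.95, audio_seconds=10.0)
print(f"RTF = {rtf:.3f}")
print(f"Speedup over real time = {speedup_over_real_time(rtf):.2f}x")
```

Under this definition, an RTF below 1 means audio is generated faster than it plays back, and an RTF of 0.195 corresponds to roughly five seconds of audio per second of compute.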

Shijia Liao, Yuxuan Wang, Songting Liu, Yifan Cheng, Ruoyi Zhang, Tianyu Li, Shidong Li, Yisheng Zheng, Xingwei Liu, Qingzheng Wang, Zhizhuo Zhou, Jiahua Liu, Xin Chen, Dawei Han • 2026

Related benchmarks

Task	Dataset	Result
Text-to-Speech	X-Voice (test)	WER 1.37	186
Subjective Speech Quality Evaluation	X-Voice (test)	IMOS 4.8	156
Zero-shot Text-to-Speech	Seed-TTS en (test)	WER 1.37	25
Text-to-Speech	EmergentTTS (eval)	Overall WER 8.15	25
Text-to-Speech	InstructTTSEval ZH	APS 29.61	24
Voice Cloning	CV3-Eval multilingual voice cloning (test)	WER (zh) 2.65	18
Voice Cloning	Seed-TTS en (test)	WER 0.99	16
Speaker Similarity	51 speaker prompts Emotion Control evaluation	Speaker Similarity 0.5731	10
Speech Synthesis	Audio Turing Test (ATT)	Mean ATT Score 51.5	8
Voice-cloning intelligibility	Seed-TTS-Eval zh (test)	WER 0.54	8

