FireRedTTS-2: Towards Long Conversational Speech Generation for Podcast and Chatbot

About

Current dialogue generation approaches typically require the complete dialogue text before synthesis and produce a single, inseparable audio stream containing all voices, which makes them unsuitable for interactive chat; moreover, they suffer from unstable synthesis, inaccurate speaker transitions, and incoherent prosody. In this work, we present FireRedTTS-2, a long-form streaming TTS system for multi-speaker dialogue generation that delivers stable, natural speech with reliable speaker switching and context-aware prosody. A new 12.5Hz streaming speech tokenizer accelerates training and inference, extends the maximum dialogue length, encodes richer semantics to stabilize text-to-token modeling, and supports high-fidelity streaming generation for real-time applications. We adopt a text-speech interleaved format, concatenating speaker-labeled text with aligned speech tokens in chronological order, and model it with a dual-transformer: a large decoder-only transformer predicts tokens at the first layer, and a smaller one completes the subsequent layers. Experimental results show that FireRedTTS-2 integrates seamlessly with chat frameworks and, with minimal fine-tuning, produces emotionally expressive speech guided by implicit contextual cues. In podcast generation, it surpasses existing systems including MoonCast, Zipvoice-Dialogue, and MOSS-TTSD in objective intelligibility, speaker-turn reliability, and perceived naturalness with context-consistent prosody. Our demos are available at https://fireredteam.github.io/demos/firered_tts_2.
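As a rough illustration of two ideas in the abstract, the sketch below computes how many token frames a 12.5Hz tokenizer yields for a given audio duration and builds a toy text-speech interleaved sequence. It is a minimal sketch under stated assumptions, not the FireRedTTS-2 implementation: the names `Turn`, `frames_for_duration`, and `interleave`, the `[S1]` speaker-label format, the whitespace text split, and the placeholder token ids are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Union

FRAME_RATE_HZ = 12.5  # tokenizer frame rate stated in the abstract


@dataclass
class Turn:
    """One dialogue turn (hypothetical container, not from the paper)."""
    speaker: str              # e.g. "S1"; the label format is an assumption
    text: str                 # transcript of this turn
    speech_tokens: List[int]  # placeholder first-layer speech token ids


def frames_for_duration(seconds: float) -> int:
    """Token frames needed to cover `seconds` of audio at 12.5 frames/s."""
    return math.ceil(seconds * FRAME_RATE_HZ)


def interleave(turns: List[Turn]) -> List[Union[str, int]]:
    """Concatenate speaker-labeled text and its speech tokens, turn by turn.

    Turns are assumed to be given in chronological order already.
    """
    sequence: List[Union[str, int]] = []
    for turn in turns:
        sequence.append(f"[{turn.speaker}]")   # speaker label
        sequence.extend(turn.text.split())     # toy whitespace "text tokens"
        sequence.extend(turn.speech_tokens)    # speech tokens for this turn
    return sequence
```

Under these assumptions, each frame covers 80 ms of audio, so `frames_for_duration(180)` gives 2250 frames for a three-minute dialogue. The low frame rate keeps long dialogues within a manageable sequence length, which is consistent with the abstract's claim about extending the maximum dialogue length.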

Kun Xie, Feiyu Shen, Junjie Li, Fenglong Xie, Xu Tang, Yao Hu • 2025

Related benchmarks

Task	Dataset	Result
Text-to-Speech	Seed-TTS en (test)	WER 1.95	159
Text-to-Speech	Seed-TTS zh (test)	WER 1.14	87
Voice Cloning	Seed-TTS en (test)	WER 1.95	53
Text-to-Speech	Seed-TTS (eval)	WER 1.95	39
Voice Cloning	Seed-TTS-Eval zh (test)	CER 1.14	37
Text-to-Speech	Seed-TTS Seed-EN (test)	WER 0.0195	32
Zero-shot Text-to-Speech	Seed-TTS en (test)	WER 2.187	25
Text-to-Speech	English (test)	WER 0.0195	21
Text-to-Speech	Chinese standard (test)	CER 1.14	21
Text-to-Speech	Seed-TTS Seed-ZH (Evaluation)	CER 1.14	16

Showing 10 of 24 rows
