Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
About
Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a single-stream speech codec that decomposes speech into two complementary token types: low-bitrate semantic tokens for linguistic content and fixed-length global tokens for speaker attributes. This disentangled representation, combined with the Qwen2.5 LLM and a chain-of-thought (CoT) generation approach, enables both coarse-grained control (e.g., gender, speaking style) and fine-grained adjustments (e.g., precise pitch values, speaking rate). To facilitate research in controllable TTS, we introduce VoxBox, a meticulously curated 100,000-hour dataset with comprehensive attribute annotations. Extensive experiments demonstrate that Spark-TTS not only achieves state-of-the-art zero-shot voice cloning but also generates highly customizable voices that surpass the limitations of reference-based synthesis. Source code, pre-trained models, and audio samples are available at https://github.com/SparkAudio/Spark-TTS.
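
The decoupled representation described above can be pictured as a small data structure. The following is a minimal sketch, not the Spark-TTS implementation: the class `DecoupledSpeech`, the function `to_single_stream`, the constant `GLOBAL_TOKEN_COUNT`, and the ordering of segments in the stream are all hypothetical names and choices made for illustration. The abstract states only that semantic tokens carry linguistic content, that global tokens have a fixed length and carry speaker attributes, and that both live in a single stream.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical fixed length for the global tokens; the abstract says only
# that the length is fixed, not what it is.
GLOBAL_TOKEN_COUNT = 32


@dataclass
class DecoupledSpeech:
    """Two complementary token types for one utterance (illustrative only)."""

    semantic_tokens: List[int]  # variable length: linguistic content over time
    global_tokens: List[int]    # fixed length: utterance-level speaker attributes

    def __post_init__(self) -> None:
        if len(self.global_tokens) != GLOBAL_TOKEN_COUNT:
            raise ValueError(
                f"expected {GLOBAL_TOKEN_COUNT} global tokens, "
                f"got {len(self.global_tokens)}"
            )


def to_single_stream(
    text_tokens: List[int], speech: DecoupledSpeech
) -> List[Tuple[str, int]]:
    """Flatten text and speech tokens into one tagged sequence.

    Assumed order (not confirmed by the abstract): text, then global
    tokens, then semantic tokens.
    """
    stream = [("text", t) for t in text_tokens]
    stream += [("global", g) for g in speech.global_tokens]
    stream += [("semantic", s) for s in speech.semantic_tokens]
    return stream
```

Because everything sits in one flat sequence, a single autoregressive language model could in principle predict it token by token, without a separate predictor per codebook.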
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 100.1 | 833 |
| Text-to-Speech | Seed-TTS en (test) | WER | 1.98 | 50 |
| Speech Reconstruction | LibriSpeech (test-clean) | STOI | 0.92 | 49 |
| Text-to-Speech | Seed-TTS zh (test) | WER | 1.2 | 47 |
| Text-to-Speech | Seed-TTS (eval) | WER | 1.98 | 39 |
| Voice Conversion | VCTK | WER | 1.2 | 21 |
| Text-to-Speech | Chinese standard (test) | CER | 1.54 | 21 |
| Text-to-Speech | English (test) | WER | 0.0314 | 21 |
| Speech Reconstruction | SeedTTS en (test) | WER | 0.0305 | 18 |
| Speech Reconstruction | Salmon Sentiment Consistency emotional 2025b (OOD) | WER | 5.4 | 18 |
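
Most rows above report word error rate (WER). As a reminder of what that number measures, here is a minimal sketch of the standard definition: the word-level edit distance between a reference transcript and a recognized hypothesis, divided by the number of reference words. The function name `word_error_rate` is illustrative, and the sketch does no text normalization (casing, punctuation), which real evaluation pipelines usually apply first. It returns a fraction; benchmark listings may report either fractions or percentages, so values in a table like the one above are not necessarily on the same scale.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Since insertions count as errors but do not lengthen the reference, WER is not bounded by 1 (or 100%): a hypothesis much longer than the reference can score above it.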