Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
About
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks such as LLaMA Question, Step-Audio shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 3.11 | 833 |
| Automatic Speech Recognition | LibriSpeech Other | WER | 8.44 | 75 |
| Automatic Speech Recognition | LibriSpeech Clean | WER | 3.11 | 57 |
| Text-to-Speech | LibriSpeech clean (test) | WER | 2.8 | 50 |
| General Audio Understanding | VoiceBench | AlpacaEval Score | 4.13 | 16 |
| Telecom Fraud Analysis | TeleAntiFraud-Bench | Weighted F1 (Sce.) | 76.35 | 15 |
| Automatic Speech Recognition | AISHELL-2 | ZH-CER | 3.6 | 9 |
| Voice Cloning | Common Voice English | SIM Score | 0.66 | 7 |
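
Several rows above report word error rate (WER) or character error rate (CER). As a reading aid, the following is a minimal sketch of how these metrics are conventionally computed: the Levenshtein edit distance between reference and hypothesis tokens, divided by the number of reference tokens. The function names and the whitespace tokenization are assumptions made for illustration; the text normalization used to produce the scores in the table is not stated here, and this is not the evaluation code from the Step-Audio repository.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate as a fraction (assumes whitespace tokenization)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate as a fraction (assumes spaces are ignored)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 / 6
    print(round(wer("the cat sat on the mat", "the cat sit on mat"), 4))
    # One substituted character out of four: 1 / 4
    print(cer("你好世界", "你好视界"))
```

Both functions return a fraction; leaderboard values such as those in the table are presumably percentages, in which case the fraction would be multiplied by 100.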