VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
About
Human speech conveys expressiveness beyond linguistic content, including personality, mood, or performance elements, such as a comforting tone or humming a song, which we formalize as role-playing and singing. We present VITA-QinYu, the first expressive end-to-end (E2E) spoken language model (SLM) that goes beyond natural conversation to support both role-playing and singing generation. VITA-QinYu adopts a hybrid speech-text paradigm that extends interleaved text-audio modeling with multi-codebook audio tokens, a design enabling richer paralinguistic representation while preserving a clear separation between modalities to avoid interference. We further develop a comprehensive data generation pipeline to synthesize a total of 15.8K hours of natural conversation, role-playing, and singing data for training. VITA-QinYu demonstrates superior expressiveness, outperforming peer SLMs by 7 percentage points on objective role-playing benchmarks, and surpassing peer models by 0.13 points on a 5-point MOS scale for singing. Simultaneously, it achieves state-of-the-art conversational accuracy and fluency, exceeding prior SLMs by 1.38 and 4.98 percentage points on the C3 and URO benchmarks, respectively. We open-source our code and models and provide an easy-to-use demo with full-stack support for streaming and full-duplex interaction.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 1.75 | 1207 |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 4.23 | 1206 |
| Automatic Speech Recognition | WenetSpeech Meeting (test) | -- | -- | 78 |
| Text-to-Speech | Seed-TTS EN | WER | 2.2 | 32 |
| Automatic Speech Recognition | AISHELL (test) | CER | 1.64 | 26 |
| Automatic Speech Recognition | WenetSpeech (test_net) | WER | 6.42 | 13 |
| Spoken Dialogue Evaluation | C3 ZH | Phonetic Error | 9.19 | 7 |
| Spoken Language Understanding and Dialogue | URO English | Understanding | 89.91 | 7 |
| Spoken Dialogue Evaluation | C3 EN | Phonetic Score | 41.38 | 7 |
| Spoken Language Understanding and Dialogue | URO Chinese | Understanding Score | 89.76 | 6 |
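
Several rows in the table above are reported as WER (word error rate) or CER (character error rate). As a reading aid, the sketch below shows the standard definition of both metrics: the Levenshtein edit distance between reference and hypothesis, divided by the reference length and expressed as a percentage. It is a minimal illustration, not the evaluation code used for these results. The function names are hypothetical, and it assumes no text normalization (casing, punctuation, number formatting), which real benchmark scoring usually applies first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions,
    insertions, deletions each cost 1), using a rolling DP row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by reference length, in percent."""
    if not ref_units:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(reference, hypothesis):
    """Word error rate: units are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: units are characters, spaces ignored
    (an assumption; conventions differ between benchmarks)."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))
```

For example, `wer("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words. CER is the usual choice for Chinese test sets, where word boundaries are not marked by spaces.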