SoulX-Duplug: Plug-and-Play Streaming State Prediction Module for Realtime Full-Duplex Speech Conversation
About
Recent advances in spoken dialogue systems have brought increased attention to human-like full-duplex voice interactions. However, our comprehensive review of this field reveals several challenges, including the difficulty of obtaining training data, catastrophic forgetting, and limited scalability. In this work, we propose SoulX-Duplug, a plug-and-play streaming state prediction module for full-duplex spoken dialogue systems. By jointly performing streaming ASR, SoulX-Duplug explicitly leverages textual information to identify user intent, effectively serving as a semantic VAD. To promote fair evaluation, we introduce SoulX-Duplug-Eval, which extends widely used benchmarks with improved bilingual coverage. Experimental results show that SoulX-Duplug enables low-latency streaming dialogue state control, and that the system built upon it outperforms existing full-duplex models in overall turn management and latency. We have open-sourced SoulX-Duplug and SoulX-Duplug-Eval.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| User Interruption | Bilingual Full-Duplex-Bench English | RL 1.03 | 12 |
| Overall Evaluation | Bilingual Full-Duplex-Bench English | Accuracy 81.2 | 8 |
| Turn Taking | Bilingual Full-Duplex-Bench English | TOR 93.3 | 6 |
| User Backchannel | Bilingual Full-Duplex-Bench English | RsR 74 | 6 |
| Pause Handling | Bilingual Full-Duplex-Bench English | TOR 35.2 | 6 |
| User Interruption | Bilingual Full-Duplex-Bench Chinese | RL 1.15 | 4 |
| Overall Evaluation | Bilingual Full-Duplex-Bench Chinese | Accuracy 91.6 | 2 |
| Turn Taking | Bilingual Full-Duplex-Bench Chinese | TOR 99.4 | 2 |
| User Backchannel | Bilingual Full-Duplex-Bench Chinese | RsR 80 | 2 |
| Pause Handling | Bilingual Full-Duplex-Bench Chinese | TOR 3.8 | 2 |