SoulX-Transcriber: A Robust End-to-End Framework for Multi-Speaker Speech Transcription
About
Recent advances in Automatic Speech Recognition (ASR) and Large Language Models (LLMs) have significantly improved speech understanding capabilities. However, multi-speaker speech transcription remains a challenging task, constrained by highly similar speaker voices, rapid turn-taking, overlapping utterances, and inaccurate speaker boundary segmentation. These challenges are particularly pronounced in real-world conversational audio, where speaker dynamics and acoustic conditions are highly variable. This technical report presents SoulX-Transcriber, a unified multi-speaker transcription system that jointly models speaker diarization (SD) and ASR within an LLM-based framework. SoulX-Transcriber adopts a two-stage training strategy to improve both speaker discrimination and transcription robustness. In the first stage, speaker-aware multi-task continuous pre-training enhances speaker representation learning and boundary perception. In the second stage, supervised fine-tuning further optimizes the model for accurate end-to-end speaker-attributed transcription under complex multi-speaker conditions. SoulX-Transcriber delivers strong performance and robustness across multiple public benchmarks, including AliMeeting, AISHELL-4, and AMI, while maintaining high adaptability to multi-domain scenarios.
Related benchmarks
| Task | Dataset | DER | Rank |
|---|---|---|---|
| Multi-speaker speech transcription and diarization | AliMeeting 5-minute long-form (test) | 5.72 | 4 |
| Multi-speaker speech transcription and diarization | AISHELL-4 5-minute long-form (test) | 7.73 | 4 |
| Speaker-attributed Speech Recognition | AISHELL-4 Short-form | 2.89 | 4 |
| Speaker-attributed Speech Recognition | AliMeeting Short-form | 5.39 | 4 |
| Speaker-attributed Speech Recognition | AMI-SDM Short-form | 11.67 | 4 |
| Multi-speaker Transcription | Daily Conversation internal (test) | 1.32 | 3 |
| Multi-speaker Transcription | Movies internal (test) | 23.56 | 3 |
| Multi-speaker Transcription | Podcast internal (test) | 21.15 | 3 |
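
For readers unfamiliar with the metric, the following is a minimal sketch of how a diarization error rate (DER) is conventionally computed: the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. The function name and the example durations are hypothetical, and the sketch assumes the DER values in the table above are percentages; the source states neither the scoring tool nor the collar and overlap settings used.

```python
def diarization_error_rate(missed: float, false_alarm: float,
                           confusion: float, total_reference: float) -> float:
    """Return DER as a percentage from error durations given in seconds.

    Hypothetical helper for illustration; it is not the scoring script
    used to produce the benchmark results.
    """
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    # Standard definition: all three error types share one denominator,
    # the total duration of reference speech.
    return 100.0 * (missed + false_alarm + confusion) / total_reference


# Made-up durations: 2 s missed, 1 s false alarm, 3 s confusion
# over 100 s of reference speech.
der = diarization_error_rate(2.0, 1.0, 3.0, 100.0)
print(f"DER = {der:.2f}")
```

Because false alarms are counted in the numerator but not in the denominator, a DER above 100 is possible in principle.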