SoulX-Transcriber: A Robust End-to-End Framework for Multi-Speaker Speech Transcription

About

Recent advances in Automatic Speech Recognition (ASR) and Large Language Models (LLMs) have significantly improved speech understanding capabilities. However, multi-speaker speech transcription remains a challenging task, constrained by highly similar speaker voices, rapid turn-taking, overlapping utterances, and inaccurate speaker boundary segmentation. These challenges become particularly pronounced in real-world conversational audio, where speaker dynamics and acoustic conditions are highly variable. This technical report presents SoulX-Transcriber, a unified multi-speaker transcription system that jointly models speaker diarization (SD) and ASR within an LLM-based framework. SoulX-Transcriber adopts a two-stage training strategy to improve both speaker discrimination and transcription robustness. In the first stage, speaker-aware multi-task continuous pre-training enhances speaker representation learning and boundary perception. In the second stage, supervised fine-tuning further optimizes the model for accurate end-to-end speaker-attributed transcription under complex multi-speaker conditions. SoulX-Transcriber delivers strong performance and robustness across multiple public benchmarks, including AliMeeting, AISHELL-4, and AMI, while maintaining high adaptability to multi-domain scenarios.

Yuhang Dai, Haopeng Lin, Zhennan Lin, Jiale Qian, Jun Wu, Hanke Xie, Hao Meng, Hanlin Wen, Chuang Ding, Shunshun Yin, Ming Tao, Lei Xie, Xinsheng Wang • 2026

Related benchmarks

Task	Dataset	Result
Multi-speaker speech transcription and diarization	AliMeeting 5-minute long-form (test)	DER 5.72	4
Multi-speaker speech transcription and diarization	AISHELL-4 5-minute long-form (test)	DER 7.73	4
Speaker-attributed Speech Recognition	AISHELL-4 Short-form	DER 2.89	4
Speaker-attributed Speech Recognition	AliMeeting Short-form	DER 5.39	4
Speaker-attributed Speech Recognition	AMI-SDM Short-form	DER 11.67	4
Multi-speaker Transcription	Daily Conversation internal (test)	DER 1.32	3
Multi-speaker Transcription	Movies internal (test)	DER 23.56	3
Multi-speaker Transcription	Podcast internal (test)	DER 21.15	3
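
The results above are reported as DER (diarization error rate). The listing does not say how the scores were computed, so the following is only a minimal sketch of the standard frame-level definition: missed speech, false alarms, and speaker confusion, divided by the total reference speaker time. The function name `frame_level_der` and the toy frames are hypothetical. The sketch assumes hypothesis speaker labels have already been mapped to reference labels, and it applies no forgiveness collar, both of which real scoring tools handle.

```python
from typing import Sequence, Set


def frame_level_der(
    reference: Sequence[Set[str]],
    hypothesis: Sequence[Set[str]],
) -> float:
    """Simplified frame-level diarization error rate.

    Each element is the set of speakers active in one fixed-length frame.
    Assumes hypothesis labels are already mapped onto reference labels.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must cover the same frames")

    missed = false_alarm = confusion = total = 0
    for ref, hyp in zip(reference, hypothesis):
        n_ref, n_hyp = len(ref), len(hyp)
        n_correct = len(ref & hyp)
        # Reference speakers with no hypothesis speaker to pair with.
        missed += max(0, n_ref - n_hyp)
        # Hypothesis speakers beyond the number of reference speakers.
        false_alarm += max(0, n_hyp - n_ref)
        # Paired speakers whose labels disagree.
        confusion += min(n_ref, n_hyp) - n_correct
        total += n_ref

    if total == 0:
        raise ValueError("reference contains no speech")
    return (missed + false_alarm + confusion) / total


# Toy example: five frames, speakers "A" and "B", one overlapped frame.
reference = [{"A"}, {"A"}, {"A", "B"}, {"B"}, set()]
hypothesis = [{"A"}, {"B"}, {"A"}, {"B"}, {"B"}]

der = frame_level_der(reference, hypothesis)
print(f"DER = {der:.2%}")  # DER = 60.00%
```

In the toy example, one frame is confused, one overlapped speaker is missed, and one silent frame is a false alarm, giving 3 errors over 5 reference speaker-frames.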
