
SoulX-Transcriber: A Robust End-to-End Framework for Multi-Speaker Speech Transcription

About

Recent advances in Automatic Speech Recognition (ASR) and Large Language Models (LLMs) have significantly improved speech understanding capabilities. However, multi-speaker speech transcription remains a challenging task, constrained by highly similar speaker voices, rapid turn-taking, overlapping utterances, and inaccurate speaker boundary segmentation. These challenges become particularly pronounced in real-world conversational audio, where speaker dynamics and acoustic conditions are highly variable. This technical report presents SoulX-Transcriber, a unified multi-speaker transcription system that jointly models speaker diarization (SD) and ASR within an LLM-based framework. SoulX-Transcriber adopts a two-stage training strategy to improve both speaker discrimination and transcription robustness. In the first stage, speaker-aware multi-task continuous pre-training enhances speaker representation learning and boundary perception. In the second stage, supervised fine-tuning further optimizes the model for accurate end-to-end speaker-attributed transcription under complex multi-speaker conditions. SoulX-Transcriber delivers strong performance and robustness across multiple public benchmarks, including AliMeeting, AISHELL-4, and AMI, while maintaining high adaptability to multi-domain scenarios.

Yuhang Dai, Haopeng Lin, Zhennan Lin, Jiale Qian, Jun Wu, Hanke Xie, Hao Meng, Hanlin Wen, Chuang Ding, Shunshun Yin, Ming Tao, Lei Xie, Xinsheng Wang • 2026

Related benchmarks

Task | Dataset | Result | Rank
Multi-speaker speech transcription and diarization | AliMeeting 5-minute long-form (test) | DER 5.72 | 4
Multi-speaker speech transcription and diarization | AISHELL-4 5-minute long-form (test) | DER 7.73 | 4
Speaker-attributed Speech Recognition | AISHELL-4 Short-form | DER 2.89 | 4
Speaker-attributed Speech Recognition | AliMeeting Short-form | DER 5.39 | 4
Speaker-attributed Speech Recognition | AMI-SDM Short-form | DER 11.67 | 4
Multi-speaker Transcription | Daily Conversation internal (test) | DER 1.32 | 3
Multi-speaker Transcription | Movies internal (test) | DER 23.56 | 3
Multi-speaker Transcription | Podcast internal (test) | DER 21.15 | 3
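
Every result above is a diarization error rate (DER). For readers unfamiliar with the metric, the sketch below shows the standard frame-based definition: missed speech, false-alarm speech, and speaker confusion, summed and divided by total reference speech time. It is an illustration only, not the scoring tool behind these numbers. The function names, the 10 ms frame step, and the assumption that hypothesis speaker labels are already mapped onto reference labels are all assumptions made here; real scorers also search for the optimal speaker mapping and often apply a forgiveness collar around boundaries.

```python
from typing import List, Set, Tuple

# (start_sec, end_sec, speaker_label); a hypothetical segment format for this sketch
Segment = Tuple[float, float, str]


def active_speakers(segments: List[Segment], t: float) -> Set[str]:
    """Return the set of speakers talking at time t."""
    return {spk for start, end, spk in segments if start <= t < end}


def diarization_error_rate(
    reference: List[Segment],
    hypothesis: List[Segment],
    step: float = 0.01,  # assumed frame step of 10 ms
) -> float:
    """Frame-based DER, assuming hypothesis labels are already mapped to reference labels."""
    end_time = max(end for _, end, _ in reference + hypothesis)
    n_frames = int(round(end_time / step))

    missed = false_alarm = confusion = total = 0
    for i in range(n_frames):
        t = (i + 0.5) * step  # sample each frame at its midpoint
        ref = active_speakers(reference, t)
        hyp = active_speakers(hypothesis, t)
        correct = len(ref & hyp)
        missed += max(0, len(ref) - len(hyp))        # reference speech with no hypothesis
        false_alarm += max(0, len(hyp) - len(ref))   # hypothesis speech with no reference
        confusion += min(len(ref), len(hyp)) - correct  # speech given to the wrong speaker
        total += len(ref)

    if total == 0:
        raise ValueError("reference contains no speech")
    return (missed + false_alarm + confusion) / total


if __name__ == "__main__":
    reference = [(0.0, 5.0, "A"), (5.0, 10.0, "B")]
    hypothesis = [(0.0, 6.0, "A"), (6.0, 10.0, "B")]
    print(f"DER = {diarization_error_rate(reference, hypothesis):.2%}")
```

In the usage example, the hypothesis places the speaker change one second late, so one second out of ten seconds of reference speech is attributed to the wrong speaker and the DER comes out at 10%. The table values are presumably percentages on the same scale.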