
Joint Learning Global-Local Speaker Classification to Enhance End-to-End Speaker Diarization and Recognition

About

Large Audio-Language Models (LALMs) have demonstrated remarkable performance in end-to-end speaker diarization and recognition. However, their speaker discriminability remains limited due to the scarcity of large-scale conversational data and the absence of explicit speaker representation optimization. To address this, we propose GLSC-SDR, a paradigm that jointly trains speaker classification with diarization and recognition. We further introduce a Global-Local Speaker Classification strategy, which uses clustered speakers as global labels and re-encoded intra-cluster speakers as local labels. This hierarchical design enhances fine-grained speaker discrimination while preserving semantic transcription accuracy. Experiments on AliMeeting, AISHELL-4, and AMI-SDM demonstrate that GLSC-SDR achieves competitive or superior performance compared to simulation-based and multi-encoder approaches, without relying on large-scale real conversational data.
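The hierarchical labeling described above can be sketched in a few lines: cluster speaker embeddings to obtain global labels, then re-enumerate the speakers inside each cluster to obtain local labels. This is a minimal illustration, not the paper's implementation; the function name, the use of plain k-means, and the toy embeddings are all assumptions.

```python
import numpy as np

def global_local_labels(embeddings, n_clusters, n_iters=10, seed=0):
    """Hypothetical sketch of Global-Local label assignment.

    Global label: cluster index from a simple k-means over speaker embeddings.
    Local label:  the speaker's index within its own cluster (re-encoded).
    """
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen distinct speakers.
    centroids = embeddings[rng.choice(len(embeddings), n_clusters, replace=False)].copy()
    global_lab = np.zeros(len(embeddings), dtype=int)
    for _ in range(n_iters):
        # Assign each speaker to its nearest centroid (global label).
        dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=-1)
        global_lab = dists.argmin(axis=1)
        # Recompute each centroid from its current members.
        for k in range(n_clusters):
            members = embeddings[global_lab == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    # Local label: re-enumerate speakers inside each cluster.
    local_lab = np.zeros_like(global_lab)
    for k in range(n_clusters):
        idx = np.where(global_lab == k)[0]
        local_lab[idx] = np.arange(len(idx))
    return global_lab, local_lab
```

In a joint-training setup, the global and local labels would each feed a classification head alongside the diarization/recognition losses; the sketch only covers how the two label sets could be derived.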

Yuhang Dai, Haopeng Lin, Jiale Qian, Ruiqi Yan, Hao Meng, Hanke Xie, Hanlin Wen, Shunshun Yin, Ming Tao, Xie Chen, Lei Xie, Xinsheng Wang · 2026

Related benchmarks

Task                                            | Dataset    | Result     | Rank
Speaker-attributed Automatic Speech Recognition | AliMeeting | WER 20.09  | 7
Speaker-attributed Automatic Speech Recognition | AMI SDM    | WER 17.49  | 7
Speaker-attributed Automatic Speech Recognition | AISHELL-4  | WER 21.36  | 6
