
DNCASR: End-to-End Training for Speaker-Attributed ASR

About

This paper introduces DNCASR, a novel end-to-end trainable system designed for joint neural speaker clustering and automatic speech recognition (ASR), enabling speaker-attributed transcription of long multi-party meetings. DNCASR uses two separate encoders to independently encode global speaker characteristics and local waveform information, along with two linked decoders to generate speaker-attributed transcriptions. The use of linked decoders allows the entire system to be jointly trained under a unified loss function. By employing a serialised training approach, DNCASR effectively addresses overlapping speech in real-world meetings, where the link improves the prediction of speaker indices in overlapping segments. Experiments on the AMI-MDM meeting corpus demonstrate that the jointly trained DNCASR outperforms a parallel system that does not have links between the speaker and ASR decoders. Using cpWER to measure the speaker-attributed word error rate, DNCASR achieves a 9.0% relative reduction on the AMI-MDM Eval set.
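The abstract reports results in cpWER, the concatenated minimum-permutation word error rate: each speaker's reference and hypothesis words are concatenated, the word error counts are summed under the best pairing of hypothesis speakers to reference speakers, and the total is normalised by the number of reference words. A minimal sketch of this metric (not the paper's implementation) is below; it assumes equal numbers of reference and hypothesis speakers and brute-forces the pairing, which is only practical for the handful of speakers typical of meetings:

```python
from itertools import permutations

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] = distance(ref[:i], hyp[:j])
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def cpwer(refs, hyps):
    """Concatenated minimum-permutation WER (simplified sketch).

    refs/hyps: dict mapping speaker label -> list of words, each speaker's
    utterances already concatenated in time order. Assumes the hypothesis
    has the same number of speakers as the reference.
    """
    ref_streams = list(refs.values())
    total_ref_words = sum(len(r) for r in ref_streams)
    # Try every assignment of hypothesis speakers to reference speakers
    # and keep the one with the fewest total word errors.
    best = min(
        sum(edit_distance(r, h) for r, h in zip(ref_streams, perm))
        for perm in permutations(list(hyps.values()))
    )
    return best / total_ref_words
```

For example, a hypothesis that swaps the speaker labels but makes one substitution error out of four reference words yields a cpWER of 0.25, because the metric scores the best label permutation, not the labels as given.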

Xianrui Zheng, Chao Zhang, Philip C. Woodland • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Speaker-attributed Transcription | AMI-MDM (dev) | cpWER | 30.7 | 5 |
| Speaker-attributed Transcription | AMI-MDM (eval) | cpWER | 31.5 | 5 |
| Speaker-attributed Automatic Speech Recognition | Synthetic LibriSpeech meetings 960h (train) | WER | 3.5 | 3 |
| Diarization-aware Automatic Speech Recognition | Synthetic meetings (Ours) | cpWER (%) | 8.7 | 2 |
| Diarization-aware Automatic Speech Recognition | LibriCSS OV10 (anechoic) (session) | cpWER (%) | 7.6 | 2 |
