
End-to-End Neural Speaker Diarization with Permutation-Free Objectives

About

In this paper, we propose a novel end-to-end neural-network-based speaker diarization method. Unlike most existing methods, our proposed method does not have separate modules for extraction and clustering of speaker representations. Instead, our model has a single neural network that directly outputs speaker diarization results. To realize such a model, we formulate the speaker diarization problem as a multi-label classification problem and introduce a permutation-free objective function that directly minimizes diarization errors without suffering from the speaker-label permutation problem. Besides its end-to-end simplicity, the proposed method also benefits from being able to explicitly handle overlapping speech during training and inference. Because of this benefit, our model can be easily trained or adapted with real-recorded multi-speaker conversations just by feeding the corresponding multi-speaker segment labels. We evaluated the proposed method on simulated speech mixtures. The proposed method achieved a diarization error rate (DER) of 12.28%, while a conventional clustering-based system produced a DER of 28.77%. Furthermore, domain adaptation with real-recorded speech provided a 25.6% relative improvement on the CALLHOME dataset. Our source code is available online at https://github.com/hitachi-speech/EEND.
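As a rough illustration of the permutation-free objective described in the abstract, the sketch below treats diarization as frame-wise multi-label classification: each frame has one binary label per speaker, so overlapping speech is simply a frame with more than one label set to 1. The loss is the binary cross-entropy under the best assignment of output channels to reference speakers. This is a minimal pure-Python sketch, not the authors' implementation; the function names, the toy data, and the normalization by frames times speakers are assumptions made for illustration.

```python
import itertools
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one probability p and one 0/1 label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def permutation_free_loss(probs, labels):
    """Return (loss, perm) minimizing the mean BCE over speaker permutations.

    probs, labels: lists of T frames, each a list of S per-speaker values.
    perm[s] is the reference speaker matched to output channel s.
    Hypothetical helper; brute force over S! permutations, fine for small S.
    """
    num_frames = len(probs)
    num_speakers = len(probs[0])
    best_loss, best_perm = float("inf"), None
    for perm in itertools.permutations(range(num_speakers)):
        total = 0.0
        for t in range(num_frames):
            for s in range(num_speakers):
                total += bce(probs[t][s], labels[t][perm[s]])
        loss = total / (num_frames * num_speakers)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: three frames, two speakers; the middle frame is overlapped.
# The network's channels are in the opposite order to the reference labels.
probs = [[0.9, 0.1], [0.8, 0.7], [0.1, 0.9]]
labels = [[0, 1], [1, 1], [1, 0]]

loss, perm = permutation_free_loss(probs, labels)
print(perm, round(loss, 4))
```

Because the minimum is taken over all channel-to-speaker assignments, the loss does not change if the reference speakers are listed in a different order, which is the property that sidesteps the speaker-label permutation problem.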

Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Kenji Nagamatsu, Shinji Watanabe • 2019

Related benchmarks

Task                 Dataset                                           DER (%)  Rank
Speaker Diarization  CALLHOME (test)                                   23.07    33
Speaker Diarization  Simulated speech mixtures β = 2 (test)            12.28    9
Speaker Diarization  Simulated beta = 3 (test)                         14.36    6
Speaker Diarization  Simulated beta = 5 (test)                         19.69    6
Speaker Diarization  CSJ (test)                                        25.37    6
Speaker Diarization  CALLHOME overlap ratio 11.8%                      23.07    4
Speaker Diarization  Simulated mixtures beta=2, overlap ratio 27.3%    12.28    3
Speaker Diarization  Simulated mixtures beta=3, overlap ratio 19.1%    14.36    3
Speaker Diarization  Simulated mixtures beta=5, overlap ratio 11.1%    19.69    3
