Simultaneous Diarization and Separation of Meetings through the Integration of Statistical Mixture Models
About
We propose an approach for simultaneous diarization and separation of meeting data. It consists of a complex Angular Central Gaussian Mixture Model (cACGMM) for speech source separation, and a von-Mises-Fisher Mixture Model (VMFMM) for diarization in a joint statistical framework. Through the integration, both spatial and spectral information are exploited for diarization and separation. We also develop a method for counting the number of active speakers in a segment of a meeting to support block-wise processing. While the total number of speakers in a meeting may be known, it is usually not known on a per-segment level. With the proposed speaker counting, joint diarization and source separation can be done segment-by-segment, and the permutation problem across segments is solved, thus allowing for block-online processing in the future. Experimental results on the LibriCSS meeting corpus show that the integrated approach outperforms a cascaded approach of diarization and speech enhancement in terms of WER, both on a per-segment and on a per-meeting level.
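To illustrate the diarization component named above, here is a minimal sketch of EM for a standalone von Mises-Fisher mixture. It is not the authors' integrated cACGMM/VMFMM model. It assumes unit-norm feature vectors (for example, speaker embeddings), a fixed concentration `kappa` shared by all components, and a known number of speakers. The function name, parameters, and the farthest-point initialization are hypothetical choices made for this example.

```python
import numpy as np


def vmf_mixture_em(x, num_speakers, kappa=20.0, num_iters=50):
    """EM for a von Mises-Fisher mixture with a fixed, shared concentration.

    x: (N, D) array of unit-norm feature vectors (assumed input format).
    Returns (means, weights, posteriors from the final E-step).
    """
    # Deterministic farthest-point initialisation (an assumption of this sketch):
    # start from the first vector, then repeatedly add the vector that is
    # least similar to all mean directions chosen so far.
    chosen = [0]
    for _ in range(1, num_speakers):
        similarity = (x @ x[chosen].T).max(axis=1)
        chosen.append(int(similarity.argmin()))
    means = x[chosen].copy()
    weights = np.full(num_speakers, 1.0 / num_speakers)

    for _ in range(num_iters):
        # E-step: the vMF density is proportional to exp(kappa * mu^T x).
        # With a shared kappa the normalisation constant cancels in the posterior.
        logits = kappa * (x @ means.T) + np.log(weights)
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        post = np.exp(logits)
        post /= post.sum(axis=1, keepdims=True)

        # M-step: mixture weights and re-normalised weighted mean directions.
        weights = np.maximum(post.mean(axis=0), 1e-10)
        weights /= weights.sum()
        sums = post.T @ x
        norms = np.linalg.norm(sums, axis=1, keepdims=True)
        means = sums / np.maximum(norms, 1e-10)

    return means, weights, post
```

Taking `post.argmax(axis=1)` yields one speaker label per feature vector. Fixing `kappa` keeps the example short; estimating it would require the Bessel-function normalizer that this sketch deliberately avoids.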
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Joint Diarization and Speech Separation | LibriCSS concatenated segments static scenario | cpWER (0S) | 4.2 | 5 |
| Joint Diarization and Speech Separation | LibriCSS concatenated segments (speaker relocation scenario) | cpWER (0S) | 17.2 | 5 |
| Meeting Recognition | LibriCSS individual segments | Error Rate (0S) | 4.3 | 4 |
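
The cpWER metric in the table is commonly defined as the concatenated minimum-permutation word error rate: all utterances of each speaker are concatenated, and the word error rate is computed under the speaker-to-stream assignment that gives the fewest errors. The sketch below follows that definition with a brute-force search over permutations. It is not the scoring script behind the reported numbers; the function names and the dictionary input format are assumptions, and the reference is assumed to contain at least one word.

```python
from itertools import permutations


def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1]


def cpwer(reference, hypothesis):
    """Concatenated minimum-permutation WER.

    reference, hypothesis: dicts mapping a speaker or stream label to a list
    of utterance strings in time order (hypothetical input format).
    """
    ref_streams = [" ".join(utts).split() for utts in reference.values()]
    hyp_streams = [" ".join(utts).split() for utts in hypothesis.values()]

    # Pad with empty streams so both sides have the same number of speakers.
    n = max(len(ref_streams), len(hyp_streams))
    ref_streams += [[] for _ in range(n - len(ref_streams))]
    hyp_streams += [[] for _ in range(n - len(hyp_streams))]

    total_ref_words = sum(len(r) for r in ref_streams)
    # Brute force over all assignments of hypothesis streams to speakers.
    best_errors = min(
        sum(word_edit_distance(r, h) for r, h in zip(ref_streams, perm))
        for perm in permutations(hyp_streams)
    )
    return best_errors / total_ref_words
```

The brute-force search grows factorially with the number of speakers, so it is only practical for small meetings; an assignment solver over a pairwise error matrix would be the usual replacement.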