Mamba-based Segmentation Model for Speaker Diarization
About
Mamba is a newly proposed architecture which behaves like a recurrent neural network (RNN) with attention-like capabilities. These properties are promising for speaker diarization, as the memory requirements of attention-based models make them poorly suited to long-form audio, and the capabilities of traditional RNNs are too limited. In this paper, we assess the potential of Mamba for diarization by comparing the state-of-the-art neural segmentation of the pyannote pipeline with our proposed Mamba-based variant. Mamba's stronger processing capabilities allow the use of longer local windows, which significantly improve diarization quality by making the speaker embedding extraction more reliable. We find Mamba to be a superior alternative to both traditional RNNs and the tested attention-based model. Our proposed Mamba-based system achieves state-of-the-art performance on three widely used diarization datasets.
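To illustrate the memory argument made in the abstract, the following back-of-the-envelope sketch compares the size of the attention score matrices in a naive self-attention layer, which grows quadratically with the number of frames, against a fixed-size recurrent state. The frame rate, head count, and state size are illustrative assumptions, not values taken from the paper, and memory-efficient attention implementations can avoid materializing the full matrix.

```python
# Illustrative assumptions only: none of these constants come from the paper.
FRAMES_PER_SECOND = 50   # assumed frame rate of the segmentation model input
NUM_HEADS = 4            # assumed number of attention heads
STATE_SIZE = 256         # assumed size of a recurrent hidden state
BYTES_PER_VALUE = 4      # float32


def attention_score_bytes(num_frames, num_heads=NUM_HEADS,
                          bytes_per_value=BYTES_PER_VALUE):
    """Bytes needed to hold the (frames x frames) score matrices of one
    naive self-attention layer: grows quadratically with num_frames."""
    return num_heads * num_frames * num_frames * bytes_per_value


def recurrent_state_bytes(state_size=STATE_SIZE,
                          bytes_per_value=BYTES_PER_VALUE):
    """Bytes needed to hold one recurrent state: independent of length."""
    return state_size * bytes_per_value


if __name__ == "__main__":
    for seconds in (10, 60, 600):
        frames = seconds * FRAMES_PER_SECOND
        print(f"{seconds:>4d} s: attention scores "
              f"{attention_score_bytes(frames):,} B, "
              f"recurrent state {recurrent_state_bytes():,} B")
```

Doubling the window length quadruples the attention score storage in this sketch, while the recurrent state stays the same size, which is why longer local windows are easier to afford with a recurrent-style model.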
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Speaker Diarization | AISHELL-4 | DER (%) | 10.5 | 20 |
| Speaker Diarization | RAMC | DER | 11 | 9 |
| Speaker Diarization | AliMeeting far | DER | 16.2 | 6 |
| Speaker Diarization | VoxConverse v0.3 | DER (%) | 0.093 | 5 |
| Speaker Diarization | AMI Channel 1 | DER (%) | 18.5 | 5 |
| Speaker Diarization | MSDWild Few | DER (%) | 19.8 | 4 |
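The benchmarks above are reported as diarization error rate (DER). As a reminder of how the metric is commonly defined, here is a minimal sketch that computes DER from already-measured error durations: false alarm, missed detection, and speaker confusion, divided by the total reference speech time. The function name and arguments are hypothetical, and the sketch does not cover how those durations are obtained (optimal speaker mapping, collars, or overlap handling), which differ between evaluation protocols.

```python
def diarization_error_rate(false_alarm, missed_detection,
                           speaker_confusion, total_speech):
    """Return DER as a fraction (multiply by 100 for a percentage).

    All arguments are durations in seconds. total_speech is the total
    duration of speech in the reference annotation.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    errors = false_alarm + missed_detection + speaker_confusion
    return errors / total_speech


if __name__ == "__main__":
    # Hypothetical durations, not results from the paper.
    der = diarization_error_rate(false_alarm=1.0, missed_detection=2.0,
                                 speaker_confusion=1.0, total_speech=40.0)
    print(f"DER = {der:.3f} ({100 * der:.1f} %)")
```

Because the numerator can exceed the reference speech time (for example, with many false alarms), DER is not bounded above by 100 %.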