Efficient and Generalizable Speaker Diarization via Structured Pruning of Self-Supervised Models

About

Self-supervised learning (SSL) models such as WavLM have substantially advanced speaker diarization by providing rich contextual speech representations. However, the high computational and memory costs of these models hinder deployment in real-time and resource-constrained scenarios. This work presents a systematic study on compressing SSL-based diarization models through structured pruning guided by knowledge distillation. We investigate pruning objectives that target both model parameters and computational complexity, and analyze alternative strategies, showing that a simple overall pruning approach provides the best balance between efficiency and accuracy. Our method achieves up to 80% model size reduction and 4x faster inference without performance degradation. Comprehensive experiments across eight public diarization datasets demonstrate that the pruned models consistently match or surpass the performance of their uncompressed counterparts. Furthermore, we show strong out-of-domain generalization on the CHiME-6 dataset, achieving accuracy comparable to the top systems in the CHiME-7 challenge without any domain adaptation. These results highlight that structured pruning, when guided by distillation, can yield efficient and generalizable diarization systems suitable for real-world applications.
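To illustrate the general idea of distillation-guided structured pruning described above, here is a minimal, hypothetical sketch. It is not the paper's implementation: a toy linear encoder stands in for WavLM, and the gate parameterization, loss weights, and pruning threshold are illustrative assumptions rather than the authors' actual pruning objectives.

```python
# Minimal sketch of distillation-guided structured pruning (illustrative only).
# A frozen "teacher" is the uncompressed model; the "student" learns per-channel
# gates so that low-gate channels can later be removed as whole structures.
import copy
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Stand-in for an SSL encoder producing frame-level features."""
    def __init__(self, dim_in=80, dim_hidden=256, dim_out=256):
        super().__init__()
        self.proj = nn.Linear(dim_in, dim_hidden)
        self.enc = nn.Linear(dim_hidden, dim_out)

    def forward(self, x, gate=None):
        h = torch.relu(self.proj(x))
        if gate is not None:          # per-channel gates emulate structured pruning
            h = h * gate
        return self.enc(h)

teacher = FeatureExtractor().eval()          # frozen, uncompressed model
student = copy.deepcopy(teacher)             # compressed model mimics the teacher
gate = nn.Parameter(torch.ones(256))         # one learnable gate per hidden channel

opt = torch.optim.Adam(list(student.parameters()) + [gate], lr=1e-3)
sparsity_weight = 1e-2                       # hypothetical efficiency/accuracy trade-off

for step in range(100):                      # toy loop on random "speech" features
    x = torch.randn(8, 200, 80)              # (batch, frames, feature_dim)
    with torch.no_grad():
        t_feat = teacher(x)
    s_feat = student(x, gate=torch.sigmoid(gate))
    distill = nn.functional.mse_loss(s_feat, t_feat)  # knowledge-distillation objective
    sparsity = torch.sigmoid(gate).sum()              # pushes gates toward zero
    loss = distill + sparsity_weight * sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()

# Channels whose gates collapse can be physically removed, shrinking the model.
keep = torch.sigmoid(gate) > 0.5
print(f"kept {int(keep.sum())} of {keep.numel()} hidden channels")
```

In a real system the gates would sit on structured units of the SSL encoder (e.g. attention heads or feed-forward channels), and the surviving units would be materialized into a smaller network before deployment.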

Jiangyu Han, Petr Pálka, Marc Delcroix, Federico Landini, Johan Rohdin, Jan Černocký, Lukáš Burget • 2025

Related benchmarks

Task                  Dataset              Result       Rank
Speaker Diarization   AliMeeting (test)    DER: 0.108   13
