Rethinking Leveraging Pre-Trained Multi-Layer Representations for Speaker Verification
About
Recent speaker verification studies have achieved notable success by leveraging layer-wise outputs from pre-trained Transformer models. However, few have explored how to aggregate these multi-level features beyond a static weighted average. We present Layer Attentive Pooling (LAP), a novel strategy for aggregating inter-layer representations from pre-trained speech models for speaker verification. LAP assesses the significance of each layer from multiple perspectives dynamically over time, and employs max pooling instead of averaging. Additionally, we propose a lightweight backend speaker model comprising LAP and Attentive Statistical Temporal Pooling (ASTP) to extract speaker embeddings from the outputs of a pre-trained model. Experiments on the VoxCeleb benchmark reveal that our compact architecture achieves state-of-the-art performance while greatly reducing training time. We further analyze the design of LAP and its dynamic weighting mechanism for capturing speaker characteristics.
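To make the contrast concrete, the sketch below compares a static weighted average of layer outputs with per-frame (time-dynamic) layer weighting and layer-wise max pooling, followed by an ASTP-style attentive mean/std pooling over time. This is a minimal illustration of the general ideas named in the abstract, not the paper's exact LAP/ASTP implementation; the shapes, the random scoring projections, and all variable names are assumptions for the example.

```python
import numpy as np

# Hypothetical setup: hidden states from a pre-trained Transformer,
# one [T, D] matrix per layer (L layers, T frames, D feature dims).
rng = np.random.default_rng(0)
L, T, D = 12, 50, 32
hidden = rng.standard_normal((L, T, D))

# Baseline: static weighted average -- one scalar weight per layer,
# shared across all time steps (uniform here; learned in practice).
static_w = np.full(L, 1.0 / L)                         # [L]
static_agg = np.einsum('l,ltd->td', static_w, hidden)  # [T, D]

# Time-dynamic idea (sketch): layer weights that vary per frame.
# A random projection stands in for a learned scoring function.
proj = rng.standard_normal(D)
scores = hidden @ proj                     # [L, T] layer score per frame
alpha = np.exp(scores - scores.max(0))     # softmax over the layer axis
alpha = alpha / alpha.sum(0)               # [L, T], columns sum to 1
dyn_agg = np.einsum('lt,ltd->td', alpha, hidden)       # [T, D]

# Max pooling across layers instead of averaging, as the abstract notes.
max_agg = hidden.max(axis=0)               # [T, D]

# ASTP-style pooling (sketch): attention over time, then the weighted
# mean and standard deviation concatenated into one utterance embedding.
t_scores = dyn_agg @ rng.standard_normal(D)            # [T]
beta = np.exp(t_scores - t_scores.max())
beta = beta / beta.sum()                   # [T], sums to 1
mu = beta @ dyn_agg                        # [D] attentive mean
var = beta @ (dyn_agg - mu) ** 2           # [D] attentive variance
embedding = np.concatenate([mu, np.sqrt(var + 1e-9)])  # [2D]
```

The key difference from the static baseline is that `alpha` assigns each frame its own distribution over layers, so frames dominated by, say, phonetic content and frames carrying speaker identity can draw on different layers.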
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Speaker Verification | VoxCeleb1-O Cleaned (Original) | EER (%) | 0.37 | 53 |
| Speaker Verification | VoxCeleb1 Cleaned (Extended) | EER (%) | 0.5 | 45 |
| Speaker Verification | VoxCeleb1 Hard Cleaned | EER (%) | 1.01 | 45 |