Rethinking Leveraging Pre-Trained Multi-Layer Representations for Speaker Verification

About

Recent speaker verification studies have achieved notable success by leveraging layer-wise outputs from pre-trained Transformer models. However, few have explored ways of aggregating these multi-level features beyond a static weighted average. We present Layer Attentive Pooling (LAP), a novel strategy for aggregating inter-layer representations from pre-trained speech models for speaker verification. LAP assesses the significance of each layer from multiple perspectives in a time-dynamic manner, and employs max pooling instead of averaging. Additionally, we propose a lightweight backend speaker model comprising LAP and Attentive Statistical Temporal Pooling (ASTP) to extract speaker embeddings from the pre-trained model's output. Experiments on the VoxCeleb benchmark show that our compact architecture achieves state-of-the-art performance while greatly reducing training time. We further analyze the LAP design and its dynamic weighting mechanism for capturing speaker characteristics.
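To make the contrast in the abstract concrete, here is a minimal sketch of the two aggregation styles it mentions: a static weighted average over layers, and a per-frame (time-dynamic) weighting followed by max pooling over layers. This is not the authors' implementation. The function names, the single scoring vector `score_vec` (a stand-in for one "perspective"), and the dot-product scoring are all assumptions made for illustration; the abstract does not specify how LAP computes its weights.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def static_weighted_average(layers, layer_logits):
    """Baseline: one fixed weight per layer, shared by every frame.

    layers: nested list shaped [L][T][D] (layers, frames, feature dims).
    layer_logits: L scalars (learned in practice, given here).
    """
    w = softmax(layer_logits)
    n_layers, n_frames, n_dims = len(layers), len(layers[0]), len(layers[0][0])
    return [
        [sum(w[l] * layers[l][t][d] for l in range(n_layers)) for d in range(n_dims)]
        for t in range(n_frames)
    ]


def dynamic_max_aggregate(layers, score_vec):
    """Hypothetical time-dynamic variant (an assumption, not the paper's LAP).

    For each frame, every layer is scored by a dot product with score_vec,
    the scores are softmax-normalised across layers, and the weighted layer
    features are combined with an element-wise max instead of a sum.
    """
    n_layers, n_frames, n_dims = len(layers), len(layers[0]), len(layers[0][0])
    out = []
    for t in range(n_frames):
        scores = [
            sum(score_vec[d] * layers[l][t][d] for d in range(n_dims))
            for l in range(n_layers)
        ]
        w = softmax(scores)  # layer weights for this frame only
        out.append(
            [max(w[l] * layers[l][t][d] for l in range(n_layers)) for d in range(n_dims)]
        )
    return out
```

In the static version the layer weights are the same for every frame, while in the dynamic sketch they are recomputed per frame from the features themselves, which is the property the abstract describes as time-dynamic.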

Jin Sob Kim, Hyun Joon Park, Wooseok Shin, Sung Won Han • 2025

Related benchmarks

Task                  Dataset                         Metric   Result  Rank
Speaker Verification  VoxCeleb1-O Cleaned (Original)  EER (%)  0.37    53
Speaker Verification  VoxCeleb1 Cleaned (Extended)    EER (%)  0.5     45
Speaker Verification  VoxCeleb1 Hard Cleaned          EER      0.0101  45
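For readers unfamiliar with the metric in the table, the equal error rate (EER) is the operating point at which the false acceptance rate and the false rejection rate are equal. The following is a minimal sketch of one common way to estimate it from trial scores, assuming higher scores mean "same speaker". The function name and the threshold sweep are illustrative choices, not the evaluation code behind the reported numbers.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER as a fraction in [0, 1].

    target_scores: scores for same-speaker trials.
    nontarget_scores: scores for different-speaker trials.
    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for th in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < th for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

The sketch returns a fraction; multiply by 100 to compare with values labelled "EER (%)". With finite trial lists the two error rates rarely cross exactly, so the function reports the midpoint at the closest threshold.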
