DAME: Duration-Aware Matryoshka Embedding for Duration-Robust Speaker Verification
About
Short-utterance speaker verification remains challenging due to limited speaker-discriminative cues in short speech segments. While existing methods focus on enhancing speaker encoders, the embedding learning strategy still forces a single fixed-dimensional representation reused for utterances of any length, leaving capacity misaligned with the information available at different durations. We propose Duration-Aware Matryoshka Embedding (DAME), a model-agnostic framework that builds a nested hierarchy of sub-embeddings aligned to utterance durations: lower-dimensional representations capture compact speaker traits from short utterances, while higher dimensions encode richer details from longer speech. DAME supports both training from scratch and fine-tuning, and serves as a direct alternative to conventional large-margin fine-tuning, consistently improving performance across durations. On the VoxCeleb1-O/E/H and VOiCES evaluation sets, DAME consistently reduces the equal error rate on 1-s and other short-duration trials, while maintaining full-length performance with no additional inference cost. These gains generalize across various speaker encoder architectures under both general training and fine-tuning setups.
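The core idea above is that the first d coordinates of the full embedding form a valid d-dimensional speaker embedding in their own right, so short utterances can be scored with a compact prefix while long utterances use the full vector. A minimal scoring sketch of this Matryoshka-style nesting, with hypothetical prefix dimensions (the actual dimensions and loss used by DAME are not specified in the text above), might look like:

```python
import numpy as np

# Assumed nested prefix dimensions for illustration only.
NESTED_DIMS = [64, 128, 256, 512]

def l2_normalize(x, eps=1e-12):
    """Unit-normalize a vector for cosine scoring."""
    return x / (np.linalg.norm(x) + eps)

def nested_scores(emb_a, emb_b, dims=NESTED_DIMS):
    """Cosine similarity between each nested prefix of two full embeddings.

    In a Matryoshka-style scheme, emb[:d] is itself a usable d-dimensional
    speaker embedding, so a verification system can pick the prefix size
    matched to the utterance duration with no extra inference cost.
    """
    return {
        d: float(np.dot(l2_normalize(emb_a[:d]), l2_normalize(emb_b[:d])))
        for d in dims
    }

rng = np.random.default_rng(0)
full = rng.standard_normal(512)                 # stand-in "enrollment" embedding
same = full + 0.1 * rng.standard_normal(512)    # same speaker, mild perturbation
scores = nested_scores(full, same)
```

During training, a duration-aware objective would apply the speaker-classification loss to each prefix, weighting lower-dimensional prefixes on shorter crops; the sketch above only shows the inference-side property that makes this free at test time.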
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Speaker Verification | VoxCeleb1 (Vox1-O) | -- | 33 |
| Speaker Verification | VOiCES (s-avg) | EER: 10.25 | 30 |
| Speaker Verification | VOiCES 5s-1s | EER: 15.53 | 30 |
| Speaker Verification | VOiCES f-f | EER: 0.0418 | 30 |
| Speaker Verification | VoxCeleb1 (Vox1-H) | -- | 20 |
| Speaker Verification | VoxCeleb-E | EER (f-f): 0.97 | 15 |
| Speaker Verification | VoxCeleb Extended 1 | EER (f-f): 1.04 | 15 |
| Speaker Verification | VoxCeleb Hard 1 | EER (f-f): 1.98 | 15 |