DAME: Duration-Aware Matryoshka Embedding for Duration-Robust Speaker Verification
About
Short-utterance speaker verification remains challenging due to limited speaker-discriminative cues in short speech segments. While existing methods focus on enhancing speaker encoders, the embedding learning strategy still forces a single fixed-dimensional representation reused for utterances of any length, leaving capacity misaligned with the information available at different durations. We propose Duration-Aware Matryoshka Embedding (DAME), a model-agnostic framework that builds a nested hierarchy of sub-embeddings aligned to utterance durations: lower-dimensional representations capture compact speaker traits from short utterances, while higher dimensions encode richer details from longer speech. DAME supports both training from scratch and fine-tuning, and serves as a direct alternative to conventional large-margin fine-tuning, consistently improving performance across durations. On the VoxCeleb1-O/E/H and VOiCES evaluation sets, DAME consistently reduces the equal error rate on 1-s and other short-duration trials, while maintaining full-length performance with no additional inference cost. These gains generalize across various speaker encoder architectures under both general training and fine-tuning setups.
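The core idea above is that the first d coordinates of the full embedding form a valid d-dimensional speaker embedding in their own right, so short utterances can be scored with a compact prefix while long utterances use the full vector. A minimal scoring sketch of this Matryoshka-style nesting, with hypothetical prefix dimensions (the actual dimensions and loss used by DAME are not specified in the text above), might look like:

```python
import numpy as np

# Assumed nested prefix dimensions for illustration only.
NESTED_DIMS = [64, 128, 256, 512]

def l2_normalize(x, eps=1e-12):
    """Unit-normalize a vector for cosine scoring."""
    return x / (np.linalg.norm(x) + eps)

def nested_scores(emb_a, emb_b, dims=NESTED_DIMS):
    """Cosine similarity between each nested prefix of two full embeddings.

    In a Matryoshka-style scheme, emb[:d] is itself a usable d-dimensional
    speaker embedding, so a verification system can pick the prefix size
    matched to the utterance duration with no extra inference cost.
    """
    return {
        d: float(np.dot(l2_normalize(emb_a[:d]), l2_normalize(emb_b[:d])))
        for d in dims
    }

rng = np.random.default_rng(0)
full = rng.standard_normal(512)                 # stand-in "enrollment" embedding
same = full + 0.1 * rng.standard_normal(512)    # same speaker, mild perturbation
scores = nested_scores(full, same)
```

During training, a duration-aware objective would apply the speaker-classification loss to each prefix, weighting lower-dimensional prefixes on shorter crops; the sketch above only shows the inference-side property that makes this free at test time.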
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Speaker Verification | VoxCeleb1 (Vox1-O) | -- | 33 |
| Speaker Verification | VOiCES (s-avg) | EER: 10.25 | 30 |
| Speaker Verification | VOiCES 5s-1s | EER: 15.53 | 30 |
| Speaker Verification | VOiCES f-f | EER: 0.0418 | 30 |
| Speaker Verification | VoxCeleb1 (Vox1-H) | -- | 20 |
| Speaker Verification | VoxCeleb-E | EER (f-f): 0.97 | 15 |
| Speaker Verification | VoxCeleb Extended 1 | EER (f-f): 1.04 | 15 |
| Speaker Verification | VoxCeleb Hard 1 | EER (f-f): 1.98 | 15 |