
Learnable PINs: Cross-Modal Embeddings for Person Identity

About

We propose and investigate an identity-sensitive joint embedding of face and voice. Such an embedding enables cross-modal retrieval from voice to face and from face to voice. We make the following four contributions: first, we show that the embedding can be learnt from videos of talking faces, without requiring any identity labels, using a form of cross-modal self-supervision; second, we develop a curriculum learning schedule for hard negative mining tailored to this task, which is essential for learning to proceed successfully; third, we demonstrate and evaluate cross-modal retrieval for identities unseen and unheard during training over a number of scenarios, and establish a benchmark for this novel task; finally, we show an application of the joint embedding to automatically retrieving and labelling characters in TV dramas.
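To make the retrieval and hard-negative ideas in the abstract concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the use of cosine similarity, and the `tau` difficulty parameter are assumptions chosen for illustration, and the embeddings are taken as given rather than learnt.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def retrieve(query_emb, gallery_emb, k=1):
    """Return, for each query row, the indices of the k most similar gallery rows.

    Works in either direction (face -> voice or voice -> face), since both
    modalities are assumed to live in the same embedding space.
    """
    sims = l2_normalize(query_emb) @ l2_normalize(gallery_emb).T
    return np.argsort(-sims, axis=1)[:, :k]


def select_negatives(face_emb, voice_emb, tau):
    """Pick one negative voice per face at a chosen difficulty (hypothetical scheme).

    Row i of face_emb and row i of voice_emb are assumed to come from the same
    video, so they form the positive pair; every other voice is a negative.
    tau = 0.0 picks the easiest (farthest) negative, tau = 1.0 the hardest
    (closest). A curriculum would raise tau gradually during training.
    """
    dist = 1.0 - l2_normalize(face_emb) @ l2_normalize(voice_emb).T
    n = dist.shape[0]
    if n < 2:
        raise ValueError("need at least two pairs to form a negative")
    np.fill_diagonal(dist, np.inf)        # push the positive to the end of the ranking
    order = np.argsort(dist, axis=1)      # columns 0..n-2 are negatives, hardest first
    rank = int(round((1.0 - tau) * (n - 2)))
    return order[:, rank]
```

Raising `tau` step by step from 0 towards 1 is one simple way to realise a curriculum: training starts with easy negatives and only later sees the ones closest to the positive.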

Arsha Nagrani, Samuel Albanie, Andrew Zisserman • 2018

Related benchmarks

Task                                  Dataset                      Result     Rank
Cross-modal verification              VoxCeleb1 (Unseen-Unheard)   AUC 78.5   13
Cross-modal verification              VoxCeleb1 (Seen-Heard)       AUC 0.87   9
Face-voice cross-modal verification   VOX1 (test)                  AUC 84.7   6
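The results above are reported as AUC (area under the ROC curve) for verification, i.e. deciding whether a face and a voice belong to the same person. As a rough sketch of how such a number can be computed from pair scores, the following uses the pairwise-ranking definition of AUC. The function name and inputs are hypothetical, and the value it returns is a fraction in [0, 1], whereas the listed results appear to mix a percentage scale (78.5, 84.7) with a fractional one (0.87).

```python
import numpy as np


def verification_auc(scores, labels):
    """AUC as the probability that a matching pair outscores a non-matching pair.

    scores: similarity score for each face-voice pair (higher = more similar).
    labels: 1/True where the pair shares an identity, 0/False otherwise.
    Ties between a positive and a negative score count as half.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative pair")
    # Compare every positive score against every negative score.
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())
```

The all-pairs comparison uses memory proportional to the number of positives times the number of negatives, so it suits small evaluation sets; a rank-based formulation would be preferable at scale.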
