
Self-supervised learning of a facial attribute embedding from video

About

We propose a self-supervised framework for learning facial attributes by simply watching videos of a human face speaking, laughing, and moving over time. To perform this task, we introduce a network, Facial Attributes-Net (FAb-Net), that is trained to embed multiple frames from the same video face-track into a common low-dimensional space. With this approach, we make three contributions: first, we show that the network can leverage information from multiple source frames by predicting confidence/attention masks for each frame; second, we demonstrate that using a curriculum learning regime improves the learned embedding; finally, we demonstrate that the network learns a meaningful face embedding that encodes information about head pose, facial landmarks and facial expression, i.e. facial attributes, without having been supervised with any labelled data. We are comparable or superior to state-of-the-art self-supervised methods on these tasks and approach the performance of supervised methods.
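The first contribution, combining information from several source frames through per-frame confidence/attention masks, can be illustrated with a minimal sketch. It is one plausible reading of the abstract, not the authors' implementation. It assumes that each source frame yields a candidate value per pixel together with a raw confidence score, and that the scores are normalised with a softmax across frames. The function name `combine_with_confidence` and the flat per-pixel list layout are hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence


def combine_with_confidence(
    candidates: Sequence[Sequence[float]],
    confidences: Sequence[Sequence[float]],
) -> List[float]:
    """Blend per-frame candidate values using per-pixel confidence scores.

    candidates[i][p]  : value proposed by source frame i at pixel p
    confidences[i][p] : raw (unnormalised) confidence of frame i at pixel p

    Assumption: confidences are turned into weights with a softmax taken
    across source frames, independently at every pixel.
    """
    n_pixels = len(candidates[0])
    combined = []
    for p in range(n_pixels):
        scores = [conf[p] for conf in confidences]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]  # weights sum to 1 per pixel
        combined.append(
            sum(w * cand[p] for w, cand in zip(weights, candidates))
        )
    return combined


# Two source frames, two pixels (toy values).
frames = [[0.0, 1.0], [1.0, 0.0]]
scores = [[0.0, 0.0], [0.0, 50.0]]
blended = combine_with_confidence(frames, scores)
```

In this toy input, both frames are equally confident at the first pixel, so the result there is their average. At the second pixel the second frame has a much higher score, so the result is dominated by that frame's value.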

Olivia Wiles, A. Sophia Koepke, Andrew Zisserman • 2018

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Facial Expression Recognition | FER 2013 (test) | Accuracy Rate | 46.98 | 61 |
| Facial Action Unit Detection | DISFA | F1 (AU 1) | 15.5 | 47 |
| Landmark Prediction | MAFL (test) | Mean Error (%) | 3.44 | 38 |
| Facial Landmark Detection | MAFL (test) | Normalised MSE (%) | 3.44 | 30 |
| Landmark Regression | MAFL (test) | MSE (%) | 3.44 | 28 |
| Expression Classification | AffectNet (val) | Average Accuracy | 76.4 | 20 |
| Facial Expression Recognition | RAF-DB 1.0 (test) | Accuracy | 66.72 | 18 |
| Landmark Prediction | 300-W (test) | Landmark Prediction Error | 5.71 | 12 |
| 3D Pose Estimation | AFLW (test) | MAE | 7.65 | 11 |
| Landmark Detection | MAFL (test) | Inter-ocular Distance Error (%) | 3.44 | 10 |
Showing 10 of 13 rows
