General Facial Representation Learning in a Visual-Linguistic Manner
About
How can we learn a universal facial representation that boosts all face analysis tasks? This paper takes one step toward that goal. We study the transfer performance of pre-trained models on face analysis tasks and introduce a framework, called FaRL, for general Facial Representation Learning in a visual-linguistic manner. On one hand, the framework uses a contrastive loss to learn high-level semantic meaning from image-text pairs. On the other hand, we propose to exploit low-level information simultaneously to further enhance the face representation by adding a masked image modeling objective. We pre-train on LAION-FACE, a dataset containing a large number of face image-text pairs, and evaluate the representation capability on multiple downstream tasks. We show that FaRL achieves better transfer performance than previous pre-trained models and verify its superiority in the low-data regime. More importantly, our model surpasses state-of-the-art methods on face analysis tasks including face parsing and face alignment.
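The two pre-training objectives above can be sketched in a minimal NumPy form: a symmetric image-text contrastive (InfoNCE) loss over paired embeddings, plus a masked-image-modeling cross-entropy computed only at masked patch positions. This is an illustrative sketch under assumed shapes, not the FaRL implementation (which uses transformer encoders); the function names, temperature value, and tensor layouts here are assumptions.

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss: matched image-text pairs sit on the diagonal."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # cosine similarities, scaled
    labels = np.arange(len(img))                # i-th image matches i-th text

    def xent(l):
        # numerically stable cross-entropy against the diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average of image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))

def mim_loss(pred_logits, target_ids, mask):
    """Cross-entropy over a discrete patch vocabulary, masked positions only.

    pred_logits: (batch, num_patches, vocab), target_ids: (batch, num_patches),
    mask: (batch, num_patches) with 1.0 where a patch was masked.
    """
    p = pred_logits - pred_logits.max(axis=-1, keepdims=True)
    logp = p - np.log(np.exp(p).sum(axis=-1, keepdims=True))
    per_token = -np.take_along_axis(logp, target_ids[..., None], axis=-1)[..., 0]
    return (per_token * mask).sum() / mask.sum()
```

During pre-training the two terms would simply be summed (possibly with a weighting factor, a detail not specified here), so the encoder is pushed to capture both high-level semantics and low-level appearance.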
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Facial Expression Recognition | RAF-DB (test) | Accuracy | 88.31 | 180 |
| Facial Landmark Detection | 300-W (Common) | -- | -- | 180 |
| Facial Landmark Detection | 300-W (Fullset) | Mean Error (%) | 2.93 | 174 |
| Facial Attribute Classification | CelebA | Accuracy | 91.88 | 163 |
| Facial Landmark Detection | 300W (Challenging) | -- | -- | 159 |
| Face Alignment | WFLW (test) | NME (%) | 4.03 | 144 |
| Facial Landmark Detection | WFLW (test) | Mean Error (ME), All | 3.99 | 122 |
| Facial Expression Recognition | AffectNet 7-way (test) | Accuracy | 64.85 | 91 |
| Facial Attribute Classification | CelebA (test) | Average Accuracy | 91.88 | 89 |
| Face Alignment | 300W Fullset (test) | NME | 3.08 | 82 |