
Weakly-Supervised Text-driven Contrastive Learning for Facial Behavior Understanding

About

Contrastive learning has shown promising potential for learning robust representations from unlabeled data. However, constructing effective positive-negative pairs for contrastive learning on facial behavior datasets remains challenging. Such pairs inevitably encode subject-ID information, and because facial behavior datasets contain a limited number of subjects, randomly constructed pairs may push similar facial images apart. To address this issue, we propose to utilize activity descriptions, coarse-grained information provided in some datasets, which give high-level semantic information about the image sequences but are often neglected in previous studies. More specifically, we introduce a two-stage Contrastive Learning with Text-Embedded framework for Facial behavior understanding (CLEF). The first stage is a weakly-supervised contrastive learning method that learns representations from positive-negative pairs constructed using coarse-grained activity information. The second stage trains the recognition of facial expressions or facial action units by maximizing the similarity between an image and its corresponding text label names. The proposed CLEF achieves state-of-the-art performance on three in-the-lab datasets for AU recognition and three in-the-wild datasets for facial expression recognition.
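
The two stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the loss, so the first function uses a generic supervised-contrastive-style loss in which samples sharing an activity label are treated as positives, and the second function only shows similarity-based label selection at inference time rather than the stage-two training itself. The names `activity_contrastive_loss`, `predict_label`, and `temperature`, and the toy 2-D embeddings, are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def activity_contrastive_loss(embeddings, activity_ids, temperature=0.1):
    """Supervised-contrastive-style loss (an assumed stand-in, not the paper's loss).

    Samples that share an activity id are positives for each other;
    every other sample in the batch acts as a negative.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i and activity_ids[j] == activity_ids[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        logits = {j: cosine(embeddings[i], embeddings[j]) / temperature
                  for j in range(n) if j != i}
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits.values())
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def predict_label(image_embedding, label_text_embeddings):
    """Pick the label name whose (hypothetical) text embedding is most similar."""
    return max(label_text_embeddings,
               key=lambda name: cosine(image_embedding, label_text_embeddings[name]))


# Toy 2-D embeddings: the first two samples are close, as are the last two.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned_loss = activity_contrastive_loss(toy_embeddings, [0, 0, 1, 1])
mismatched_loss = activity_contrastive_loss(toy_embeddings, [0, 1, 0, 1])
```

In this toy setup the loss is lower when the activity labels agree with the geometry of the embeddings (`aligned_loss`) than when they cut across it (`mismatched_loss`), which is the behavior a pair-construction scheme based on activity labels would rely on.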

Xiang Zhang, Taoyue Wang, Xiaotian Li, Huiyuan Yang, Lijun Yin • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Facial Expression Recognition | RAF-DB (test) | Accuracy | 90.09 | 180 |
| Facial Expression Recognition | AffectNet 7-way (test) | Accuracy | 65.66 | 91 |
| Facial Expression Recognition | AffectNet 8-way (test) | Accuracy | 62.77 | 65 |
| Facial Expression Recognition | RAF-DB | Accuracy | 90.09 | 53 |
| Facial Action Unit Detection | DISFA | F1 (AU 1) | 64.3 | 47 |
| Facial Action Unit Detection | DISFA (test) | Avg AU Score | 64.8 | 45 |
| Action Unit Detection | BP4D | Average F1 Score | 65.9 | 43 |
| Action Unit Detection | DISFA | -- | -- | 21 |
| Action Unit Detection | BP4D (5-fold cross-val) | Average Performance | 65.9 | 14 |
| Facial Expression Recognition | FERPlus basic emotion (test) | Top-1 Accuracy | 89.74 | 12 |
Showing 10 of 12 rows
