
Enhancing Zero-Shot Facial Expression Recognition by LLM Knowledge Transfer

About

Current facial expression recognition (FER) models are typically designed in a supervised learning manner and are thus constrained by the lack of large-scale facial expression images with high-quality annotations. Consequently, these models often fail to generalize well, performing poorly on unseen images at inference time. Vision-language-based zero-shot models show promising potential for addressing such challenges. However, these models lack task-specific knowledge and are therefore not optimized for the nuances of recognizing facial expressions. To bridge this gap, this work proposes a novel method, Exp-CLIP, to enhance zero-shot FER by transferring task knowledge from large language models (LLMs). Specifically, on top of the pre-trained vision-language encoders, we incorporate a projection head designed to map the initial joint vision-language space into a space that captures representations of facial actions. To train this projection head for subsequent zero-shot predictions, we propose to align the projected visual representations with task-specific semantic meanings derived from the LLM encoder, and a text instruction-based strategy is employed to customize the LLM knowledge. With only unlabelled facial data and efficient training of the projection head, Exp-CLIP achieves zero-shot results superior to the CLIP models and several other large vision-language models (LVLMs) on seven in-the-wild FER datasets. The code and pre-trained models are available at https://github.com/zengqunzhao/Exp-CLIP.
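The core mechanism described above can be sketched in PyTorch: a lightweight projection head sits on top of a frozen CLIP image encoder, and is trained to align projected visual features with LLM-encoded task semantics via a contrastive objective. This is a minimal illustration under stated assumptions, not the paper's implementation; the module name `ProjectionHead`, the embedding dimensions, and the InfoNCE-style loss are hypothetical stand-ins for the actual training objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionHead(nn.Module):
    """Hypothetical sketch: maps CLIP's joint vision-language embedding
    space into a task-specific space intended to capture facial actions."""

    def __init__(self, in_dim: int = 512, out_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so that dot products act as cosine similarities.
        return F.normalize(self.proj(x), dim=-1)


def alignment_loss(proj_visual: torch.Tensor,
                   llm_text: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style alignment between projected visual features and
    LLM-derived text embeddings for the same samples (an assumption;
    the paper's exact objective may differ)."""
    logits = proj_visual @ llm_text.t() / temperature
    targets = torch.arange(proj_visual.size(0))
    return F.cross_entropy(logits, targets)


# Usage sketch: only the projection head is trained; the CLIP and LLM
# encoders that would produce these features stay frozen.
head = ProjectionHead()
visual = head(torch.randn(8, 512))                 # frozen-CLIP image features
text = F.normalize(torch.randn(8, 512), dim=-1)    # frozen-LLM text features
loss = alignment_loss(visual, text)
```

At inference, zero-shot prediction would follow the standard CLIP recipe: project an image feature with the trained head and pick the expression-class text embedding with the highest cosine similarity.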

Zengqun Zhao, Yu Cao, Shaogang Gong, Ioannis Patras• 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|------|---------|--------|--------|------|
| Facial Expression Recognition | AffectNet (7 classes) | Accuracy | 44.27 | 23 |
| Facial Expression Recognition | BioVid | WAR | 70.2 | 13 |
| Facial Expression Recognition | BAH | WAR | 62.2 | 13 |
| Facial Expression Recognition | StressID | WAR | 63.1 | 13 |
| Facial Expression Recognition | BioVid Part A (target) | WAR | 70.2 | 12 |
| Facial Expression Recognition | BAH (target) | WAR | 62.2 | 12 |
| Facial Expression Recognition | StressID (target) | WAR | 63.1 | 12 |

WAR = Weighted Accuracy Rate.
