
Temporal Multimodal Fusion for Video Emotion Classification in the Wild

About

This paper addresses the problem of emotion classification: predicting which emotion labels (from a fixed set of possible labels) best describe the emotions contained in short video clips. Building on a standard framework -- describing videos by audio and visual features fed to a supervised classifier that infers the labels -- the paper investigates several novel directions. First, improved face descriptors based on 2D and 3D Convolutional Neural Networks are proposed. Second, the paper explores several fusion methods, temporal and multimodal, including a novel hierarchical method combining features and scores. In addition, we carefully reviewed the different stages of the pipeline and designed a CNN architecture adapted to the task; this is important because the training set is small relative to the difficulty of the problem, making generalization difficult. The resulting model ranked 4th at the 2017 Emotion in the Wild challenge with an accuracy of 58.8%.
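The abstract contrasts two fusion regimes that a hierarchical method can combine: early fusion of modality features and late fusion of classifier scores. Below is a minimal sketch of both, assuming per-clip feature vectors and per-model class-score vectors; the function names, weights, and 7-class setup are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def feature_fusion(audio_feat, visual_feat):
    """Early fusion: concatenate per-clip modality features into one vector."""
    return np.concatenate([audio_feat, visual_feat], axis=-1)

def score_fusion(scores, weights=None):
    """Late fusion: weighted average of per-model class-score vectors."""
    scores = np.stack(scores)  # shape (n_models, n_classes)
    if weights is None:
        weights = np.full(len(scores), 1.0 / len(scores))
    return np.average(scores, axis=0, weights=weights)

# Toy example: two modality-specific models over 7 emotion classes
# (7 is the usual EmotiW label set; the scores here are random placeholders).
rng = np.random.default_rng(0)
audio_scores = rng.random(7)
visual_scores = rng.random(7)
fused = score_fusion([audio_scores, visual_scores], weights=[0.4, 0.6])
predicted = int(np.argmax(fused))
```

A hierarchical scheme, as the abstract describes it, would feed both the concatenated features and the per-model scores into a final classifier rather than choosing one level of fusion.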

Valentin Vielzeuf, Stéphane Pateux, Frédéric Jurie · 2017

Related benchmarks

Task | Dataset | Metric | Result | Rank
Facial Expression Recognition | AFEW 8.0 (test) | Accuracy | 48.6% | 20
Video-based Facial Expression Recognition | AFEW 8.0 (val) | Accuracy | 48.6% | 12
Emotion Recognition | AFEW 9 (val) | Accuracy | 48.6% | 8
