Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning
About
Accurate emotion perception is crucial for various applications, including human-computer interaction, education, and counseling. However, traditional single-modality approaches often fail to capture the complexity of real-world emotional expressions, which are inherently multimodal. Moreover, existing Multimodal Large Language Models (MLLMs) face challenges in integrating audio and recognizing subtle facial micro-expressions. To address this, we introduce the MERR dataset, containing 28,618 coarse-grained and 4,487 fine-grained annotated samples across diverse emotional categories. This dataset enables models to learn from varied scenarios and generalize to real-world applications. Furthermore, we propose Emotion-LLaMA, a model that seamlessly integrates audio, visual, and textual inputs through emotion-specific encoders. By aligning features into a shared space and employing a modified LLaMA model with instruction tuning, Emotion-LLaMA significantly enhances both emotional recognition and reasoning capabilities. Extensive evaluations show Emotion-LLaMA outperforms other MLLMs, achieving top scores in Clue Overlap (7.83) and Label Overlap (6.25) on EMER, an F1 score of 0.9036 on MER2023-SEMI challenge, and the highest UAR (45.59) and WAR (59.37) in zero-shot evaluations on DFEW dataset.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Multimodal Sentiment Analysis | MOSEI | -- | 168 | |
| Emotion Recognition | IEMOCAP | -- | 115 | |
| Multimodal Sentiment Analysis | CH-SIMS (test) | F1 Score75.4 | 108 | |
| Dynamic Facial Expression Recognition | DFEW | WAR77.06 | 47 | |
| Multimodal Emotion Recognition in Conversation | MELD | Weighted Avg F1 Score46.76 | 36 | |
| Multimodal Emotion Recognition | MER 2023 | F1 Score90.36 | 30 | |
| Textual Emotion Analysis | DFEC 2000 (test) | VideoChatGPT Score (Overall)7.42 | 28 | |
| Emotion Cognition and Reasoning | HitEmotion ECR level 1.0 (test) | EER42.81 | 23 | |
| Emotion Understanding and Analysis | HitEmotion | DPTM (MF)39.54 | 23 | |
| Emotion Perception and Recognition | HitEmotion Level 1 | FESD33.11 | 23 |