Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning
About
Accurate emotion perception is crucial for various applications, including human-computer interaction, education, and counseling. However, traditional single-modality approaches often fail to capture the complexity of real-world emotional expressions, which are inherently multimodal. Moreover, existing Multimodal Large Language Models (MLLMs) face challenges in integrating audio and recognizing subtle facial micro-expressions. To address this, we introduce the MERR dataset, containing 28,618 coarse-grained and 4,487 fine-grained annotated samples across diverse emotional categories. This dataset enables models to learn from varied scenarios and generalize to real-world applications. Furthermore, we propose Emotion-LLaMA, a model that seamlessly integrates audio, visual, and textual inputs through emotion-specific encoders. By aligning features into a shared space and employing a modified LLaMA model with instruction tuning, Emotion-LLaMA significantly enhances both emotional recognition and reasoning capabilities. Extensive evaluations show that Emotion-LLaMA outperforms other MLLMs, achieving top scores in Clue Overlap (7.83) and Label Overlap (6.25) on EMER, an F1 score of 0.9036 on the MER2023-SEMI challenge, and the highest UAR (45.59) and WAR (59.37) in zero-shot evaluations on the DFEW dataset.
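The UAR and WAR figures quoted for DFEW follow the usual definitions: unweighted average recall is the mean of per-class recalls, and weighted average recall is per-class recall weighted by class frequency, which equals overall accuracy. The sketch below is a minimal illustration of those two definitions, not the paper's evaluation code; the function name `uar_war` and the toy labels are assumptions made for the example.

```python
from collections import Counter


def uar_war(y_true, y_pred):
    """Return (UAR, WAR) in percent for two equal-length label sequences.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    # Number of samples per ground-truth class.
    total = Counter(y_true)
    # Number of correctly predicted samples per ground-truth class.
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    # UAR: every class counts equally, regardless of how many samples it has.
    recalls = [correct[c] / total[c] for c in total]
    uar = 100.0 * sum(recalls) / len(recalls)

    # WAR: recall weighted by class frequency, i.e. overall accuracy.
    war = 100.0 * sum(correct.values()) / len(y_true)
    return uar, war


# Toy, made-up labels: the majority class is always right, the minority never.
y_true = ["happy", "happy", "happy", "sad"]
y_pred = ["happy", "happy", "happy", "angry"]
print(uar_war(y_true, y_pred))  # (50.0, 75.0)
```

On imbalanced data the two metrics diverge, as in the toy example: a model that does well only on the frequent classes keeps a high WAR while its UAR drops.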
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Sentiment Analysis | CMU-MOSI (test) | -- | -- | 385 |
| Multimodal Sentiment Analysis | MOSEI | -- | -- | 183 |
| Emotion Recognition | IEMOCAP | -- | -- | 151 |
| Multimodal Sentiment Analysis | CH-SIMS (test) | F1 Score | 75.4 | 108 |
| Sentiment Analysis | CMU-MOSEI (test) | -- | -- | 96 |
| Emotion Recognition | MELD (test) | Weighted F1 | 46.76 | 89 |
| Emotion Classification | IEMOCAP (test) | Weighted-F1 | 55.47 | 61 |
| Dynamic Facial Expression Recognition | DFEW | WAR | 77.06 | 47 |
| Multimodal Emotion Recognition in Conversation | MELD | Weighted Avg F1 Score | 46.76 | 36 |
| Emotion Recognition | MER-UniBench (test) | MER23 | 59.38 | 35 |