Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis
About
Representation learning is a significant and challenging task in multimodal learning. Effective modality representations should capture two kinds of characteristics: consistency and difference. Because available annotations are unified multimodal labels, existing methods are restricted in capturing differentiated information. However, additional unimodal annotations are costly in time and labor. In this paper, we design a label generation module based on a self-supervised learning strategy to acquire independent unimodal supervisions. We then jointly train the multimodal and unimodal tasks to learn the consistency and the difference, respectively. Moreover, during the training stage, we design a weight-adjustment strategy to balance the learning progress among different subtasks, that is, to guide the subtasks to focus on samples with a larger difference between modality supervisions. Finally, we conduct extensive experiments on three public multimodal baseline datasets. The experimental results validate the reliability and stability of the auto-generated unimodal supervisions. On the MOSI and MOSEI datasets, our method surpasses the current state-of-the-art methods. On the SIMS dataset, our method achieves performance comparable to that obtained with human-annotated unimodal labels. The full code is available at https://github.com/thuiar/Self-MM.
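The weight-adjustment idea described above can be illustrated with a small sketch. It is not taken from the released code: the function names are hypothetical, and the `tanh` of the absolute label gap is an assumed weighting form chosen for illustration. The abstract only states that subtasks should focus on samples whose unimodal and multimodal supervisions differ more.

```python
import math


def sample_weights(unimodal_labels, multimodal_labels):
    """Assumed weighting: a larger gap between the generated unimodal
    label and the multimodal label gives a larger weight, bounded in [0, 1)."""
    return [
        math.tanh(abs(u - m))
        for u, m in zip(unimodal_labels, multimodal_labels)
    ]


def weighted_l1_loss(predictions, targets, weights):
    """Mean over samples of the per-sample absolute error times its weight."""
    if not predictions:
        return 0.0
    total = sum(
        w * abs(p - t) for p, t, w in zip(predictions, targets, weights)
    )
    return total / len(predictions)


# Hypothetical labels for two samples: the first has identical unimodal and
# multimodal supervision, the second differs by 1.0.
unimodal = [1.0, 0.5]
multimodal = [1.0, -0.5]
weights = sample_weights(unimodal, multimodal)
loss = weighted_l1_loss([0.0, 0.0], unimodal, weights)
```

Under this assumed form, a sample whose unimodal label equals the multimodal label contributes nothing to the unimodal loss, so the unimodal subtask spends its capacity on samples where the modalities disagree.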
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Sentiment Analysis | CMU-MOSI (test) | F1 | 85.95 | 238 |
| Multimodal Sentiment Analysis | CMU-MOSEI (test) | F1 Score | 85.2 | 206 |
| Multimodal Sentiment Analysis | CMU-MOSI | MAE | 0.712 | 59 |
| Multimodal Sentiment Analysis | MOSEI (test) | MAE | 0.529 | 49 |
| Emotion Recognition | IEMOCAP (test) | Score (l) | 0.687 | 36 |
| Multimodal Sentiment Analysis | MOSI (test) | MAE | 0.712 | 34 |
| Multimodal Sentiment Analysis | CH-SIMS V2 | Accuracy (2-Class) | 78.7 | 29 |
| Emotion Recognition (ER) Valence and Arousal Regression | EMER (test) | Arousal MAE | 0.244 | 26 |
| Multimodal Sentiment Analysis | SIMS (test) | MAE | 0.458 | 22 |
| Multimodal Sentiment Analysis | CMU-MOSEI segments (test) | ACC2 | 85.3 | 22 |