Learning Language-guided Adaptive Hyper-modality Representation for Multimodal Sentiment Analysis
About
Though Multimodal Sentiment Analysis (MSA) proves effective by utilizing rich information from multiple sources (e.g., language, video, and audio), potentially sentiment-irrelevant and conflicting information across modalities may prevent performance from improving further. To alleviate this, we present the Adaptive Language-guided Multimodal Transformer (ALMT), which incorporates an Adaptive Hyper-modality Learning (AHL) module to learn an irrelevance/conflict-suppressing representation from visual and audio features under the guidance of language features at different scales. With the obtained hyper-modality representation, the model can derive a complementary joint representation through multimodal fusion for effective MSA. In practice, ALMT achieves state-of-the-art performance on several popular datasets (e.g., MOSI, MOSEI, and CH-SIMS), and extensive ablation studies demonstrate the validity and necessity of our irrelevance/conflict suppression mechanism.
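The core idea of AHL — language features acting as queries that select sentiment-relevant audio/visual content — can be illustrated with plain cross-attention. The sketch below is a minimal, hypothetical NumPy illustration of that idea, not the paper's implementation: shapes, the single-head attention, and the simple sum over modalities are all assumptions for clarity (ALMT itself operates at multiple language scales inside a Transformer).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key, value):
    # Scaled dot-product attention: `query` rows attend over `key`/`value` rows.
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)        # (T_q, T_kv)
    return softmax(scores, axis=-1) @ value    # (T_q, d)

# Toy features (shapes are illustrative): 4 language tokens,
# 6 audio frames, 5 video frames, feature dim 8.
rng = np.random.default_rng(0)
d = 8
lang  = rng.standard_normal((4, d))
audio = rng.standard_normal((6, d))
video = rng.standard_normal((5, d))

# Language-guided hyper-modality representation: language queries weight
# audio/visual frames, so frames irrelevant to the spoken content receive
# low attention weight and are effectively suppressed.
hyper = cross_attention(lang, audio, audio) + cross_attention(lang, video, video)
print(hyper.shape)  # (4, 8)
```

Note how the hyper-modality representation inherits the language sequence length: the audio and video streams are resampled onto the language tokens, which is what lets the later fusion stage treat them as aligned.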
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Sentiment Analysis | MOSEI | MAE | 0.55 | 168 |
| Multimodal Sentiment Analysis | CMU-MOSI | -- | -- | 144 |
| Multimodal Sentiment Analysis | MOSI | MAE | 0.721 | 132 |
| Multimodal Sentiment Analysis | CH-SIMS (test) | F1 Score | 81.57 | 108 |
| Multimodal Sentiment Analysis | SIMS (test) | Accuracy (2-Class) | 81.91 | 78 |
| Multimodal Sentiment Analysis | MOSEI (test) | MAE | 0.526 | 49 |
| Multimodal Sentiment Analysis | MOSI (test) | MAE | 0.683 | 34 |
| Multimodal Sentiment Analysis | CH-SIMS | F1 Score | 77.6 | 32 |
| Multimodal Sentiment Analysis | SIMS V2 | Accuracy (2-Class) | 79.59 | 17 |
| Multimodal Sentiment Analysis | SIMS | MAE | 0.408 | 10 |