Learning Language-guided Adaptive Hyper-modality Representation for Multimodal Sentiment Analysis

About

Though Multimodal Sentiment Analysis (MSA) proves effective by utilizing rich information from multiple sources (e.g., language, video, and audio), the potential sentiment-irrelevant and conflicting information across modalities may hinder the performance from being further improved. To alleviate this, we present Adaptive Language-guided Multimodal Transformer (ALMT), which incorporates an Adaptive Hyper-modality Learning (AHL) module to learn an irrelevance/conflict-suppressing representation from visual and audio features under the guidance of language features at different scales. With the obtained hyper-modality representation, the model can obtain a complementary and joint representation through multimodal fusion for effective MSA. In practice, ALMT achieves state-of-the-art performance on several popular datasets (e.g., MOSI, MOSEI and CH-SIMS) and an abundance of ablation demonstrates the validity and necessity of our irrelevance/conflict suppression mechanism.

Haoyu Zhang, Yu Wang, Guanghao Yin, Kejun Liu, Yuanyuan Liu, Tianshu Yu• 2023

Related benchmarks

Task	Dataset	Result
Multimodal Sentiment Analysis	MOSEI	MAE0.55	183
Multimodal Sentiment Analysis	CMU-MOSI	--	166
Multimodal Sentiment Analysis	MOSI	MAE0.721	132
Multimodal Sentiment Analysis	CH-SIMS (test)	F1 Score81.57	108
Multimodal Sentiment Analysis	SIMS (test)	Accuracy (2-Class)81.91	78
Binary Sentiment Classification	CMU-MOSI (test)	F1 Score84.81	76
Multimodal Sentiment Analysis	CMU-MOSEI standard (test)	Accuracy84.55	65
Multimodal Sentiment Analysis	MOSEI (test)	MAE0.526	49
Multimodal Sentiment Analysis	MOSI (test)	MAE0.683	34
Multimodal Sentiment Analysis	CH-SIMS	F1 Score77.6	32

Showing 10 of 13 rows

Other info

Follow for update

@wizwand_team Discord