
Temporal-Spatial Decouple before Act: Disentangled Representation Learning for Multimodal Sentiment Analysis

About

Multimodal Sentiment Analysis integrates linguistic, visual, and acoustic modalities. Mainstream approaches, whether based on modality-invariant/modality-specific factorization or on complex fusion, still rely on spatiotemporally mixed modeling. This ignores spatiotemporal heterogeneity, leading to spatiotemporal information asymmetry and thus limited performance. Hence, we propose TSDA (Temporal-Spatial Decouple before Act), which explicitly decouples each modality into temporal dynamics and spatial structural context before any interaction. For every modality, a temporal encoder and a spatial encoder project signals into separate temporal and spatial subspaces. Factor-Consistent Cross-Modal Alignment then aligns temporal features only with their temporal counterparts across modalities, and spatial features only with their spatial counterparts. Factor-specific supervision and decorrelation regularization reduce cross-factor leakage while preserving complementarity. A Gated Recouple module subsequently recouples the aligned streams for the downstream task. Extensive experiments show that TSDA outperforms baselines, and ablation studies confirm the necessity and interpretability of the design.
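The decouple-regularize-recouple pipeline described above can be sketched in a few lines. This is a minimal NumPy illustration under our own assumptions, not the paper's implementation: the two "encoders" are plain linear projections, the decorrelation regularizer is a squared cross-covariance penalty (one common choice for discouraging cross-factor leakage), and the Gated Recouple is a sigmoid gate mixing the two streams.

```python
import numpy as np

rng = np.random.default_rng(0)

def decorrelation_penalty(t, s):
    """Squared Frobenius norm of the cross-covariance between the
    temporal (t) and spatial (s) streams. Driving this toward zero
    discourages one factor from leaking into the other.
    (Assumed regularizer, not necessarily the paper's exact one.)"""
    tc = t - t.mean(axis=0)
    sc = s - s.mean(axis=0)
    cov = tc.T @ sc / (len(t) - 1)
    return float(np.sum(cov ** 2))

def gated_recouple(t, s, Wg):
    """Sigmoid gate over the concatenated streams decides, per
    feature, how much temporal vs. spatial signal to keep."""
    g = 1.0 / (1.0 + np.exp(-(np.concatenate([t, s], axis=1) @ Wg)))
    return g * t + (1.0 - g) * s

n, d, k = 8, 16, 4                       # samples, input dim, factor dim
x = rng.standard_normal((n, d))          # one modality's features
Wt = rng.standard_normal((d, k))         # "temporal encoder" (linear stand-in)
Ws = rng.standard_normal((d, k))         # "spatial encoder" (linear stand-in)
t, s = x @ Wt, x @ Ws                    # decouple into two streams
Wg = rng.standard_normal((2 * k, k))     # gate weights
fused = gated_recouple(t, s, Wg)         # recouple for the task head
print(fused.shape)                       # (8, 4)
print(decorrelation_penalty(t, s) >= 0)  # True
```

In the full model these projections would be trainable encoders per modality, with the factor-consistent alignment losses applied between the temporal (resp. spatial) streams of different modalities before recoupling.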

Chunlei Meng, Ziyang Zhou, Lucas He, Xiaojing Du, Chun Ouyang, Zhongxue Gan • 2026

Related benchmarks

Task                          | Dataset                   | Result      | Rank
Multimodal Sentiment Analysis | CMU-MOSI segments (test)  | Acc-2: 86.5 | 22
Multimodal Sentiment Analysis | CMU-MOSEI segments (test) | Acc-2: 86.4 | 22
