
Temporal-Spatial Decouple before Act: Disentangled Representation Learning for Multimodal Sentiment Analysis

About

Multimodal Sentiment Analysis integrates linguistic, visual, and acoustic modalities. Mainstream approaches, whether based on modality-invariant/modality-specific factorization or on complex fusion, still rely on spatiotemporally mixed modeling. This ignores spatiotemporal heterogeneity, leading to spatiotemporal information asymmetry and thus limited performance. Hence, we propose TSDA (Temporal-Spatial Decouple before Act), which explicitly decouples each modality into temporal dynamics and spatial structural context before any interaction. For every modality, a temporal encoder and a spatial encoder project signals into separate temporal and spatial subspaces. Factor-Consistent Cross-Modal Alignment then aligns temporal features only with their temporal counterparts across modalities, and spatial features only with their spatial counterparts. Factor-specific supervision and decorrelation regularization reduce cross-factor leakage while preserving complementarity. A Gated Recouple module subsequently recouples the aligned streams for the downstream task. Extensive experiments show that TSDA outperforms baselines, and ablation studies confirm the necessity and interpretability of the design.
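The decouple-regularize-recouple pipeline described above can be sketched in a few lines. This is a minimal NumPy illustration under our own assumptions, not the paper's implementation: the two "encoders" are plain linear projections, the decorrelation regularizer is a squared cross-covariance penalty (one common choice for discouraging cross-factor leakage), and the Gated Recouple is a sigmoid gate mixing the two streams.

```python
import numpy as np

rng = np.random.default_rng(0)

def decorrelation_penalty(t, s):
    """Squared Frobenius norm of the cross-covariance between the
    temporal (t) and spatial (s) streams. Driving this toward zero
    discourages one factor from leaking into the other.
    (Assumed regularizer, not necessarily the paper's exact one.)"""
    tc = t - t.mean(axis=0)
    sc = s - s.mean(axis=0)
    cov = tc.T @ sc / (len(t) - 1)
    return float(np.sum(cov ** 2))

def gated_recouple(t, s, Wg):
    """Sigmoid gate over the concatenated streams decides, per
    feature, how much temporal vs. spatial signal to keep."""
    g = 1.0 / (1.0 + np.exp(-(np.concatenate([t, s], axis=1) @ Wg)))
    return g * t + (1.0 - g) * s

n, d, k = 8, 16, 4                       # samples, input dim, factor dim
x = rng.standard_normal((n, d))          # one modality's features
Wt = rng.standard_normal((d, k))         # "temporal encoder" (linear stand-in)
Ws = rng.standard_normal((d, k))         # "spatial encoder" (linear stand-in)
t, s = x @ Wt, x @ Ws                    # decouple into two streams
Wg = rng.standard_normal((2 * k, k))     # gate weights
fused = gated_recouple(t, s, Wg)         # recouple for the task head
print(fused.shape)                       # (8, 4)
print(decorrelation_penalty(t, s) >= 0)  # True
```

In the full model these projections would be trainable encoders per modality, with the factor-consistent alignment losses applied between the temporal (resp. spatial) streams of different modalities before recoupling.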

Chunlei Meng, Ziyang Zhou, Lucas He, Xiaojing Du, Chun Ouyang, Zhongxue Gan • 2026

Related benchmarks

Task                          | Dataset                   | Result      | Rank
Multimodal Sentiment Analysis | CMU-MOSI segments (test)  | Acc-2: 86.5 | 22
Multimodal Sentiment Analysis | CMU-MOSEI segments (test) | Acc-2: 86.4 | 22
