Multimodal Fusion via Self-Consistent Task-Gradient Fields

About

Multimodal learning aims to preserve as much task-related information as possible from different inputs. However, current fusion designs often distort the feedback loop to feature extractors. Aggressively merging modalities entangles their representations, making the feature extractors fragile to incomplete inputs. Meanwhile, attempting to separate features via auxiliary losses frequently introduces optimization conflicts that distract from the primary task. We propose the Self-Consistent Field Autoencoder (SCFAE) to provide a better path for task gradients. Our method follows the self-consistent field principle to balance task learning with feature organization, thereby minimizing mutual information. We use small autoencoders for each modality to keep information intact. The task loss acts as a driving force to select predictive features. The reconstruction loss acts as a constraint to separate these features into independent subspaces. These dual objectives operate through complementary feature subspaces, thereby mitigating optimization interference. We evaluate SCFAE on audio-visual-text, audio-visual, and image-video benchmarks. Results show that SCFAE handles missing data and unequal input sizes more robustly via a simple structure. Gradient analysis confirms that SCFAE avoids conflicts and maintains stable training dynamics.
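The two objectives described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class `ModalityAutoencoder`, the function `combined_loss`, the `recon_weight` parameter, fusion by concatenation, linear layers, and zero-filling of a missing modality's latent are all assumptions made for illustration. It only shows one plausible way a task loss and per-modality reconstruction losses could be combined when an input is absent.

```python
import random


def init(rows, cols, rng):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(weights, x):
    """Bias-free dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


class ModalityAutoencoder:
    """Hypothetical per-modality linear autoencoder (illustrative only)."""

    def __init__(self, in_dim, latent_dim, rng):
        self.latent_dim = latent_dim
        self.enc = init(latent_dim, in_dim, rng)
        self.dec = init(in_dim, latent_dim, rng)

    def encode(self, x):
        return linear(self.enc, x)

    def reconstruct(self, z):
        return linear(self.dec, z)


def combined_loss(autoencoders, head, inputs, target, recon_weight=1.0):
    """Return (total, task, recon) for one sample.

    inputs maps modality name -> feature vector, or None if missing.
    Assumption: a missing modality contributes a zero latent and no
    reconstruction term.
    """
    fused, recon = [], 0.0
    for name, ae in autoencoders.items():
        x = inputs.get(name)
        if x is None:
            fused.extend([0.0] * ae.latent_dim)
            continue
        z = ae.encode(x)
        recon += mse(ae.reconstruct(z), x)  # constraint on each modality
        fused.extend(z)                     # assumed fusion: concatenation
    task = mse(linear(head, fused), target)  # driving objective
    return task + recon_weight * recon, task, recon


rng = random.Random(0)
aes = {
    "audio": ModalityAutoencoder(4, 2, rng),
    "visual": ModalityAutoencoder(6, 2, rng),
}
head = init(1, 4, rng)  # task head over the concatenated latents
sample = {"audio": [0.1, 0.2, 0.3, 0.4], "visual": None}  # visual missing
total, task, recon = combined_loss(aes, head, sample, [1.0])
```

In this sketch the reconstruction term only touches each modality's own encoder and decoder, while the task term flows through the shared head, which is one simple way to keep the two signals from being forced through a single entangled fusion layer.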

Jiayu Xiong, Jing Wang, Jun Xue, Wanlong Wang, Jianlong Kwan, Xiaosen Lyu, Zhouqiang Jiang • 2024

Related benchmarks

Task	Dataset	Metric	Result	Rank
Image-Video Retrieval	ActivityNet	mAP@10	36.7	22
Audio-Visual Deepfake Detection	FakeAVCeleb (test)	ACC (Audio-Visual)	97.68	16
