VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion

About

Compared to images, videos better reflect real-world acquisition and possess valuable temporal cues. However, existing multi-sensor fusion research predominantly integrates complementary context from multiple images rather than videos. Two factors explain this: the scarcity of large-scale multi-sensor video datasets, which limits research in video fusion, and the inherent difficulty of jointly modeling spatial and temporal dependencies in a unified framework. To address this, we first construct M3SVD, a benchmark dataset with 220 temporally synchronized and spatially registered infrared-visible videos comprising 153,797 frames, bridging the data gap. Second, we propose VideoFusion, a multi-modal video fusion model that exploits cross-modal complementarity and temporal dynamics to generate spatio-temporally coherent videos from multi-modal inputs. Specifically, 1) a differential reinforcement module is developed for cross-modal information interaction and enhancement, 2) a complete modality-guided fusion strategy is employed to adaptively integrate multi-modal features, and 3) a bi-temporal co-attention mechanism is devised to dynamically aggregate forward-backward temporal contexts to reinforce cross-frame feature representations. Experiments reveal that VideoFusion outperforms existing image-oriented fusion paradigms on video sequences, effectively mitigating temporal inconsistency and interference. Project and M3SVD: https://github.com/Linfeng-Tang/VideoFusion.
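To make the idea of aggregating forward-backward temporal context concrete, the following is a minimal NumPy sketch in which each feature token of the current frame attends over the tokens of the previous and next frames. It is an illustration only and not the authors' bi-temporal co-attention module: the function name `bi_temporal_attention`, the flattened `(N, C)` token layout, the scaled dot-product scoring, and the residual connection are all assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def bi_temporal_attention(prev_feat, cur_feat, next_feat):
    """Hypothetical helper (not from the paper).

    Each input is an (N, C) array of N feature tokens with C channels.
    Every current-frame token attends over all tokens of the previous
    (backward) and next (forward) frames, and the attended context is
    added back to the current features.
    """
    channels = cur_feat.shape[-1]
    # Stack backward and forward context: shape (2N, C).
    context = np.concatenate([prev_feat, next_feat], axis=0)
    # Scaled dot-product similarity between current tokens and context.
    scores = cur_feat @ context.T / np.sqrt(channels)  # (N, 2N)
    weights = softmax(scores, axis=-1)                  # rows sum to 1
    aggregated = weights @ context                      # (N, C)
    return cur_feat + aggregated, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prev_f, cur_f, next_f = (rng.standard_normal((16, 8)) for _ in range(3))
    out, w = bi_temporal_attention(prev_f, cur_f, next_f)
    print(out.shape, w.shape)
```

Because the weights are computed per token, the contribution of the forward and backward frames varies with content, which is one simple way to realize "dynamic" aggregation; the actual module may differ substantially.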

Linfeng Tang, Yeda Wang, Meiqi Gong, Zizhuo Li, Yuxin Deng, Xunpeng Yi, Chunyu Li, Han Xu, Hao Zhang, Jiayi Ma • 2025

Related benchmarks

Task	Dataset	Result
Video Fusion	VTMOT	QG 24.81	13
Multi-modal Video Fusion	M3SVD	Parameters (M) 6.743	12
Multi-modal Video Fusion	M3SVD (normal scenarios)	EN 7.199	10
Video Fusion	M3SVD degraded scenarios	EN 7.167	10
Video Fusion	HDO degraded scenarios	EN 7.288	10
Infrared and Visible Video Fusion	HDO	QMI 0.408	8
Infrared and Visible Video Fusion	M3SVD	QMI 54.15	8
Infrared and Visible Video Fusion	VTMOT	QMI 0.4018	8
Video Fusion	M3SVD	QG 38.81	3
Video Fusion	HDO	QG 0.3649	3
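As a reading aid for the EN rows, the sketch below computes the Shannon entropy of an 8-bit grayscale frame's intensity histogram and averages it over a clip. It assumes that EN here denotes the information-entropy metric commonly used in image fusion evaluation; the exact protocol behind the listed values is not stated on this page, and the function names `frame_entropy` and `video_entropy` are hypothetical.

```python
import numpy as np


def frame_entropy(gray):
    """Shannon entropy in bits of a uint8 grayscale frame's histogram.

    Assumption: this matches the usual EN definition in fusion papers;
    the benchmark's exact computation is not given in the source.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    prob = prob[prob > 0]  # skip empty bins so log2 is defined
    return float(-(prob * np.log2(prob)).sum())


def video_entropy(frames):
    """Mean per-frame entropy over an iterable of uint8 grayscale frames."""
    return float(np.mean([frame_entropy(f) for f in frames]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = [rng.integers(0, 256, size=(64, 64), dtype=np.uint8) for _ in range(4)]
    print(video_entropy(clip))
```

Under this definition an 8-bit frame cannot exceed 8 bits, which is consistent with the EN values near 7.2 in the table.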
