MoRe: Motion-aware Feed-forward 4D Reconstruction Transformer

About

Reconstructing dynamic 4D scenes remains challenging because moving objects corrupt camera pose estimation. Existing optimization-based methods alleviate this issue with additional supervision, but most are computationally expensive and impractical for real-time applications. To address these limitations, we propose MoRe, a feed-forward 4D reconstruction network that efficiently recovers dynamic 3D scenes from monocular videos. Built upon a strong static reconstruction backbone, MoRe employs an attention-forcing strategy to disentangle dynamic motion from static structure. To further enhance robustness, we fine-tune the model on large-scale, diverse datasets encompassing both dynamic and static scenes. Moreover, our grouped causal attention captures temporal dependencies and adapts to varying token lengths across frames, ensuring temporally coherent geometry reconstruction. Extensive experiments on multiple benchmarks demonstrate that MoRe achieves high-quality dynamic reconstructions with exceptional efficiency.

Juntong Fang, Zequn Chen, Weiqi Zhang, Donglin Di, Xuancheng Zhang, Chengmin Yang, Yu-Shen Liu • 2026
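
The abstract mentions grouped causal attention that adapts to varying token lengths across frames but does not spell out the masking rule. The sketch below shows one plausible reading, not the authors' implementation: each frame forms a group, and a token may attend to every token in its own frame and in earlier frames. The function name `grouped_causal_mask` and the `tokens_per_frame` parameter are hypothetical.

```python
def grouped_causal_mask(tokens_per_frame):
    """Build a boolean attention mask for frames with varying token counts.

    mask[q][k] is True when query token q may attend to key token k.
    Assumption (not from the paper): a token sees every token of its own
    frame (its group) and of earlier frames, but nothing from later frames.
    """
    # Map each token index to the index of the frame it belongs to.
    frame_of_token = []
    for frame_idx, count in enumerate(tokens_per_frame):
        frame_of_token.extend([frame_idx] * count)

    # Block-wise causal rule: the key's frame must not come after the query's.
    return [
        [key_frame <= query_frame for key_frame in frame_of_token]
        for query_frame in frame_of_token
    ]


# Three frames contributing 2, 3 and 1 tokens respectively.
mask = grouped_causal_mask([2, 3, 1])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because the mask is built from per-frame token counts rather than a fixed block size, frames with different numbers of tokens are handled without padding. That is one way the "varying token lengths" property could be realized.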

Related benchmarks

Task	Dataset	Metric	Result
Video Depth Estimation	Sintel	Delta Threshold Accuracy (1.25)	64.5	235
Camera pose estimation	TUM-dynamic	ATE	0.0115	205
Camera pose estimation	Sintel	ATE	0.0877	203
Video Depth Estimation	KITTI	Abs Rel	0.066	153
Video Depth Estimation	BONN	AbsRel	5.5	139
Video Depth Estimation	TUM dynamics	Abs Rel	0.12	61
Camera pose estimation	CO3D v2 (test)	AUC@30	83	61
Multi-view Stereo Reconstruction	ETH3D (test)	Accuracy	34.8	48
Camera pose estimation	ScanNet static indoor scenes	ATE	0.0375	40
Pose Estimation	BONN	ATE	0.0138	38

Showing 10 of 13 rows
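
For readers unfamiliar with the metrics in the table, the sketch below computes the usual textbook forms of Abs Rel, delta threshold accuracy, and ATE on toy values. The evaluation protocol behind the listed results (valid-pixel masks, scale or trajectory alignment, percentage vs. fraction reporting) is not given here, so the sketch assumes depths are already scaled and trajectories already aligned. The function names are illustrative.

```python
import math


def abs_rel(pred, gt):
    """Mean absolute relative depth error: mean(|pred - gt| / gt)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) below the threshold."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(gt)


def ate_rmse(pred_positions, gt_positions):
    """RMSE of camera position error.

    Assumption: the two trajectories are already expressed in the same
    frame and scale; no alignment step is performed here.
    """
    squared = [
        sum((a - b) ** 2 for a, b in zip(p, g))
        for p, g in zip(pred_positions, gt_positions)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Toy values for illustration only, not taken from any benchmark.
pred_depth = [1.0, 2.2, 4.0, 3.0]
gt_depth = [1.0, 2.0, 4.0, 2.0]
print(abs_rel(pred_depth, gt_depth))
print(delta_accuracy(pred_depth, gt_depth))
print(ate_rmse([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0.1, 0)]))
```

Lower is better for Abs Rel and ATE, while higher is better for delta threshold accuracy.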
