EX-4D: EXtreme Viewpoint 4D Video Synthesis via Depth Watertight Mesh

About

Generating high-quality, camera-controllable videos from monocular input is a challenging task, particularly under extreme viewpoints. Existing methods often struggle with geometric inconsistencies and occlusion artifacts at object boundaries, which degrade visual quality. In this paper, we introduce EX-4D, a novel framework that addresses these challenges through a Depth Watertight Mesh representation. The representation serves as a robust geometric prior by explicitly modeling both visible and occluded regions, ensuring geometric consistency under extreme camera poses. To overcome the lack of paired multi-view datasets, we propose a simulated masking strategy that generates effective training data from monocular videos alone. Additionally, a lightweight LoRA-based video diffusion adapter is employed to synthesize high-quality, physically consistent, and temporally coherent videos. Extensive experiments demonstrate that EX-4D outperforms state-of-the-art methods in terms of physical consistency and extreme-view quality, enabling practical 4D video generation.
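For readers unfamiliar with the term, the general idea behind a LoRA adapter can be sketched in a few lines. This is a generic illustration of low-rank adaptation, not the EX-4D adapter itself: the abstract gives no architectural details, so the function names (`matmul`, `lora_forward`), the shapes, and the `alpha` scaling value below are all assumptions made for the example.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, w, a, b, alpha):
    """Compute x @ w + (alpha / r) * (x @ a) @ b.

    x: (n, d_in) input rows
    w: (d_in, d_out) frozen base weight
    a: (d_in, r) trainable down-projection
    b: (r, d_out) trainable up-projection
    """
    r = len(b)                       # adapter rank
    scale = alpha / r
    base = matmul(x, w)              # frozen path
    delta = matmul(matmul(x, a), b)  # low-rank path, never forms a full d_in x d_out update
    return [[p + scale * q for p, q in zip(base_row, delta_row)]
            for base_row, delta_row in zip(base, delta)]

# Hypothetical sizes chosen only for illustration.
rng = random.Random(0)
d_in, d_out, rank = 8, 8, 2
w = [[rng.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
a = [[rng.gauss(0, 0.01) for _ in range(rank)] for _ in range(d_in)]
b = [[0.0] * d_out for _ in range(rank)]  # zero init: the adapter starts as a no-op
x = [[float(i) for i in range(d_in)]]

y = lora_forward(x, w, a, b, alpha=4.0)
```

Because `b` starts at zero, the adapted layer initially reproduces the frozen layer exactly; training would then update only `a` and `b` while `w` stays fixed, which is what keeps such an adapter lightweight.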

Tao Hu, Haoyang Peng, Xiao Liu, Yuewen Ma • 2025

Related benchmarks

Task	Dataset	Result
Video Generation	VBench	--	126
Novel View Synthesis	iPhone dataset	SSIM 0.479	33
Novel View Synthesis	Droid, BridgeData V2, and RoboCoin (test)	PSNR 12.72	7
Camera-controlled Video Generation	Koala	RotErr 0.637	6
Click Alarmclock	RoboTwin 0° viewpoint	Success Rate 72	6
Camera control and 3D consistency	iPhone dataset	Translation Error 1.325	6
Click Bell	RoboTwin 0° viewpoint	Success Rate 9	6
Video Reshooting	DAVIS and Pexels 110 video-camera pairs (user study)	Source Preservation 1.587	6
Video Reshooting	110 video-camera pairs evaluation dataset (DAVIS and Pexels)	FID 124.6	6
Generative Video Synthesis	RoboTwin	PSNR (dB) 17.031	5

Showing 10 of 18 rows
