EX-4D: EXtreme Viewpoint 4D Video Synthesis via Depth Watertight Mesh
About
Generating high-quality, camera-controllable videos from monocular input is a challenging task, particularly under extreme viewpoints. Existing methods often struggle with geometric inconsistencies and occlusion artifacts at boundaries, leading to degraded visual quality. In this paper, we introduce EX-4D, a novel framework that addresses these challenges through a Depth Watertight Mesh representation. The representation serves as a robust geometric prior by explicitly modeling both visible and occluded regions, ensuring geometric consistency under extreme camera poses. To overcome the lack of paired multi-view datasets, we propose a simulated masking strategy that generates effective training data from monocular videos alone. Additionally, a lightweight LoRA-based video diffusion adapter is employed to synthesize high-quality, physically consistent, and temporally coherent videos. Extensive experiments demonstrate that EX-4D outperforms state-of-the-art methods in terms of physical consistency and extreme-view quality, enabling practical 4D video generation.
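To illustrate why a LoRA-based adapter counts as lightweight, here is a minimal sketch of the general LoRA idea: a frozen weight matrix plus a trainable low-rank update. It is not the EX-4D implementation. The class name `LoRALinear`, the rank, the scaling factor `alpha`, and the layer sizes are all hypothetical; the abstract does not state which layers are adapted or which rank is used.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (out x in) plus a trainable low-rank update B @ A.

    Hypothetical illustration of generic LoRA, not the paper's adapter.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight          # frozen base weight, never updated
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A gets small random values and B starts at zero, so the adapter
        # leaves the base model unchanged at initialisation.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        """Return W + scale * (B @ A)."""
        delta = matmul(self.B, self.A)
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """Apply the adapted layer to an input vector x (list of floats)."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def trainable_params(self):
        """Count only the adapter parameters (A and B)."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Usage: a 4 x 6 base layer adapted with rank 2.
base = [[float(i + j) for j in range(6)] for i in range(4)]
layer = LoRALinear(base, rank=2, alpha=2.0)
x = [1.0, 0.5, -1.0, 2.0, 0.0, 1.0]
y = layer.forward(x)
```

In this toy setting the adapter trains 20 values (2 x 6 for `A` plus 4 x 2 for `B`) against 24 in the full matrix. The saving grows quickly with layer size, since the adapter scales with `rank * (in_dim + out_dim)` rather than `in_dim * out_dim`.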
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Generation | VBench | -- | 126 |
| Novel View Synthesis | iPhone dataset | SSIM 0.479 | 33 |
| Novel View Synthesis | Droid, BridgeData V2, and RoboCoin (test) | PSNR 12.72 | 7 |
| Camera-controlled Video Generation | Koala | RotErr 0.637 | 6 |
| Click Alarmclock | RoboTwin 0° viewpoint | Success Rate 72 | 6 |
| Camera control and 3D consistency | iPhone dataset | Translation Error 1.325 | 6 |
| Click Bell | RoboTwin 0° viewpoint | Success Rate 9 | 6 |
| Video Reshooting | DAVIS and Pexels 110 video-camera pairs (user study) | Source Preservation 1.587 | 6 |
| Video Reshooting | 110 video-camera pairs evaluation dataset (DAVIS and Pexels) | FID 124.6 | 6 |
| Generative Video Synthesis | RoboTwin | PSNR (dB) 17.031 | 5 |