Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction
About
Dynamic scene reconstruction from monocular video remains a fundamental challenge in computer vision. Existing feed-forward methods predict 3D Gaussians pixel-wise for each frame, suffering from duplicated Gaussians and view-dependent biases that hinder effective learning of scene motion. We present C4G, a feed-forward 4D reconstruction framework built upon a compact set of timestamp-conditioned learnable Gaussian query tokens. Each token aggregates corresponding features across the full temporal context and decodes a 3D Gaussian whose position is modulated by the target timestamp, enabling globally coherent motion modeling without per-scene optimization. To capture fine-grained details, we further introduce a video diffusion model-based rendering enhancement module. Since our framework effectively aggregates features into Gaussians, we extend this capability to feature lifting, producing a 4D feature field that supports point tracking and dynamic scene understanding. C4G achieves strong novel-view synthesis performance using significantly fewer Gaussians and without requiring camera poses, while exhibiting stronger motion modeling and robustness to large temporal gaps.
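The query-token idea in the abstract can be illustrated with a small NumPy sketch. This is not the C4G implementation: the single cross-attention step, the sinusoidal timestamp embedding, the linear decoders, and every name (`decode_gaussian_centers`, `W_pos`, `W_motion`) are assumptions made for illustration, and the weights are random rather than learned. It only shows how a fixed set of K queries can attend over features from all frames and emit K Gaussian centers that shift with the target timestamp, so the Gaussian count does not depend on the number of pixels or frames.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decode_gaussian_centers(queries, frame_feats, t, params):
    """Hypothetical sketch of timestamp-conditioned Gaussian queries.

    queries:     (K, D) query tokens (learned in a real model, random here)
    frame_feats: (T, N, D) per-frame image features
    t:           target timestamp, assumed normalized to [0, 1]
    returns:     (K, 3) Gaussian centers at time t
    """
    K, D = queries.shape
    # Flatten all frames so each query sees the full temporal context.
    context = frame_feats.reshape(-1, D)                      # (T*N, D)
    attn = softmax(queries @ context.T / np.sqrt(D), axis=-1) # (K, T*N)
    tokens = queries + attn @ context                         # (K, D)

    # Time-independent base position for each Gaussian.
    base = tokens @ params["W_pos"]                           # (K, 3)

    # Assumed timestamp embedding, appended to every token.
    t_embed = np.array([np.sin(np.pi * t), np.cos(np.pi * t)])
    cond = np.concatenate([tokens, np.tile(t_embed, (K, 1))], axis=1)

    # Timestamp-dependent offset that moves the base position.
    offset = cond @ params["W_motion"]                        # (K, 3)
    return base + offset

rng = np.random.default_rng(0)
K, T, N, D = 8, 4, 16, 32  # 8 queries, 4 frames, 16 feature tokens per frame
params = {
    "W_pos": rng.normal(size=(D, 3)) * 0.1,
    "W_motion": rng.normal(size=(D + 2, 3)) * 0.1,
}
queries = rng.normal(size=(K, D))
frame_feats = rng.normal(size=(T, N, D))

centers_t0 = decode_gaussian_centers(queries, frame_feats, 0.0, params)
centers_t1 = decode_gaussian_centers(queries, frame_feats, 1.0, params)
```

Here 64 feature tokens (4 frames of 16) are summarized by 8 Gaussians, and the same 8 Gaussians are queried at two timestamps, so each one has a consistent identity across time. A pixel-wise predictor would instead emit a separate set of Gaussians for every frame.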
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Novel View Synthesis | NVIDIA | PSNR | 20.51 | 20 |
| Video Object Segmentation | DAVIS | IoU | 63.4 | 16 |
| Novel View Synthesis | TUM-D | PSNR | 19.52 | 10 |
| Novel View Synthesis | ADT | PSNR | 22.35 | 10 |
| Novel View Synthesis | DyCheck | -- | -- | 6 |
| Point Tracking | DriveTrack 4 | Success Rate (dt=2) | 55.51 | 4 |
| Point Tracking | ADT 73 | Accuracy (dt=2) | 93.98 | 4 |
| Temporal-invariant Feature Extraction | DAVIS | < δ^0 Error | 2.2 | 4 |
| Temporal-invariant Feature Extraction | RGB-Stacking | Displacement Error (< δ^0) | 4.9 | 4 |
| Novel View Synthesis | DAVIS | PSNR | 20.181 | 3 |