Moaw: Unleashing Motion Awareness for Video Diffusion Models
About
Video diffusion models, trained on large-scale datasets, naturally capture correspondences between shared features across frames. Recent works have exploited this property for tasks such as optical flow prediction and tracking in a zero-shot setting. Motivated by these findings, we investigate whether supervised training can more fully harness the tracking capability of video diffusion models. To this end, we propose Moaw, a framework that unleashes motion awareness in video diffusion models and leverages it to facilitate motion transfer. Specifically, we train a diffusion model for motion perception, shifting its task from image-to-video generation to video-to-dense-tracking. We then construct a motion-labeled dataset to identify the features that encode the strongest motion information, and inject them into a structurally identical video generation model. Because the two networks share the same architecture, these features can be adapted naturally in a zero-shot manner, enabling motion transfer without additional adapters. Our work provides a new paradigm for bridging generative modeling and motion understanding, paving the way for more unified and controllable video learning frameworks.
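The sketch below illustrates the feature-injection idea with PyTorch forward hooks: motion features are cached from the perception backbone and blended into the matching blocks of the generation backbone. The module names (`motion_model`, `gen_model`, `blocks`) and the blending weight `alpha` are illustrative assumptions, not Moaw's actual implementation.

```python
import torch

# Hedged sketch of zero-shot feature injection between two structurally
# identical diffusion backbones. Both models are assumed to expose the same
# `blocks` ModuleList; all names here are placeholders for illustration.

def collect_motion_features(motion_model, video, block_ids):
    """Cache intermediate activations from the motion-perception backbone."""
    features, hooks = {}, []
    for i in block_ids:
        hooks.append(motion_model.blocks[i].register_forward_hook(
            lambda _m, _inp, out, i=i: features.__setitem__(i, out.detach())))
    with torch.no_grad():
        motion_model(video)  # video-to-dense-tracking forward pass
    for h in hooks:
        h.remove()
    return features

def inject_motion_features(gen_model, features, block_ids, alpha=1.0):
    """Blend the cached motion features into the matching generator blocks."""
    hooks = []
    for i in block_ids:
        hooks.append(gen_model.blocks[i].register_forward_hook(
            lambda _m, _inp, out, i=i: (1.0 - alpha) * out + alpha * features[i]))
    return hooks  # caller removes these hooks once sampling is finished
```

Because the two backbones are homogeneous, no adapter layers are needed: the cached activations can overwrite (or blend with) the generator's activations at the same depth.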
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 2D Long-range optical flow | CVO 7 frames (Clean) | EPE (all) | 8.25 | 16 |
| 2D Long-range optical flow | CVO 7 frames (Final) | EPE (all) | 8.2 | 16 |
| 2D Long-range optical flow | CVO Extended (48 frames) | EPE (all) | 39.89 | 10 |
| Dense 3D Tracking | Kubric-3D 24 frames (test) | APD3D | 54.7 | 4 |
| Motion Transfer | Constructed Motion Transfer Dataset, 120 videos, 1.0 (test) | EPE Down | 14.74 | 2 |
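For reference, EPE (endpoint error) in the rows above is the mean Euclidean distance between predicted and ground-truth flow vectors, so lower is better. A minimal sketch of the computation, assuming dense flow fields shaped `[H, W, 2]`:

```python
import numpy as np

def endpoint_error(flow_pred, flow_gt):
    """Mean endpoint error (EPE): average Euclidean distance between
    predicted and ground-truth per-pixel displacement vectors.
    Both inputs are assumed to have shape [H, W, 2] holding (dx, dy)."""
    return float(np.linalg.norm(flow_pred - flow_gt, axis=-1).mean())
```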