Repurposing Pre-trained Video Diffusion Models for Event-based Video Interpolation
About
Video Frame Interpolation aims to recover realistic missing frames between observed frames, generating a high-frame-rate video from a low-frame-rate video. However, without additional guidance, the large motion between frames makes this problem ill-posed. Event-based Video Frame Interpolation (EVFI) addresses this challenge by using sparse, high-temporal-resolution event measurements as motion guidance. This guidance allows EVFI methods to significantly outperform frame-only methods. However, to date, EVFI methods have relied on a limited set of paired event-frame training data, which severely restricts their performance and generalization. In this work, we overcome the limited-data challenge by adapting pre-trained video diffusion models, trained on internet-scale datasets, to EVFI. We experimentally validate our approach on real-world EVFI datasets, including a new one that we introduce. Our method outperforms existing approaches and generalizes across cameras far better.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Camera pose estimation | Sintel | ATE | 0.319 | 192 |
| Monocular Depth Estimation | Sintel | Abs Rel | 0.366 | 91 |
| Depth Estimation | BONN | Abs Rel | 0.074 | 56 |
| Camera pose estimation | TUM | ATE | 0.006 | 55 |
| Video Depth Estimation | TUM dynamics | Abs Rel | 0.128 | 53 |
| Pose Estimation | BONN | ATE | 0.016 | 38 |
| Video Depth Estimation | PointOdyssey (val) | Abs Rel | 0.108 | 24 |
| Video Frame Interpolation | BS-ERGB 3 skips | PSNR | 27.74 | 15 |
| Video Frame Interpolation | BS-ERGB | FID | 16.37 | 12 |
| Video Frame Prediction | GoPro 7 frames | PSNR | 19.02 | 10 |
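
For readers unfamiliar with the PSNR figures in the table above, the following is a minimal sketch of the standard peak signal-to-noise ratio between a ground-truth frame and an interpolated frame. It is not the evaluation code behind the reported numbers: the function name `psnr`, the flat-list frame representation, and the default peak value of 255 (8-bit images) are assumptions made for illustration.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames.

    Frames are flat sequences of pixel intensities (an illustrative
    simplification; real pipelines usually operate on image arrays).
    """
    if len(reference) != len(estimate):
        raise ValueError("frames must have the same number of pixels")
    if not reference:
        raise ValueError("frames must not be empty")

    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)

    # Identical frames have zero error, so PSNR is unbounded.
    if mse == 0:
        return math.inf

    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher PSNR means the interpolated frame is closer to the ground truth in a per-pixel sense. Published results may differ in details such as color space, per-frame versus per-sequence averaging, and cropping, none of which are specified here.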