Repurposing Pre-trained Video Diffusion Models for Event-based Video Interpolation
About
Video Frame Interpolation aims to recover realistic missing frames between observed frames, generating a high-frame-rate video from a low-frame-rate one. Without additional guidance, however, large inter-frame motion makes this problem ill-posed. Event-based Video Frame Interpolation (EVFI) addresses this challenge by using sparse, high-temporal-resolution event measurements as motion guidance, which allows EVFI methods to significantly outperform frame-only methods. To date, though, EVFI methods have relied on a limited set of paired event-frame training data, severely limiting their performance and generalization. In this work, we overcome this data limitation by adapting pre-trained video diffusion models, trained on internet-scale datasets, to EVFI. We experimentally validate our approach on real-world EVFI datasets, including a new one that we introduce. Our method outperforms existing approaches and generalizes across cameras far better than they do.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Frame Interpolation | BS-ERGB 3 skips | PSNR 27.74 | 15 |
| Video Frame Prediction | GoPro 7 frames | PSNR 19.02 | 10 |
| Video Frame Prediction | GoPro 15 frames | PSNR 18.56 | 10 |
| Video Frame Prediction | BS-ERGB 1 frame (test) | PSNR 21.22 | 10 |
| Video Frame Prediction | BS-ERGB 3 frames (test) | PSNR 18.81 | 10 |
| Video Frame Prediction | HS-ERGB 7 frames (test) | PSNR 20.12 | 10 |
| Video Frame Interpolation | HQF 3 skips | PSNR 29.04 | 9 |
| Video Frame Interpolation | Clear-Motion 15 skips | PSNR 22.94 | 9 |
| Video Frame Interpolation (11x) | Real-world | MSE 0.0057 | 4 |
| Video Frame Interpolation (11x) | Synthetic | MSE 0.0503 | 4 |
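The table mixes two reconstruction metrics, PSNR (in dB, higher is better) and MSE (lower is better). The two are directly related via PSNR = 10 · log10(MAX² / MSE). As a minimal sketch, assuming pixel values normalized to [0, 1] (the benchmarks' exact evaluation convention may differ), the conversion is:

```python
import math

def psnr_from_mse(mse: float, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio (dB) for a given mean-squared error.

    Assumes pixel intensities in [0, max_val]; max_val=1.0 corresponds
    to images normalized to the unit range.
    """
    return 10.0 * math.log10(max_val ** 2 / mse)

# Under this normalization assumption, the real-world 11x MSE of 0.0057
# corresponds to roughly 22.4 dB.
print(round(psnr_from_mse(0.0057), 2))
```

This is only a rule-of-thumb comparison; per-frame averaging order and color handling can shift reported numbers by a fraction of a dB.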