Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation
About
We present a method for generating video sequences with coherent motion between a pair of input keyframes. We adapt a pretrained large-scale image-to-video diffusion model (originally trained to generate videos moving forward in time from a single input image) for keyframe interpolation, i.e., to produce a video between two input frames. We accomplish this adaptation through a lightweight fine-tuning technique that produces a version of the model that instead predicts videos moving backward in time from a single input image. This model (along with the original forward-moving model) is then used in a dual-directional diffusion sampling process that combines the overlapping model estimates starting from each of the two keyframes. Our experiments show that our method outperforms both existing diffusion-based methods and traditional frame interpolation techniques.
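The dual-directional sampling idea can be sketched as follows. This is a toy illustration, not the paper's implementation: `forward_denoise` and `backward_denoise` are hypothetical stand-ins for the pretrained forward-in-time model and the fine-tuned backward-in-time model, and the simple averaging fusion rule is an assumption for clarity.

```python
import numpy as np

def forward_denoise(x, t):
    # Stand-in for the pretrained image-to-video denoiser, which predicts
    # frames moving forward in time from the first keyframe (toy update here).
    return x * (1.0 - 0.1 / t)

def backward_denoise(x, t):
    # Stand-in for the fine-tuned denoiser, which predicts frames moving
    # backward in time from the last keyframe (toy update here).
    return x * (1.0 - 0.1 / t)

def dual_directional_step(x, t):
    """One fused denoising step: run the forward model on the frame sequence
    as-is, run the backward model on the time-reversed sequence and re-reverse
    its output, then combine the two overlapping estimates (here, a plain
    average -- an assumed fusion rule for illustration)."""
    fwd = forward_denoise(x, t)
    bwd = backward_denoise(x[::-1], t)[::-1]  # flip frames along the time axis
    return 0.5 * (fwd + bwd)

def sample(num_frames=5, frame_shape=(4, 4), steps=3, seed=0):
    # Start from Gaussian noise and iterate the fused step from t=steps down to 1.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((num_frames, *frame_shape))
    for t in range(steps, 0, -1):
        x = dual_directional_step(x, t)
    return x

video = sample()
print(video.shape)  # (5, 4, 4): num_frames x height x width
```

In the actual method, both denoisers would also be conditioned on their respective keyframes, so each direction anchors one end of the clip and the fused estimates keep the in-between motion consistent.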
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Frame Interpolation | MultiInterpBench | FID | 53.4 | 24 |
| Video Frame Interpolation | VidGen 1M (test) | FVD | 698 | 11 |
| Video Frame Interpolation | Pexels 45 video-keyframe pairs | LPIPS | 0.1114 | 8 |
| Video Inbetweening | DAVIS | Alignment | 0.1179 | 8 |
| Video Frame Interpolation | DAVIS 2017 (100 video-keyframe pairs) | LPIPS | 0.2432 | 8 |
| Video Frame Interpolation | VFI 1024 x 576 (test) | PSNR | 21.05 | 8 |
| Video Generation | Video Generation 25-frame | PSNR | 17.418 | 6 |
| Video Generation | TGI-Bench 81-frame | PSNR | 15.59 | 6 |
| Generative Inbetweening | TGI-Bench 65-frame | X-CLIP Score | 0.2169 | 6 |
| Generative Inbetweening | TGI-Bench 81-frame | X-CLIP Score | 0.2082 | 6 |