Tuning-free Visual Effect Transfer across Videos
About
We present RefVFX, a new framework that transfers complex temporal effects from a reference video onto a target video or image in a feed-forward manner. While existing methods excel at prompt-based or keyframe-conditioned editing, they struggle with dynamic temporal effects, such as changing lighting or character transformations, that are difficult to describe via text or static conditions. Transferring a video effect is challenging because the model must integrate the new temporal dynamics with the input video's existing motion and appearance.

To address this, we introduce a large-scale dataset of triplets, each consisting of a reference effect video, an input image or video, and a corresponding output video depicting the transferred effect. Creating this data is non-trivial, especially for the video-to-video effect triplets, which do not occur naturally. To generate them, we propose a scalable automated pipeline that creates high-quality paired videos, preserving the input's motion and structure while transforming it according to a fixed, repeatable effect. We then augment this data with image-to-video effects derived from LoRA adapters and with code-based temporal effects generated through programmatic composition.

Building on this dataset, we train a reference-conditioned model on top of recent text-to-video backbones. Experimental results demonstrate that RefVFX produces visually consistent and temporally coherent edits, generalizes to unseen effect categories, and outperforms prompt-only baselines in both quantitative metrics and human preference. See our website at https://snap-research.github.io/RefVFX/
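The paper does not release its effect code, but to make "code-based temporal effects generated through programmatic composition" concrete, here is a minimal sketch of one such effect and of composing two effects by chaining. The function names and the specific effect (a linear fade to black over a flipped clip) are illustrative assumptions, not the actual effects used in the dataset; a video here is a NumPy tensor of shape `(T, H, W, C)`.

```python
import numpy as np

def apply_fade_to_black(frames: np.ndarray) -> np.ndarray:
    """Linearly darken a clip from full brightness to black.

    frames: (T, H, W, C) uint8 video tensor.
    Returns a new tensor of the same shape with the effect applied.
    """
    t = frames.shape[0]
    # Per-frame gain ramps from 1.0 (first frame) down to 0.0 (last frame),
    # broadcast over the spatial and channel dimensions.
    gains = np.linspace(1.0, 0.0, t).reshape(t, 1, 1, 1)
    return (frames.astype(np.float32) * gains).astype(np.uint8)

def flip_then_fade(frames: np.ndarray) -> np.ndarray:
    """Programmatic composition: chain a horizontal flip with the fade."""
    return apply_fade_to_black(frames[:, :, ::-1, :])
```

Because each effect is a pure function from clip to clip, arbitrarily many effects can be composed this way, and the (input clip, transformed clip) pair forms a training example with a fixed, repeatable transformation.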
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video-to-Video Generation | Neural V2V | AES | 0.5649 | 5 |
| Image-to-Video Generation | I2V | AES | 56.07 | 4 |
| Reference-based Video Effect Transfer | Neural V2V | Input Similarity | 0.8568 | 4 |
| Video-to-Video Generation | Code Based V2V | AES | 0.4802 | 4 |
| Image-to-Video Generation | Image-to-Video (I2V) unseen LoRA effects (val) | Ref Video Adherence (Win Rate) | 81.5 | 3 |
| Reference-based Video Effect Transfer | Code Based V2V | Input Similarity | 94.79 | 3 |
| Image-to-Video | RefVFX | First Frame Similarity | 0.7698 | 3 |