Video Motion Transfer with Diffusion Transformers
About
We propose DiTFlow, a method for transferring the motion of a reference video to a newly synthesized one, designed specifically for Diffusion Transformers (DiT). We first process the reference video with a pre-trained DiT to analyze its cross-frame attention maps and extract a patch-wise motion signal called the Attention Motion Flow (AMF). We then guide the latent denoising process in an optimization-based, training-free manner, optimizing the latents with our AMF loss so that the generated video reproduces the motion of the reference. We also apply our optimization strategy to the transformer positional embeddings, which yields a boost in zero-shot motion transfer capabilities. We evaluate DiTFlow against recently published methods, outperforming all of them across multiple metrics and in human evaluation.
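To make the two ingredients concrete, the sketch below illustrates the general idea under stated assumptions: it is not the authors' released implementation, and the tensor layout, the `get_attn` hook, and the hyperparameters are hypothetical. The first function turns cross-frame attention maps into a patch-wise displacement field (an AMF-style signal); the second runs a small training-free optimization of the noisy latents against a reference AMF.

```python
import torch
import torch.nn.functional as F

def attention_motion_flow(attn, h, w):
    """Patch-wise motion from cross-frame attention (assumed layout).

    attn: (T-1, h*w, h*w) tensor, where attn[t] holds the attention of
    frame-t query patches over frame-(t+1) key patches.
    Returns: (T-1, h*w, 2) expected displacement per patch.
    """
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float()  # (h*w, 2) patch centers
    expected = attn @ grid        # attention-weighted target position of each patch
    return expected - grid        # displacement field = motion flow

def guide_latents(latents, ref_amf, get_attn, h, w, lr=0.05, steps=5):
    """Training-free guidance sketch: optimize the noisy latents so the AMF of
    the generated video matches the reference AMF. `get_attn` stands in for a
    hook that returns cross-frame attention maps from the pre-trained DiT."""
    latents = latents.detach().requires_grad_(True)
    opt = torch.optim.Adam([latents], lr=lr)
    for _ in range(steps):
        gen_amf = attention_motion_flow(get_attn(latents), h, w)
        loss = F.mse_loss(gen_amf, ref_amf)  # AMF loss against the reference motion
        opt.zero_grad()
        loss.backward()
        opt.step()
    return latents.detach()
```

In practice, such a step would be interleaved with the denoising schedule: the reference AMF is extracted once from the reference video, and the latents are refined at selected timesteps before continuing the standard DiT sampling.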
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Generation | VBench | -- | -- | 102 |
| Motion Transfer | DAVIS Caption | MF Score | 0.79 | 12 |
| Motion Transfer | DAVIS Subject | MF | 77.5 | 12 |
| Motion Transfer | DAVIS Scene | MF Score | 0.789 | 12 |
| Motion Transfer | DAVIS All | MF | 0.785 | 12 |
| Motion Transfer | DAVIS Easy | CLIP Score | 0.3174 | 9 |
| Motion Transfer | DAVIS Hard | CLIP Score | 0.3191 | 9 |
| Motion Transfer | DAVIS Medium | CLIP Score | 0.3204 | 9 |
| Motion Transfer | DAVIS (All subsets) | CLIP Score | 0.3178 | 9 |
| Video Motion Transfer | DAVIS | Text Similarity | 20.91 | 8 |