Learning Zero-Shot Subject-Driven Video Generation Using 1% Compute
About
Subject-driven video generation (SDV-Gen) aims to produce videos of a specific subject by adapting a pretrained video model, enabling personalized and application-driven content creation. Per-subject tuning methods require approximately 200 A100 GPU hours to generate a customized video, whereas zero-shot methods avoid per-subject tuning but typically rely on millions of subject-video pairs for supervision, incurring massive fine-tuning costs (10K-200K A100 GPU hours). We propose a data- and compute-efficient zero-shot SDV-Gen framework that requires neither test-time per-subject tuning nor large-scale subject-video pairs. Our key idea is to decompose SDV-Gen into (i) identity injection, learned from subject-image pairs, and (ii) motion-awareness preservation, maintained by a small set of arbitrary videos. We optimize the two tasks with stochastic switching, using random reference-frame sampling and image-token dropout to prevent trivial first-frame copying. Our gradient analysis shows that the two objectives rapidly evolve toward nearly orthogonal update subspaces, which explains the stable optimization. Using CogVideoX-5B, we adapt a single model with 200K subject-image pairs and 4,000 arbitrary videos in 288 A100 GPU hours. This is about 1% of the compute of prior zero-shot baselines (0.4% of VACE and 2.8% of Phantom) and uses no subject-video pairs, yet the model remains competitive in subject fidelity and motion quality. We also show that the same recipe transfers to Wan 2.2-5B.
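The stochastic switching and the gradient-orthogonality check described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the switching probability `p_identity`, the dropout rate `p_drop`, the frame count, and the choice to apply reference-frame sampling and token dropout only on the video branch are all assumptions made for illustration.

```python
import math
import random


def sample_training_step(rng, num_frames=49, p_identity=0.5, p_drop=0.1):
    """Choose the objective for one step. All defaults are hypothetical."""
    if rng.random() < p_identity:
        # Identity injection: supervised by a subject-image pair only.
        return {"task": "identity", "ref_frame": None, "drop_image_tokens": False}
    # Motion preservation: supervised by an arbitrary video.
    # Assumption: the reference frame is drawn uniformly instead of always
    # using frame 0, and image tokens are occasionally dropped, so the model
    # cannot solve the task by copying the first frame.
    return {
        "task": "motion",
        "ref_frame": rng.randrange(num_frames),
        "drop_image_tokens": rng.random() < p_drop,
    }


def cosine_similarity(u, v):
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    rng = random.Random(0)
    steps = [sample_training_step(rng) for _ in range(1000)]
    share = sum(s["task"] == "identity" for s in steps) / len(steps)
    print(f"identity steps: {share:.2f}")
    # Toy gradients: a value near 0 means the two updates barely interfere.
    print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 2.0, 0.0]))
```

In a real training loop, the two vectors passed to `cosine_similarity` would be the flattened parameter gradients of the identity and motion losses at the same checkpoint; a value near zero would correspond to the nearly orthogonal update subspaces mentioned in the abstract.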
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| subject-to-video generation | OpenS2V | Total | 53.26 | 20 |
| Subject-driven video generation | VBench | Motion Smoothness | 98.53 | 8 |
| Subject-driven video generation | Subject-driven Video Generation | Training Steps | 4.00e+3 | 7 |
| Personalized Video Generation | Personalized Video Generation Dataset | IDINO | 66.1 | 5 |
| Video Personalization | VBench | Subject Consistency | 0.9811 | 5 |
| Video Generation | Pexels | ID Consistency | 4.08 | 4 |