Phantom: Subject-consistent video generation via cross-modal alignment
About
As foundation models for video generation mature, they are being adapted to a growing range of applications, yet subject-consistent video generation remains at an exploratory stage. We refer to this task as Subject-to-Video: extracting subject elements from reference images and generating subject-consistent videos that follow textual instructions. We believe the essence of subject-to-video lies in balancing the dual-modal prompts of text and image, so that the output is deeply aligned with both the text and the visual content simultaneously. To this end, we propose Phantom, a unified video generation framework for both single- and multi-subject references. Building on existing text-to-video and image-to-video architectures, we redesign the joint text-image injection model and drive it to learn cross-modal alignment via text-image-video triplet data. The proposed method achieves high-fidelity subject-consistent video generation while addressing image content leakage and multi-subject confusion. Evaluation results indicate that our method outperforms state-of-the-art closed-source commercial solutions. In particular, we emphasize subject consistency in human generation, which covers existing ID-preserving video generation while offering enhanced advantages.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Identity-Preserving Video Generation | OpenS2V (test) | Face Similarity | 0.519 | 17 |
| subject-to-video generation | OpenS2V-Eval zero-shot (test) | Total Score | 56.77 | 16 |
| Single-ID Video Generation | Single-ID (evaluation) | ID-Sim | 49.2 | 13 |
| Subject-to-video | OpenS2V Eval | Total Score | 56.77 | 11 |
| subject-to-video generation | OpenS2V-Nexus (held-out set of 180 subject-text pairs) | Total Score | 52.32 | 11 |
| Compositional Multi-Image-to-Video Generation | IntelligentVBench 3Subjects with BKG | IF | 2.36 | 10 |
| Compositional Multi-Image-to-Video Generation | IntelligentVBench 1Subject with BKG | IF | 3.21 | 10 |
| Compositional Multi-Image-to-Video Generation | IntelligentVBench 2Subjects with BKG | IF Score | 2.88 | 10 |
| Multi-shot Video Generation | 90 prompts evaluation suite | Type Accuracy | 62.11 | 9 |
| Video Customization | 70-example benchmark 1.0 (test) | FaceSim Arc | 0.58 | 9 |