
Phantom: Subject-consistent video generation via cross-modal alignment

About

Foundation models for video generation continue to develop and are expanding into various applications, yet subject-consistent video generation remains in the exploratory stage. We refer to this task as Subject-to-Video: it extracts subject elements from reference images and generates subject-consistent videos that follow textual instructions. We believe that the essence of subject-to-video lies in balancing the dual-modal prompts of text and image, thereby deeply and simultaneously aligning both text and visual content. To this end, we propose Phantom, a unified video generation framework for both single- and multi-subject references. Building on existing text-to-video and image-to-video architectures, we redesign the joint text-image injection model and drive it to learn cross-modal alignment via text-image-video triplet data. The proposed method achieves high-fidelity subject-consistent video generation while addressing the issues of image content leakage and multi-subject confusion. Evaluation results indicate that our method outperforms other state-of-the-art closed-source commercial solutions. In particular, we emphasize subject consistency in human generation, covering existing ID-preserving video generation while offering enhanced advantages.

Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Gen Li, Siyu Zhou, Qian He, Xinglong Wu • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Identity-Preserving Video Generation | OpenS2V (test) | Face Similarity: 0.519 | 17 |
| subject-to-video generation | OpenS2V-Eval zero-shot (test) | Total Score: 56.77 | 16 |
| Single-ID Video Generation | Single-ID (evaluation) | ID-Sim: 49.2 | 13 |
| Subject-to-video | OpenS2V Eval | Total Score: 56.77 | 11 |
| subject-to-video generation | OpenS2V-Nexus (held-out set of 180 subject-text pairs) | Total Score: 52.32 | 11 |
| Compositional Multi-Image-to-Video Generation | IntelligentVBench 3Subjects with BKG | IF: 2.36 | 10 |
| Compositional Multi-Image-to-Video Generation | IntelligentVBench 1Subject with BKG | IF: 3.21 | 10 |
| Compositional Multi-Image-to-Video Generation | IntelligentVBench 2Subjects with BKG | IF Score: 2.88 | 10 |
| Multi-shot Video Generation | 90 prompts evaluation suite | Type Accuracy: 62.11 | 9 |
| Video Customization | 70-example benchmark 1.0 (test) | FaceSim Arc: 0.58 | 9 |
Showing 10 of 46 rows
