Phantom: Subject-consistent video generation via cross-modal alignment

About

Foundation models for video generation are being extended to a growing range of applications, yet subject-consistent video generation remains at an exploratory stage. We refer to this task as Subject-to-Video: extracting subject elements from reference images and generating subject-consistent videos that follow textual instructions. We believe the essence of subject-to-video lies in balancing the dual-modal prompts of text and image, so that the output aligns deeply and simultaneously with both the text and the visual content. To this end, we propose Phantom, a unified video generation framework for both single- and multi-subject references. Building on existing text-to-video and image-to-video architectures, we redesign the joint text-image injection model and drive it to learn cross-modal alignment via text-image-video triplet data. The proposed method achieves high-fidelity subject-consistent video generation while addressing image content leakage and multi-subject confusion. Evaluation results indicate that our method outperforms state-of-the-art closed-source commercial solutions. In particular, we emphasize subject consistency in human generation, which covers existing ID-preserving video generation while offering additional advantages.

Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Gen Li, Siyu Zhou, Qian He, Xinglong Wu • 2025

Related benchmarks

Task	Dataset	Result
subject-to-video generation	OpenS2V	Total 58.1	32
Subject-to-video	OpenS2V Eval	Total Score 56.77	23
Compositional Multi-Image-to-Video Generation	IntelligentVBench 3Subjects with BKG	IF 2.36	21
Compositional Multi-Image-to-Video Generation	IntelligentVBench 1Subject with BKG	IF 3.21	21
Compositional Multi-Image-to-Video Generation	IntelligentVBench 2Subjects with BKG	IF Score 2.88	21
Subject-Preserving Video Generation	OpenS2V-Eval Human-Domain	Total Score 58.69	17
Identity-Preserving Video Generation	OpenS2V (test)	Face Similarity 0.519	17
Video Generation	User Study	Interaction Plausibility Score 4.29	16
subject-to-video generation	OpenS2V-Eval zero-shot (test)	Total Score 56.77	16
Single-ID Video Generation	Single-ID (evaluation)	ID-Sim 49.2	13

Showing 10 of 65 rows
