
SkyReels-A2: Compose Anything in Video Diffusion Transformers

About

This paper presents SkyReels-A2, a controllable video generation framework capable of assembling arbitrary visual elements (e.g., characters, objects, backgrounds) into synthesized videos based on textual prompts while maintaining strict consistency with reference images for each element. We term this task elements-to-video (E2V). Its primary challenges lie in preserving the fidelity of each reference element, ensuring coherent composition of the scene, and achieving natural outputs. To address these, we first design a comprehensive data pipeline to construct prompt-reference-video triplets for model training. Next, we propose a novel image-text joint embedding model to inject multi-element representations into the generative process, balancing element-specific consistency with global coherence and text alignment. We also optimize the inference pipeline for both speed and output stability. Moreover, we introduce a carefully curated benchmark for systematic evaluation, i.e., A2 Bench. Experiments demonstrate that our framework can generate diverse, high-quality videos with precise element control. SkyReels-A2 is the first open-source, commercial-grade model for E2V generation, performing favorably against advanced closed-source commercial models. We anticipate SkyReels-A2 will advance creative applications such as drama and virtual e-commerce, pushing the boundaries of controllable video generation.

Zhengcong Fei, Debang Li, Di Qiu, Jiahua Wang, Yikun Dou, Rui Wang, Jingtao Xu, Mingyuan Fan, Guibin Chen, Yang Li, Yahui Zhou • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Multi-reference Video Generation | S2VTime | t-L2: 0.268 | 18 |
| subject-to-video generation | OpenS2V-Eval zero-shot (test) | Total Score: 52.25 | 16 |
| Single-ID Video Generation | Single-ID (evaluation) | ID-Sim: 51.1 | 13 |
| Image-to-Video Generation | VBench | Motion Smoothness: 0.96 | 12 |
| Subject-to-video | OpenS2V Eval | Total Score: 52.25 | 11 |
| subject-to-video generation | OpenS2V-Nexus (held-out set of 180 subject-text pairs) | Total Score: 49.61 | 11 |
| Compositional Multi-Image-to-Video Generation | IntelligentVBench 1 Subject with BKG | IF: 3.51 | 10 |
| Compositional Multi-Image-to-Video Generation | IntelligentVBench 2 Subjects with BKG | IF Score: 3.22 | 10 |
| Compositional Multi-Image-to-Video Generation | IntelligentVBench 3 Subjects with BKG | IF: 1.64 | 10 |
| Reference-to-Video Generation | OpenS2V-Eval 2025a | Total Score: 52.25 | 9 |
Showing 10 of 22 rows
