SkyReels-V3 Technique Report

About

Video generation serves as a cornerstone for building world models, where multimodal contextual inference stands as the defining test of capability. To this end, we present SkyReels-V3, a conditional video generation model built upon a unified multimodal in-context learning framework with diffusion Transformers. SkyReels-V3 supports three core generative paradigms within a single architecture: reference images-to-video synthesis, video-to-video extension, and audio-guided video generation. (i) The reference images-to-video model is designed to produce high-fidelity videos with strong subject identity preservation, temporal coherence, and narrative consistency. To enhance reference adherence and compositional stability, we design a comprehensive data processing pipeline that leverages cross-frame pairing, image editing, and semantic rewriting, effectively mitigating copy-paste artifacts. During training, an image-video hybrid strategy combined with multi-resolution joint optimization is employed to improve generalization and robustness across diverse scenarios. (ii) The video extension model integrates spatio-temporal consistency modeling with large-scale video understanding, enabling both seamless single-shot continuation and intelligent multi-shot switching with professional cinematographic patterns. (iii) The talking avatar model supports minute-level audio-conditioned video generation by training first-and-last frame insertion patterns and reconstructing key-frame inference paradigms. Audio-video synchronization is also optimized while preserving visual quality. Extensive evaluations demonstrate that SkyReels-V3 achieves state-of-the-art or near state-of-the-art performance on key metrics including visual quality, instruction following, and specific aspect metrics, approaching leading closed-source systems. GitHub: https://github.com/SkyworkAI/SkyReels-V3.

Debang Li, Zhengcong Fei, Tuanhui Li, Yikun Dou, Zheng Chen, Jiangping Yang, Mingyuan Fan, Jingtao Xu, Jiahua Wang, Baoxuan Gu, Mingshan Chang, Wenjing Cai, Yuqiang Xie, Binjie Mao, Youqiang Zhang, Nuo Pang, Hao Zhang, Yuzhe Jin, Zhiheng Xu, Dixuan Lin, Guibin Chen, Yahui Zhou • 2026

Related benchmarks

Task	Dataset	Result
Compositional Multi-Image-to-Video Generation	IntelligentVBench 3Subjects with BKG	IF 2.59	21
Compositional Multi-Image-to-Video Generation	IntelligentVBench 2Subjects with BKG	IF Score 3.28	21
Compositional Multi-Image-to-Video Generation	IntelligentVBench 1Subject with BKG	IF 3.46	21
Video Generation	User Study	Interaction Plausibility Score 4.54	16
Video Personalization	OpenS2V-Eval & Self-Constructed (In-Domain test)	DINO-I Score 0.407	11
Video Personalization	OpenS2V-Eval & Self-Constructed (test)	AES 0.481	11
Video Personalization	OpenS2V-Eval & Self-Constructed Cross-Domain (test)	NANO-CLIP Score 0.593	11
HOI Video Generation	HOI video generation (test)	AES Score 56.3	7
Video Generation	Custom V3 (test)	Reference Consistency 66.98	4
Talking Avatar Generation	Talking Avatar Evaluation Set (test)	Audio-Visual Sync 8.18	4
