Open-Sora: Democratizing Efficient Video Production for All

About

Vision and language are the two foundational senses for humans, and they build up our cognitive ability and intelligence. While significant breakthroughs have been made in AI language ability, artificial visual intelligence, especially the ability to generate and simulate the world we see, lags far behind. To facilitate the development and accessibility of artificial visual intelligence, we created Open-Sora, an open-source video generation model designed to produce high-fidelity video content. Open-Sora supports a wide spectrum of visual generation tasks, including text-to-image generation, text-to-video generation, and image-to-video generation. The model leverages advanced deep learning architectures and training/inference techniques to enable flexible video synthesis, and it can generate videos of up to 15 seconds, at up to 720p resolution, with arbitrary aspect ratios. Specifically, we introduce the Spatial-Temporal Diffusion Transformer (STDiT), an efficient diffusion framework for videos that decouples spatial and temporal attention. We also introduce a highly compressive 3D autoencoder to make representations compact, and we further accelerate training with an ad hoc training strategy. Through this initiative, we aim to foster innovation, creativity, and inclusivity within the community of AI content creation. By embracing the open-source principle, Open-Sora provides full access to all training, inference, and data preparation code as well as model weights. All resources are publicly available at: https://github.com/hpcaitech/Open-Sora.
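
To give a rough sense of why decoupling spatial and temporal attention can be cheaper than attending over the whole clip at once, the sketch below counts query-key pairs in both layouts. It is an illustration only, not the STDiT implementation: the function names and the latent sizes (16 frames, 256 tokens per frame) are assumptions made for this example, and the count ignores heads, channel width, and cross-attention to the text prompt.

```python
def full_attention_pairs(frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs when every token attends to every token in the clip."""
    n = frames * tokens_per_frame
    return n * n


def decoupled_attention_pairs(frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs for one spatial pass plus one temporal pass."""
    # Spatial pass: each frame attends only among its own tokens.
    spatial = frames * tokens_per_frame ** 2
    # Temporal pass: each spatial position attends only across frames.
    temporal = tokens_per_frame * frames ** 2
    return spatial + temporal


if __name__ == "__main__":
    # Hypothetical latent sizes chosen for illustration, not taken from the paper.
    frames, tokens = 16, 256
    full = full_attention_pairs(frames, tokens)
    split = decoupled_attention_pairs(frames, tokens)
    print(f"full: {full:,}  decoupled: {split:,}  ratio: {full / split:.1f}x")
```

Under these assumed sizes the joint layout scores about 16.8 million pairs while the decoupled layout scores about 1.1 million. The gap widens as the clip gets longer, because the spatial term grows linearly with the number of frames.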

Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, Yang You • 2024

Related benchmarks

Task	Dataset	Result
Text-to-Video Generation	VBench	Quality Score: 81.35	209
Video Generation	VBench	Quality Score: 81.35	126
Text-to-Video Generation	T2V-CompBench	Consistency Attribute Score: 0.672	92
Video Reconstruction	UCF-101	rFVD: 67.52	67
Video Generation	VideoPhy	SA (%): 38	50
Video Generation	VBench Long	Motion Smoothness: 98.5	49
Video Generation	VBench 2.0 (test)	Total Score: 79.76	49
Image-to-Video Generation	VBench	Motion Smoothness: 0.992	46
Video Reconstruction	WebVid 10M	PSNR: 31.14	45
Text-to-Video Generation	EvalCrafter	Text-Video Alignment: 71.38	34

Showing 10 of 73 rows
