FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space

About

We introduce FSVideo, a fast transformer-based image-to-video (I2V) diffusion framework. The framework is built on three key components: (1) a new video autoencoder with a highly compressed latent space ($64\times64\times4$ spatio-temporal downsampling ratio) that achieves competitive reconstruction quality; (2) a diffusion transformer (DIT) architecture with a new layer memory design that enhances inter-layer information flow and context reuse within the DIT; and (3) a multi-resolution generation strategy that uses a few-step DIT upsampler to increase video fidelity. Our final model, which comprises a 14B DIT base model and a 14B DIT upsampler, achieves competitive performance against other popular open-source models while being an order of magnitude faster. This report discusses the model design and the training strategies.
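
To give a feel for what a $64\times64\times4$ downsampling ratio means in practice, the following sketch counts latent grid positions for a clip. It is only illustrative arithmetic, not the FSVideo autoencoder: the function names, the 120-frame 720x1280 clip, the use of ceiling division on every axis, and the $8\times8\times4$ ratio used for comparison are all assumptions made here, and the real model may handle padding or the first frame differently.

```python
import math


def latent_grid(frames, height, width, ratio=(4, 64, 64)):
    """Return the (t, h, w) latent grid size for a clip.

    `ratio` is (temporal, height, width) downsampling. Ceiling division
    is an assumption; the actual autoencoder may pad or crop differently.
    """
    rt, rh, rw = ratio
    return (
        math.ceil(frames / rt),
        math.ceil(height / rh),
        math.ceil(width / rw),
    )


def num_positions(grid):
    """Number of spatio-temporal positions in a latent grid."""
    t, h, w = grid
    return t * h * w


# Hypothetical clip: 120 frames at 720x1280.
high_compression = latent_grid(120, 720, 1280)               # 64x64x4 ratio
comparison = latent_grid(120, 720, 1280, ratio=(4, 8, 8))    # assumed 8x8x4 ratio

print(high_compression, num_positions(high_compression))
print(comparison, num_positions(comparison))
print(num_positions(comparison) / num_positions(high_compression))
```

Under these assumptions the more aggressive ratio leaves far fewer latent positions for the diffusion transformer to process, which is consistent with the speed claim in the abstract; the exact saving depends on details such as latent channel count and patchification that the abstract does not state.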

FSVideo Team, Qingyu Chen, Zhiyuan Fang, Haibin Huang, Xinwei Huang, Tong Jin, Minxuan Lin, Bo Liu, Celong Liu, Chongyang Ma, Xing Mei, Xiaohui Shen, Yaojie Shen, Fuwen Tan, Angtian Wang, Xiao Yang, Yiding Yang, Jiamin Yuan, Lingxi Zhang, Yuxin Zhang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Video Reconstruction | WebVid 10M | PSNR | 30.91 | 34 |
| Video Reconstruction | Inter-4K | SSIM | 0.806 | 12 |
| Image-to-Video Generation | VBench I2V 720x1280 2.0 (test) | Total Score | 88.12 | 6 |
