Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability

About

Latent diffusion models pair VAEs with diffusion backbones, and the structure of VAE latents strongly influences the difficulty of diffusion training. However, existing video VAEs typically focus on reconstruction fidelity, overlooking latent structure. We present a statistical analysis of video VAE latent spaces and identify two spectral properties essential for diffusion training: a spatio-temporal frequency spectrum biased toward low frequencies, and a channel-wise eigenspectrum dominated by a few modes. To induce these properties, we propose two lightweight, backbone-agnostic regularizers: Local Correlation Regularization and Latent Masked Reconstruction. Experiments show that our Spectral-Structured VAE (SSVAE) achieves a $3\times$ speedup in text-to-video generation convergence and a 10\% gain in video reward, outperforming strong open-source VAEs. The code is available at https://github.com/zai-org/SSVAE.
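The two spectral properties named in the abstract can be probed on any latent tensor with a few lines of NumPy. The sketch below is illustrative only and is not the authors' analysis code: the function names, the `(C, T, H, W)` latent layout, and the `cutoff` radius are assumptions made for this example. The first function measures how much spectral energy sits at low spatio-temporal frequencies, and the second returns the normalized eigenvalues of the channel covariance.

```python
import numpy as np


def low_frequency_energy_fraction(latent, cutoff=0.25):
    """Fraction of spectral energy at low spatio-temporal frequencies.

    latent: array of shape (C, T, H, W) (assumed layout).
    cutoff: radius in cycles per sample (hypothetical threshold).
    """
    # 3D FFT over the time, height, and width axes of each channel.
    power = np.abs(np.fft.fftn(latent, axes=(1, 2, 3))) ** 2
    _, t, h, w = latent.shape
    ft = np.fft.fftfreq(t)[:, None, None]
    fh = np.fft.fftfreq(h)[None, :, None]
    fw = np.fft.fftfreq(w)[None, None, :]
    radius = np.sqrt(ft**2 + fh**2 + fw**2)
    mask = radius <= cutoff  # boolean mask of shape (T, H, W)
    return float(power[:, mask].sum() / power.sum())


def channel_eigenspectrum(latent):
    """Normalized eigenvalues of the channel covariance, in descending order."""
    c = latent.shape[0]
    flat = latent.reshape(c, -1)
    flat = flat - flat.mean(axis=1, keepdims=True)
    cov = flat @ flat.T / flat.shape[1]
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order
    return eigvals / eigvals.sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise = rng.standard_normal((4, 8, 16, 16))
    print("low-frequency fraction (white noise):", low_frequency_energy_fraction(noise))
    print("channel eigenspectrum (white noise):", channel_eigenspectrum(noise))
```

Under these assumptions, a latent with the properties described above would show a low-frequency fraction near 1 and an eigenspectrum whose first few entries carry most of the mass, whereas white noise spreads energy roughly evenly across both.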

Shizhan Liu, Xinran Deng, Zhuoyi Yang, Jiayan Teng, Xiaotao Gu, Jie Tang • 2025

Related benchmarks

Task | Dataset | Result | Rank
Text-to-Video Generation | VBench 17 frames, 512x512 | UR 45.9 | 11
Text-to-Video Generation | Movie 17 frames, 512x512 (val) | UR 50.8 | 11
Text-to-Video Generation | MovieGenBench 17 frames, 512x512 | UR 34.5 | 11
Text-to-Video Generation | VBench 17 frames, 256x256 | UR 34.7 | 9
Text-to-Video Generation | MovieGenBench 17 frames, 256x256 | UR 30.2 | 9
Text-to-Video Generation | Movie 17 frames, 256x256 (val) | UR 39.1 | 9
Video Reconstruction | Movie (val) | PSNR 37.51 | 9
Text-to-Video Generation | Movie 81 frames, 256x256 (val) | UR 40.8 | 2
Text-to-Video Generation | VBench 81 frames, 256x256 | UR 37.8 | 2
Text-to-Video Generation | MovieGenBench 81 frames, 256x256 | UR 31.6 | 2