SRA 2: Variational Autoencoder Self-Representation Alignment for Efficient Diffusion Training

About

Denoising-based diffusion transformers, despite their strong generation performance, suffer from inefficient training convergence. Existing methods addressing this issue, such as REPA (relying on external representation encoders) or SRA (requiring dual-model setups), inevitably incur heavy computational overhead during training due to external dependencies. To tackle these challenges, this paper proposes SRA 2, a lightweight intrinsic guidance framework for efficient diffusion training. SRA 2 leverages off-the-shelf pre-trained Variational Autoencoder (VAE) features: their reconstruction property ensures inherent encoding of visual priors like rich texture details, structural patterns, and basic semantic information. Specifically, SRA 2 aligns the intermediate latent features of diffusion transformers with VAE features via a lightweight projection layer, supervised by a feature alignment loss. This design accelerates training without extra representation encoders or dual-model maintenance, resulting in a simple yet effective pipeline. Extensive experiments demonstrate that SRA 2 improves both generation quality and training convergence speed compared to vanilla diffusion transformers, matches or outperforms state-of-the-art acceleration methods, and incurs merely 4% extra GFLOPs with zero additional cost for external guidance models.
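
The alignment step described above can be illustrated with a minimal sketch. The abstract does not specify the exact form of the feature alignment loss or the projection layer, so this example assumes a single linear projection and a cosine-similarity loss; the names `alignment_loss`, `total_loss`, and the weight `lam` are hypothetical and not taken from the paper.

```python
import numpy as np

def alignment_loss(hidden, vae_feats, W, b):
    """Assumed alignment loss: 1 - mean cosine similarity.

    hidden:    (N, D_h) intermediate diffusion-transformer features
    vae_feats: (N, D_v) features from a frozen, pre-trained VAE
    W, b:      parameters of a linear projection D_h -> D_v (assumption)
    """
    eps = 1e-8
    projected = hidden @ W + b  # lightweight projection layer
    # Normalize both sides so the loss depends only on direction.
    p = projected / (np.linalg.norm(projected, axis=-1, keepdims=True) + eps)
    t = vae_feats / (np.linalg.norm(vae_feats, axis=-1, keepdims=True) + eps)
    cos = np.sum(p * t, axis=-1)  # per-token cosine similarity
    return float(1.0 - cos.mean())

def total_loss(denoise_loss, align_loss, lam=0.5):
    """Denoising objective plus weighted alignment term (lam is hypothetical)."""
    return denoise_loss + lam * align_loss

# Toy usage with random tensors standing in for real features.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(16, 32))
vae_feats = rng.normal(size=(16, 8))
W = rng.normal(size=(32, 8)) * 0.1
b = np.zeros(8)
loss = alignment_loss(hidden, vae_feats, W, b)
```

In this sketch the VAE features act as a fixed target, so only the diffusion transformer and the projection would receive gradients in a real training loop.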

Mengmeng Wang, Dengyang Jiang, Liuzhuozheng Li, Yucheng Lin, Guojiang Shen, Xiangjie Kong, Yong Liu, Guang Dai, Jingdong Wang • 2026

Related benchmarks

Task	Dataset	Result
Image Generation	ImageNet 256x256	IS 316.2	606
Image Generation	ImageNet 256x256 (train)	FID 1.52	247
Text-to-Image Generation	MS-COCO (val)	FID 4.67	215
Text-to-Image Generation	MS-COCO	FID 4.91	193
Image Generation	ImageNet 256x256 (test)	FID 1.52	125
Image Generation	ImageNet 512x512 (test/val)	FID 2.45	39
