AlignVAR: Towards Globally Consistent Visual Autoregression for Image Super-Resolution

About

Visual autoregressive (VAR) models have recently emerged as a promising alternative for image generation, offering stable training, non-iterative inference, and high-fidelity synthesis through next-scale prediction. This encourages the exploration of VAR for image super-resolution (ISR), yet its application remains underexplored and faces two critical challenges: locality-biased attention, which fragments spatial structures, and residual-only supervision, which accumulates errors across scales. Both severely compromise the global consistency of reconstructed images. To address these issues, we propose AlignVAR, a globally consistent visual autoregressive framework tailored for ISR, featuring two key components: (1) Spatial Consistency Autoregression (SCA), which applies an adaptive mask to reweight attention toward structurally correlated regions, thereby mitigating excessive locality and enhancing long-range dependencies; and (2) Hierarchical Consistency Constraint (HCC), which augments residual learning with full reconstruction supervision at each scale, exposing accumulated deviations early and stabilizing the coarse-to-fine refinement process. Extensive experiments demonstrate that AlignVAR consistently enhances structural coherence and perceptual fidelity over existing generative methods, while delivering over 10x faster inference with nearly 50% fewer parameters than leading diffusion-based approaches, establishing a new paradigm for efficient ISR.
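The two ideas named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give the form of the adaptive mask or of the loss, so the additive logit bias in `masked_attention`, the nearest-neighbour upsampling, the mean-squared-error terms, and the weight `lam` are all assumptions chosen for illustration, and every function name here is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(q, k, v, bias):
    """Scaled dot-product attention with an additive bias on the logits.

    Assumption: the 'adaptive mask' is modelled as bias[i, j], where a
    positive value raises the weight token i gives to token j.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + bias
    weights = softmax(logits, axis=-1)
    return weights @ v, weights

def upsample_nearest(x, size):
    """Nearest-neighbour resize of a square 2D map to (size, size)."""
    idx = np.arange(size) * x.shape[0] // size
    return x[np.ix_(idx, idx)]

def hierarchical_loss(pred_residuals, target_residuals, lam=1.0):
    """Residual loss per scale plus a loss on the accumulated reconstruction.

    Assumption: the reconstruction at a scale is the sum of all residuals
    predicted so far, each upsampled to the final resolution.
    """
    final = pred_residuals[-1].shape[0]
    pred_acc = np.zeros((final, final))
    target_acc = np.zeros((final, final))
    total = 0.0
    for p, t in zip(pred_residuals, target_residuals):
        pred_acc += upsample_nearest(p, final)
        target_acc += upsample_nearest(t, final)
        residual_term = np.mean((p - t) ** 2)             # residual-only supervision
        full_term = np.mean((pred_acc - target_acc) ** 2)  # full reconstruction supervision
        total += residual_term + lam * full_term
    return total

# Toy data: three scales (1x1, 2x2, 4x4) with an error only at the coarsest scale.
targets = [np.zeros((s, s)) for s in (1, 2, 4)]
preds = [np.full((1, 1), 0.1), np.zeros((2, 2)), np.zeros((4, 4))]
loss_residual_only = hierarchical_loss(preds, targets, lam=0.0)
loss_with_full = hierarchical_loss(preds, targets, lam=1.0)
```

In this toy case the residual terms at the two finer scales are zero, so residual-only supervision sees the coarse error once, while the full-reconstruction term keeps penalizing it at every later scale. That is one way to read "exposing accumulated deviations early", under the assumptions stated above.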

Cencen Liu, Dongyang Zhang, Wen Yin, Jielei Wang, Tianyu Li, Ji Guo, Wenbo Jiang, Guoqing Wang, Guoming Lu (1 and 2) ((1) University of Electronic Science and Technology of China, (2) Ubiquitous Intelligence and Trusted Services Key Laboratory of Sichuan Province) • 2026

Related benchmarks

Task	Dataset	Result
Image Super-resolution	RealSR	LPIPS 0.2871	257
Image Super-resolution	DIV2K (val)	LPIPS 0.2955	215
Image Super-resolution	DRealSR	MUSIQ 63.83	182
Image Super-resolution	512 x 512 resolution	Inference Time (s) 0.43	6
