Depth Adaptive Efficient Visual Autoregressive Modeling

About

Visual Autoregressive (VAR) modeling inefficiently applies a fixed computational depth to each position when generating high-resolution images. While existing methods accelerate inference by pruning tokens using frequency maps, their binary hard-pruning approach is fundamentally limited and fails to improve quality even with better frequency estimation. Observing that VAR models possess significant depth redundancy, we propose a paradigm shift from pruning entire tokens to adaptively allocating per-token computational depth. To this end, we introduce DepthVAR, a training-free framework that dynamically allocates computation. It integrates an adaptive depth scheduler, which assigns computational depth via a cyclic rotated schedule for balanced, non-static refinement, with a dynamic inference process that translates these depths into layer-major masks, selectively applies transformer blocks, and blends the resulting codes to ensure each token's influence is proportional to its processing depth. Extensive experiments show that DepthVAR achieves 2.3$\times$-3.1$\times$ acceleration with minimal quality loss, offering a competitive compute-performance trade-off compared to existing hard-pruning approaches. Code is available at https://github.com/STOVAGtz/DepthVAR

Chunliang Li, Tianze Cao, Sanyuan Zhao• 2026

Related benchmarks

Task	Dataset	Result
Text-to-Image Generation	GenEval	Overall Score73.18	914
Human Preference Evaluation	ImageReward	Average Score0.9254	40
Human Preference Alignment	HPS v2.1	Anime Score31.52	16

Showing 3 of 3 rows

Other info

Follow for update

@wizwand_team Discord