JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search

About

We introduce JetViT, a novel family of hybrid-architecture Vision Transformer (ViT) models that match the accuracy of state-of-the-art full-attention vision foundation models while achieving substantially higher inference efficiency on high-resolution images. At the core of our approach is Post-Training Attention Search, a post-training acceleration framework that converts pre-trained full-attention ViTs into efficient hybrid-attention variants by identifying and replacing redundant full-attention blocks with linear or window-attention blocks. By inheriting the MLP and attention weights from the base model, Post-Training Attention Search efficiently explores the architectural design space through three key steps: (1) optimizing the linear-attention block design; (2) finding the best combination of linear-attention and window-attention blocks; and (3) identifying and preserving critical full-attention blocks. We evaluate JetViT on two representative high-resolution vision foundation models, DINOv3 and DepthAnythingV2. On the NVIDIA H100 GPU, JetViT achieves up to 1.79x higher throughput and up to 44.81% lower latency without sacrificing accuracy. We will release our code and accelerated ViT models soon.
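The efficiency argument above rests on replacing full-attention blocks, whose cost grows quadratically with the number of image tokens, with cheaper variants. The abstract does not specify the linear-attention design, so the following is only a minimal NumPy sketch under assumptions: the ReLU feature map and the function names `full_attention`, `linear_attention`, and `linear_attention_quadratic` are illustrative choices, not the authors' implementation. It shows why a kernelized attention block can avoid the N x N score matrix.

```python
import numpy as np

def full_attention(q, k, v):
    """Softmax attention: builds an N x N score matrix, so cost grows as O(N^2 * d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention with a ReLU feature map (an assumed choice).

    Computes phi(Q) @ (phi(K)^T @ V), so no N x N matrix is formed
    and the cost grows as O(N * d^2).
    """
    phi_q = np.maximum(q, 0.0)
    phi_k = np.maximum(k, 0.0)
    kv = phi_k.T @ v                          # (d, d_v) summary of keys and values
    normalizer = phi_q @ phi_k.sum(axis=0)    # (N,) per-query normalization
    return (phi_q @ kv) / (normalizer[:, None] + eps)

def linear_attention_quadratic(q, k, v, eps=1e-6):
    """Same quantity as linear_attention, computed the O(N^2) way as a check."""
    sim = np.maximum(q, 0.0) @ np.maximum(k, 0.0).T   # (N, N) similarity matrix
    return (sim @ v) / (sim.sum(axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
n_tokens, dim = 256, 32                       # hypothetical sizes for illustration
q, k, v = (rng.standard_normal((n_tokens, dim)) for _ in range(3))

out_full = full_attention(q, k, v)
out_lin = linear_attention(q, k, v)
out_ref = linear_attention_quadratic(q, k, v)
```

The two linear-attention functions return the same values up to floating-point error; only the order of the matrix products differs. Reordering is what removes the quadratic term, which matters most at high resolution, where the token count N is large.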

Dongyun Zou, Zhuoyang Zhang, Junyu Chen, Wenkun He, Qinhe Peng, Hanrong Ye, Yao Lu, Hongxu Yin, Yu Wang, Song Han, Han Cai • 2026

Related benchmarks

Task	Dataset	Result
Monocular Depth Estimation	DIODE	AbsRel 22.8	161
Monocular Depth Estimation	Sintel	AbsRel 0.21	142
Monocular Depth Estimation	Cityscapes	Accuracy (delta < 1.25) 87.9	74
Semantic segmentation	ADE20k 512 x 512	mIoU 54.86	24
Single-view depth estimation	DA-2K	Accuracy 98.03	10
Semantic segmentation	Cityscapes 1024x2048px	mIoU 81.92	10

