EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction

About

High-resolution dense prediction enables many appealing real-world applications, such as computational photography and autonomous driving. However, its vast computational cost makes it difficult to deploy state-of-the-art high-resolution dense prediction models on hardware devices. This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention. Unlike prior high-resolution dense prediction models that rely on heavy softmax attention, hardware-inefficient large-kernel convolution, or complicated topology structure to obtain good performance, our multi-scale linear attention achieves a global receptive field and multi-scale learning (two desirable features for high-resolution dense prediction) with only lightweight and hardware-efficient operations. As such, EfficientViT delivers remarkable performance gains over previous state-of-the-art models with significant speedup on diverse hardware platforms, including mobile CPU, edge GPU, and cloud GPU. Without performance loss on Cityscapes, EfficientViT provides up to 13.9x and 6.2x GPU latency reduction over SegFormer and SegNeXt, respectively. For super-resolution, EfficientViT delivers up to 6.4x speedup over Restormer while providing a 0.11 dB gain in PSNR. For Segment Anything, EfficientViT delivers 48.9x higher throughput on an A100 GPU while achieving slightly better zero-shot instance segmentation performance on COCO.
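The abstract contrasts linear attention with "heavy softmax attention" but does not spell out the mechanism. The following is a minimal NumPy sketch of kernel-based linear attention, assuming a ReLU feature map on queries and keys (an assumption here, since the abstract does not name the kernel). It shows why the cost can scale linearly with the number of tokens: by associativity, the key-value product is computed once and reused for every query, so the n-by-n score matrix is never formed. The function names, shapes, and the `eps` parameter are hypothetical, and the multi-scale token aggregation is omitted.

```python
import numpy as np

def relu_linear_attention(q, k, v, eps=1e-6):
    """Linear-cost attention sketch. q, k: (n, d); v: (n, d_v)."""
    q = np.maximum(q, 0.0)          # assumed ReLU feature map on queries
    k = np.maximum(k, 0.0)          # assumed ReLU feature map on keys
    kv = k.T @ v                    # (d, d_v), computed once for all queries
    k_sum = k.sum(axis=0)           # (d,), shared normalizer term
    numerator = q @ kv              # (n, d_v)
    denominator = q @ k_sum + eps   # (n,), eps guards against division by zero
    return numerator / denominator[:, None]

def relu_quadratic_attention(q, k, v, eps=1e-6):
    """Reference version that materializes the full (n, n) score matrix."""
    q = np.maximum(q, 0.0)
    k = np.maximum(k, 0.0)
    scores = q @ k.T                # (n, n), quadratic in the token count
    return (scores @ v) / (scores.sum(axis=1, keepdims=True) + eps)

rng = np.random.default_rng(0)
n, d = 256, 16                      # hypothetical token count and head width
q, k, v = rng.standard_normal((3, n, d))

fast = relu_linear_attention(q, k, v)
slow = relu_quadratic_attention(q, k, v)
print(np.allclose(fast, slow))      # both orderings compute the same quantity
```

The two functions differ only in the order of the matrix products: the linear version costs on the order of n·d·d_v operations, while the reference version costs on the order of n²·d, which dominates at high resolution where n is large.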

Han Cai, Junyan Li, Muyan Hu, Chuang Gan, Song Han • 2022

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K (val)	mIoU 33.12	3069
Instance Segmentation	COCO 2017 (val)	--	1275
Image Classification	ImageNet-1K	Top-1 Acc 72.9	1239
Semantic segmentation	ADE20K	mIoU 50.7	1028
Semantic segmentation	Potsdam (test)	mIoU 73.38	193
Semantic segmentation	LoveDA (test)	mIoU 47.12	92
Semantic segmentation	ISPRS Vaihingen (test)	F1 Score 76.44	47
Image Classification	ImageNet-1k (val)	Top-1 Accuracy 0.835	45
Instance Segmentation	LVIS v1 (val)	--	34
Semantic segmentation	Augmented S2DS (test)	mIoU (Def.) 53.45	13

Showing 10 of 15 rows
