EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction
About
High-resolution dense prediction enables many appealing real-world applications, such as computational photography and autonomous driving. However, the vast computational cost makes it difficult to deploy state-of-the-art high-resolution dense prediction models on hardware devices. This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention. Unlike prior high-resolution dense prediction models that rely on heavy softmax attention, hardware-inefficient large-kernel convolution, or complicated topology structure to achieve good performance, our multi-scale linear attention achieves a global receptive field and multi-scale learning (two desirable features for high-resolution dense prediction) with only lightweight and hardware-efficient operations. As such, EfficientViT delivers remarkable performance gains over previous state-of-the-art models with significant speedup on diverse hardware platforms, including mobile CPU, edge GPU, and cloud GPU. Without performance loss on Cityscapes, EfficientViT provides up to 13.9x and 6.2x GPU latency reduction over SegFormer and SegNeXt, respectively. For super-resolution, EfficientViT delivers up to 6.4x speedup over Restormer while providing a 0.11 dB gain in PSNR. For Segment Anything, EfficientViT delivers 48.9x higher throughput on an A100 GPU while achieving slightly better zero-shot instance segmentation performance on COCO.
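To illustrate why linear attention is lighter than softmax attention, here is a minimal NumPy sketch of single-head linear attention. It assumes a ReLU feature map in place of softmax, which the abstract does not spell out, and it leaves out the multi-scale token aggregation entirely. The function names `relu_linear_attention` and `quadratic_reference` are hypothetical and are not taken from the EfficientViT codebase.

```python
import numpy as np


def relu_linear_attention(q, k, v, eps=1e-6):
    """Single-head linear attention with a ReLU feature map (illustrative sketch).

    q, k: arrays of shape (n, d); v: array of shape (n, d_v).
    K^T V is computed first, so the cost grows linearly with the number
    of tokens n instead of quadratically.
    """
    q = np.maximum(q, 0.0)  # assumed ReLU feature map
    k = np.maximum(k, 0.0)
    kv = k.T @ v            # (d, d_v): key-value summary shared by all queries
    k_sum = k.sum(axis=0)   # (d,): summed keys, used for normalization
    num = q @ kv            # (n, d_v)
    den = q @ k_sum + eps   # (n,)
    return num / den[:, None]


def quadratic_reference(q, k, v, eps=1e-6):
    """Same result computed through the explicit (n, n) similarity matrix."""
    q = np.maximum(q, 0.0)
    k = np.maximum(k, 0.0)
    scores = q @ k.T        # (n, n): the matrix the linear form avoids building
    return (scores @ v) / (scores.sum(axis=1, keepdims=True) + eps)
```

Both functions return the same values up to floating-point error. The difference is the order of the matrix products: the linear form never builds the n-by-n similarity matrix, and n becomes very large for high-resolution inputs.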
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Semantic Segmentation | ADE20K (val) | mIoU | 33.12 | 2731 |
| Instance Segmentation | COCO 2017 (val) | -- | -- | 1144 |
| Semantic Segmentation | ADE20K | mIoU | 50.7 | 936 |
| Semantic Segmentation | Potsdam (test) | mIoU | 73.38 | 104 |
| Semantic Segmentation | LoveDA (test) | mIoU | 47.12 | 81 |
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy | 0.835 | 45 |
| Instance Segmentation | LVIS v1 (val) | -- | -- | 34 |
| Semantic Segmentation | ISPRS Vaihingen (test) | mIoU | 66.46 | 22 |
| Image Classification | ImageNet-1K 1.0 (val) | Zero-shot Acc | 71.73 | 11 |
| Super-Resolution | BSD100 160x240 to 320x480 (test) | PSNR | 32.33 | 6 |