
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization

About

The recent amalgamation of transformer and convolutional designs has led to steady improvements in the accuracy and efficiency of models. In this work, we introduce FastViT, a hybrid vision transformer architecture that obtains the state-of-the-art latency-accuracy trade-off. To this end, we introduce a novel token mixing operator, RepMixer, a building block of FastViT, that uses structural reparameterization to lower the memory access cost by removing skip-connections in the network. We further apply train-time overparametrization and large kernel convolutions to boost accuracy and empirically show that these choices have minimal effect on latency. We show that our model is 3.5x faster than CMT, a recent state-of-the-art hybrid transformer architecture, 4.9x faster than EfficientNet, and 1.9x faster than ConvNeXt on a mobile device for the same accuracy on the ImageNet dataset. At similar latency, our model obtains 4.2% better Top-1 accuracy on ImageNet than MobileOne. Our model consistently outperforms competing architectures across several tasks (image classification, detection, segmentation, and 3D mesh regression), with significant improvement in latency on both a mobile device and a desktop GPU. Furthermore, our model is highly robust to out-of-distribution samples and corruptions, improving over competing robust models. Code and models are available at https://github.com/apple/ml-fastvit.
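The abstract states that RepMixer removes skip-connections through structural reparameterization but does not spell out the mechanics. The sketch below illustrates the general idea under simplifying assumptions: a single-channel block of the form y = x + conv(x) is rewritten as one convolution by adding an identity kernel to the weights. The function names are hypothetical, the code is pure Python rather than the authors' implementation, and normalization layers that a real block would likely include are omitted.

```python
# Illustrative sketch only: folding a skip connection into a conv kernel.
# Assumes one channel, odd kernel size, stride 1, zero ("same") padding.
# All names here are hypothetical and not taken from the FastViT code base.

def depthwise_conv2d(x, kernel):
    """Apply a 'same'-padded 2D convolution (cross-correlation) to one channel."""
    k = len(kernel)
    pad = k // 2
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    ii, jj = i + di - pad, j + dj - pad
                    if 0 <= ii < h and 0 <= jj < w:  # zero padding outside
                        acc += kernel[di][dj] * x[ii][jj]
            out[i][j] = acc
    return out


def train_time_block(x, kernel):
    """Train-time form with an explicit skip connection: y = x + conv(x)."""
    y = depthwise_conv2d(x, kernel)
    return [[x[i][j] + y[i][j] for j in range(len(x[0]))] for i in range(len(x))]


def fold_skip_into_kernel(kernel):
    """Add 1 at the kernel center so conv(x, folded) equals x + conv(x, kernel)."""
    folded = [row[:] for row in kernel]
    c = len(kernel) // 2
    folded[c][c] += 1.0
    return folded


def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two 2D lists."""
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


x = [[1.0, 2.0, 3.0, 4.0],
     [0.5, -1.0, 2.5, 0.0],
     [3.0, 1.5, -2.0, 1.0],
     [0.0, 4.0, 1.0, -0.5]]
kernel = [[0.1, -0.2, 0.3],
          [0.0, 0.5, -0.1],
          [0.2, 0.1, -0.3]]

folded_kernel = fold_skip_into_kernel(kernel)
y_train = train_time_block(x, kernel)             # two memory reads of x
y_infer = depthwise_conv2d(x, folded_kernel)      # single branch, no skip
diff = max_abs_diff(y_train, y_infer)
print("max difference between the two forms:", diff)
```

The two forms agree up to floating-point rounding, which is why the skip branch can be dropped at inference time without changing the function the block computes; the saving comes from not having to read the input a second time for the addition.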

Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, Anurag Ranjan • 2023

Related benchmarks

Task                               Dataset                 Result               Rank
Semantic segmentation              ADE20K (val)            mIoU 41              2731
Object Detection                   COCO 2017 (val)         --                   2454
Image Classification               ImageNet-1K 1.0 (val)   Top-1 Accuracy 83.9  1866
Instance Segmentation              COCO 2017 (val)         APm 0.359            1144
Semantic segmentation              ADE20K                  mIoU 38              936
Image Classification               ImageNet-1K             Top-1 Acc 80.3       836
Fine-grained Image Classification  CUB200 2011 (test)      Accuracy 49.9        536
Image Classification               ImageNet-1k (val)       Top-1 Accuracy 84.6  512
Object Detection                   COCO 2017               AP (Box) 38.9        279
Instance Segmentation              COCO 2017               APm 35.9             199
Showing 10 of 34 rows
