
iFormer: Integrating ConvNet and Transformer for Mobile Application

About

We present a new family of mobile hybrid vision networks, called iFormer, with a focus on optimizing latency and accuracy on mobile applications. iFormer effectively integrates the fast local representation capacity of convolution with the efficient global modeling ability of self-attention. The local interactions are derived from transforming a standard convolutional network, i.e., ConvNeXt, to design a more lightweight mobile network. Our newly introduced mobile modulation attention removes memory-intensive operations in MHA and employs an efficient modulation mechanism to boost dynamic global representational capacity. We conduct comprehensive experiments demonstrating that iFormer outperforms existing lightweight networks across various tasks. Notably, iFormer achieves an impressive Top-1 accuracy of 80.4% on ImageNet-1k with a latency of only 1.10 ms on an iPhone 13, surpassing the recently proposed MobileNetV4 under similar latency constraints. Additionally, our method shows significant improvements in downstream tasks, including COCO object detection, instance segmentation, and ADE20k semantic segmentation, while still maintaining low latency on mobile devices for high-resolution inputs in these scenarios.
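To illustrate the general idea behind modulation-style attention, here is a minimal NumPy sketch. This is an assumption-laden illustration, not the authors' implementation: it shows how a single global context vector, formed by attention-weighted pooling in O(N·C), can modulate each token elementwise, avoiding the O(N²) attention matrix of standard MHA. The function and weight names (`modulation_attention`, `w_ctx`, `w_val`, `w_out`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def modulation_attention(x, w_ctx, w_val, w_out):
    """Sketch of a modulation-style alternative to multi-head attention.

    x: (N, C) array of N tokens with C channels.
    Instead of an N x N attention matrix, one scalar score per token
    pools the values into a single (C,) global context vector, which
    then modulates every token elementwise before the output projection.
    """
    scores = softmax(x @ w_ctx, axis=0)       # (N, 1) pooling weights
    context = (scores * (x @ w_val)).sum(0)   # (C,)  global context
    return (x * context) @ w_out              # elementwise modulation

rng = np.random.default_rng(0)
N, C = 16, 8
x = rng.standard_normal((N, C))
out = modulation_attention(
    x,
    rng.standard_normal((C, 1)),
    rng.standard_normal((C, C)),
    rng.standard_normal((C, C)),
)
print(out.shape)
```

The key design point this sketch captures is that the global interaction cost grows linearly with the number of tokens, which matters for high-resolution inputs on mobile devices.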

Chuanyang Zheng• 2025

Related benchmarks

Task                   Dataset                                  Metric          Result  Rank
Semantic segmentation  ADE20K (val)                             mIoU            44.5    2888
Object detection       COCO 2017 (val)                          --              --      2643
Instance segmentation  COCO 2017 (val)                          --              --      1201
Boundary detection     CO2 Farm Thermal Gas Dataset 1.0 (test)  BF1 Score       69.33   17
Semantic segmentation  CO2 Farm Thermal Gas Dataset 1.0 (test)  mIoU            96.08   17
Image classification   CO2 Farm Thermal Gas Dataset 1.0 (test)  Accuracy        51      17
Image classification   ImageNet-1k (val)                        Top-1 Accuracy  82.7    15
Image classification   Waterhemp (test)                         mAcc            77.51   13
Semantic segmentation  Waterhemp (test)                         mIoU            94.05   13
