Visual Transformers: Token-based Image Representation and Processing for Computer Vision

About

Computer vision has achieved remarkable success by (a) representing images as uniformly-arranged pixel arrays and (b) convolving highly-localized features. However, convolutions treat all image pixels equally regardless of importance; explicitly model all concepts across all images, regardless of content; and struggle to relate spatially-distant concepts. In this work, we challenge this paradigm by (a) representing images as semantic visual tokens and (b) running transformers to densely model token relationships. Critically, our Visual Transformer operates in a semantic token space, judiciously attending to different image parts based on context. This is in sharp contrast to pixel-space transformers that require orders-of-magnitude more compute. Using an advanced training recipe, our VTs significantly outperform their convolutional counterparts, raising ResNet accuracy on ImageNet top-1 by 4.6 to 7 points while using fewer FLOPs and parameters. For semantic segmentation on LIP and COCO-stuff, VT-based feature pyramid networks (FPN) achieve 0.35 points higher mIoU while reducing the FPN module's FLOPs by 6.5x.
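The compute gap between pixel-space and token-space transformers can be illustrated with a rough count. Self-attention compares every element with every other element, so its cost grows quadratically with the number of elements. The sketch below counts only the two matrix products in one attention layer and ignores projections, softmax, and the cost of producing the tokens. The feature-map size, channel width, and token count are illustrative assumptions, not values taken from the abstract, and `attention_pair_cost` is a hypothetical helper.

```python
def attention_pair_cost(num_elements: int, channels: int) -> int:
    """Approximate multiply-adds for one self-attention layer.

    Counts the score product (N x C times C x N) and the weighted sum
    (N x N times N x C), each costing N * N * C multiply-adds.
    Projections and softmax are ignored.
    """
    return 2 * num_elements * num_elements * channels


# Illustrative values only (assumptions, not from the paper's abstract).
height, width, channels = 56, 56, 64
num_tokens = 16

# Pixel-space attention: every spatial position is an element.
pixel_cost = attention_pair_cost(height * width, channels)

# Token-space attention: only a small set of tokens interact.
token_cost = attention_pair_cost(num_tokens, channels)

ratio = pixel_cost // token_cost
print(f"pixel-space: {pixel_cost:,} multiply-adds")
print(f"token-space: {token_cost:,} multiply-adds")
print(f"ratio:       {ratio:,}x")
```

Under these assumed sizes the pairwise attention cost differs by a factor of 38,416, which is (3136 / 16) squared. The real saving is smaller once the cost of grouping pixels into tokens is included, since that step is not counted here.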

Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, Peter Vajda • 2020

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Human Perception Regression | Street View Imagery | RMSE 1.5271 | 39 |
| Imaging Platform Classification | Street View Imagery | F1 Score 35.4 | 39 |
| Socio-economic Indicator Regression | Street View Imagery | RMSE 0.7012 | 39 |
| View Direction Classification | Street View Imagery | F1 Score 58.5 | 39 |
| Lung nodule classification | LIDC-IDRI | AUC 81.92 | 36 |
| Sensitivity Analysis of Radiomic Features | LUNA | Odds Ratio (OR) 8.4 | 28 |
| Sensitivity Analysis of Radiomic Features | RadioLung | Odds Ratio (n) 20 | 28 |
| Lung nodule classification | USTC-FHLN | Accuracy 80.43 | 13 |
| WSI Classification | DHMC | Weighted F1 Score 68.4 | 12 |
| Urban Perception Prediction | Urban Perception Wealthy (test) | Macro F1 Score 56.3 | 11 |
Showing 10 of 12 rows
