Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

Vision Transformers with Mixed-Resolution Tokenization

About

Vision Transformer models process input images by dividing them into a spatially regular grid of equal-size patches. Conversely, Transformers were originally introduced over natural language sequences, where each token represents a subword - a chunk of raw data of arbitrary size. In this work, we apply this approach to Vision Transformers by introducing a novel image tokenization scheme, replacing the standard uniform grid with a mixed-resolution sequence of tokens, where each token represents a patch of arbitrary size. Using the Quadtree algorithm and a novel saliency scorer, we construct a patch mosaic where low-saliency areas of the image are processed in low resolution, routing more of the model's capacity to important image regions. Using the same architecture as vanilla ViTs, our Quadformer models achieve substantial accuracy gains on image classification when controlling for the computational budget. Code and models are publicly available at https://github.com/TomerRonen34/mixed-resolution-vit .

Tomer Ronen, Omer Levy, Avram Golbert• 2023

Related benchmarks

TaskDatasetResultRank
Object Hallucination EvaluationPOPE
Accuracy86.8
2019
Visual Question AnsweringGQA
Accuracy61.1
1425
Visual Question AnsweringVQA v2
Accuracy78.9
333
Multimodal Model EvaluationMMBench
Accuracy65.8
204
Multimodal EvaluationMMBench CN
Accuracy62.5
120
Semantic segmentationADE20K
mIoU58.83
71
Large Multimodal Model EvaluationMM-Vet
Average Score33.9
69
Object DetectionCOCO
mAP61.75
41
Visual Question AnsweringSQA-I
SQA-I Accuracy72
34
Visual Question AnsweringVQA-T
Accuracy59.1
29
Showing 10 of 12 rows

Other info

Follow for update