
Multimodal Autoregressive Pre-training of Large Vision Encoders

About

We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.
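The pairing described above — a bidirectional vision encoder whose patch features are fed into a causal multimodal decoder that regresses the next raw image patch and predicts the next text token — can be sketched as follows. This is an illustrative toy, not the paper's architecture: the layer counts, dimensions, loss weighting, and all names here are assumptions.

```python
# Toy sketch of AIMV2-style multimodal autoregressive pre-training.
# All sizes and the 2-layer stacks are illustrative assumptions.
import torch
import torch.nn as nn

PATCH_DIM, EMBED_DIM, VOCAB = 48, 64, 1000  # assumed toy sizes


class AIMv2Sketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Vision encoder: embeds image patches with bidirectional attention.
        self.patch_embed = nn.Linear(PATCH_DIM, EMBED_DIM)
        enc_layer = nn.TransformerEncoderLayer(EMBED_DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Multimodal decoder: causal attention over [patch features; text tokens].
        self.tok_embed = nn.Embedding(VOCAB, EMBED_DIM)
        dec_layer = nn.TransformerEncoderLayer(EMBED_DIM, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=2)
        self.patch_head = nn.Linear(EMBED_DIM, PATCH_DIM)  # regress raw patches
        self.text_head = nn.Linear(EMBED_DIM, VOCAB)       # predict text tokens

    def forward(self, patches, tokens):
        # patches: (B, P, PATCH_DIM) float; tokens: (B, T) int64
        feats = self.encoder(self.patch_embed(patches))
        seq = torch.cat([feats, self.tok_embed(tokens)], dim=1)
        L = seq.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.decoder(seq, mask=causal)
        P = patches.size(1)
        # Next-patch regression on image positions (targets shifted by one).
        patch_loss = nn.functional.mse_loss(
            self.patch_head(h[:, : P - 1]), patches[:, 1:])
        # Next-token cross-entropy on text positions
        # (the last image position predicts the first token).
        logits = self.text_head(h[:, P - 1 : -1])
        text_loss = nn.functional.cross_entropy(
            logits.reshape(-1, VOCAB), tokens.reshape(-1))
        return patch_loss + text_loss


model = AIMv2Sketch()
loss = model(torch.randn(2, 16, PATCH_DIM), torch.randint(0, VOCAB, (2, 8)))
```

After pre-training, the decoder would be discarded and only the encoder kept as the generalist vision backbone.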

Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T. Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Object Detection | COCO 2017 (val) | AP 54 | 2454 |
| Visual Question Answering | VQA v2 | Accuracy 80.9 | 1165 |
| Instance Segmentation | COCO 2017 (val) | — | 1144 |
| Visual Question Answering | TextVQA | Accuracy 73.1 | 1117 |
| Visual Question Answering | GQA | Accuracy 73.3 | 963 |
| Image Classification | Stanford Cars | — | 477 |
| Image Classification | CIFAR-10 | — | 471 |
| Image Classification | iNaturalist 2018 | Top-1 Accuracy 85.9 | 287 |
| Visual Question Answering | OKVQA | Top-1 Accuracy 61.7 | 283 |
| Visual Question Answering | ChartQA | Accuracy 22.6 | 239 |

Showing 10 of 67 rows
