
PaliGemma: A versatile 3B VLM for transfer

About

PaliGemma is an open Vision-Language Model (VLM) based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that transfers effectively, and it achieves strong performance on a wide variety of open-world tasks. PaliGemma is evaluated on almost 40 diverse tasks, including standard VLM benchmarks as well as more specialized tasks such as remote sensing and segmentation.
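The architecture described above (a vision encoder whose features are fed to a language model) can be sketched with toy tensors. This is a minimal illustration of the general pattern, not the paper's exact configuration: the patch count, sequence length, and random linear projection are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (hidden dims chosen to match SigLIP-So400m / Gemma-2B,
# but patch count and text length are arbitrary here).
num_patches, vision_dim = 256, 1152
text_len, lm_dim = 16, 2048

# Stub "vision encoder" output: one feature vector per image patch.
image_features = rng.normal(size=(num_patches, vision_dim))

# A linear projection maps vision features into the LM embedding space.
proj = rng.normal(size=(vision_dim, lm_dim)) / np.sqrt(vision_dim)
image_tokens = image_features @ proj                 # (256, 2048)

# Stubbed text token embeddings from the LM's embedding table.
text_tokens = rng.normal(size=(text_len, lm_dim))

# Image tokens are prepended as a prefix to the text tokens,
# forming the combined input sequence for the language model.
lm_input = np.concatenate([image_tokens, text_tokens], axis=0)
print(lm_input.shape)  # (272, 2048)
```

The key design choice this sketches is that the image contributes a fixed-length prefix of soft tokens, so the language model processes image and text in one sequence.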

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias Bauer, Matko Bošnjak, Xi Chen, Matthias Minderer, Paul Voigtlaender, Ioana Bica, Ivana Balazevic, Joan Puigcerver, Pinelopi Papalampidi, Olivier Henaff, Xi Xiong, Radu Soricut, Jeremiah Harmsen, Xiaohua Zhai • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | – | – | 1455 |
| Visual Question Answering | TextVQA | – | – | 1285 |
| Multimodal Understanding | MMBench | Accuracy | 65.6 | 637 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 33.1 | 531 |
| Visual Question Answering | GQA | Accuracy | 62.57 | 505 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score | 33.1 | 431 |
| Mathematical Reasoning | MathVista | Score | 28.7 | 385 |
| Visual Question Answering | ChartQA | Accuracy | 33.1 | 371 |
| Multimodal Understanding | SEED-Bench | – | – | 343 |
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy | 76.3 | 337 |

Showing 10 of 66 rows.
