PaliGemma 2: A Family of Versatile VLMs for Transfer

About

PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based on the Gemma 2 family of language models. We combine the SigLIP-So400m vision encoder that was also used by PaliGemma with the whole range of Gemma 2 models, from the 2B one all the way up to the 27B model. We train these models at three resolutions (224px, 448px, and 896px) in multiple stages to equip them with broad knowledge for transfer via fine-tuning. The resulting family of base models, covering different model sizes and resolutions, allows us to investigate factors impacting transfer performance (such as learning rate) and to analyze the interplay between the type of task, model size, and resolution. We further increase the number and breadth of transfer tasks beyond the scope of PaliGemma, including different OCR-related tasks such as table structure recognition, molecular structure recognition, and music score recognition, as well as long fine-grained captioning and radiography report generation, on which PaliGemma 2 obtains state-of-the-art results.
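
To make the size and resolution grid described above concrete, here is a minimal Python sketch that enumerates the family and estimates how many image tokens each resolution produces. It rests on several assumptions not stated in the abstract: a patch size of 14 for the SigLIP-So400m encoder, a 9B model in the middle of the Gemma 2 range (the abstract names only 2B and 27B), and a full cross product of sizes and resolutions. The names `num_image_tokens` and `family_grid` are hypothetical helpers for illustration, not part of any released API.

```python
from itertools import product

# Assumption: SigLIP-So400m splits the image into 14x14-pixel patches.
PATCH_SIZE = 14

# Resolutions are taken from the abstract.
RESOLUTIONS = (224, 448, 896)

# Assumption: the Gemma 2 range is 2B, 9B, 27B (the abstract names only 2B and 27B).
LM_SIZES = ("2B", "9B", "27B")


def num_image_tokens(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of patch tokens for a square image with the given side length."""
    patches_per_side = resolution // patch_size
    return patches_per_side * patches_per_side


def family_grid() -> list[dict]:
    """Enumerate every (language model size, resolution) pair.

    Assumption: the family is the full cross product of sizes and resolutions.
    """
    return [
        {"lm": lm, "resolution": res, "image_tokens": num_image_tokens(res)}
        for lm, res in product(LM_SIZES, RESOLUTIONS)
    ]


if __name__ == "__main__":
    for row in family_grid():
        print(f"Gemma 2 {row['lm']:>3} @ {row['resolution']}px: "
              f"{row['image_tokens']} image tokens")
```

Under these assumptions, doubling the resolution quadruples the number of image tokens the language model has to process, which is one reason the interplay between task type, model size, and resolution matters for transfer cost.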

Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, Siyang Qin, Reeve Ingle, Emanuele Bugliarello, Sahar Kazemzadeh, Thomas Mesnard, Ibrahim Alabdulmohsin, Lucas Beyer, Xiaohua Zhai • 2024

Related benchmarks

Task	Dataset	Result
Visual Question Answering	GQA	Accuracy 68.3	1445
Visual Question Answering	VQA v2	Accuracy 85.8	1429
Diagram Question Answering	AI2D	AI2D Accuracy 84.6	509
Video Question Answering	ActivityNet-QA	--	438
Chart Question Answering	ChartQA	Accuracy 66.4	404
Radiology Report Generation	MIMIC-CXR (test)	--	235
Document Visual Question Answering	DocVQA	Accuracy 76.6	203
Document Visual Question Answering	DocVQA (val)	Accuracy 69.8	178
Image Captioning	COCO	CIDEr 145.2	130
Robotic Manipulation	Calvin ABC->D	Task-1 Score 90.1	71

Showing 10 of 45 rows
