PALO: A Polyglot Large Multimodal Model for 5B People

About

In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called PALO. PALO offers visual reasoning capabilities in 10 major languages (English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese) that together span ~5B people (65% of the world population). Our approach uses a semi-automated translation pipeline to adapt the multimodal instruction dataset from English to the target languages using a fine-tuned Large Language Model, thereby ensuring high linguistic fidelity while remaining scalable thanks to minimal manual effort. Incorporating diverse instruction sets boosts overall performance across multiple languages, especially underrepresented ones such as Hindi, Arabic, Bengali, and Urdu. The resulting models are trained at three scales (1.7B, 7B, and 13B parameters) to demonstrate generalization and scalability, and we observe substantial improvements over strong baselines. We also propose the first multilingual multimodal benchmark for future approaches to evaluate their vision-language reasoning capabilities across languages. Code: https://github.com/mbzuai-oryx/PALO.

Muhammad Maaz, Hanoona Rasheed, Abdelrahman Shaker, Salman Khan, Hisham Cholakal, Rao M. Anwer, Tim Baldwin, Michael Felsberg, Fahad S. Khan • 2024

Related benchmarks

Task	Dataset	Result
Multimodal Understanding	MMStar	Accuracy 32.95	511
Multimodal Understanding	SEEDBench2 Plus	Accuracy 38.08	138
Multimodal Understanding	MMMU	Accuracy 33.11	34
Multilingual Multimodal Multiple-Choice Question Answering	Afri-MCQA	Average Accuracy 24.47	15
Visual Question Answering	CVQA	--	14
Multilingual Visual Question Answering	MaXM	Avg. Score (MaXM) 28.68	11
Multimodal Understanding	XMMMU	Avg_mul 31.3	11
Visual Question Answering	xGQA	Avg_mul Score 36.93	10
Multicultural Visual Reasoning	MaRVL	Avg_mul Score 50.73	10
