Parrot: Multilingual Visual Instruction Tuning

About

The rapid development of Multimodal Large Language Models (MLLMs), such as GPT-4o, marks a significant step toward artificial general intelligence. Existing methods typically align vision encoders with LLMs via supervised fine-tuning (SFT), but this often deteriorates their ability to handle multiple languages as training progresses. We empirically observe that imbalanced SFT datasets, largely English-centric, degrade performance on non-English languages due to the failure in multilingual token alignment. To address this, we propose PARROT, a novel approach that leverages textual guidance for visual token alignment at the language level. PARROT conditions visual tokens on diverse language inputs and uses Mixture-of-Experts (MoE) to align multilingual tokens. By computing cross-attention between initial visual features and textual embeddings, we select the most relevant experts, converting visual tokens into language-specific representations. Additionally, we introduce the Massive Multilingual Multimodal Benchmark (MMMB), a new benchmark comprising 6 languages, 15 categories, and 12,000 questions, to assess multilingual capabilities. PARROT achieves state-of-the-art performance on both the multilingual benchmarks and a wide range of multimodal tasks. Code and dataset are available at: https://github.com/AIDC-AI/Parrot
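The routing step described in the abstract (cross-attention between visual features and text embeddings, followed by expert selection) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function name `language_guided_moe`, the mean pooling before the router, the linear experts, the residual connection, and all shapes (including the choice of six experts) are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def language_guided_moe(visual_tokens, text_embeds, router_w, expert_ws):
    """Hypothetical sketch of text-guided expert routing for visual tokens.

    visual_tokens: (n_v, d) initial visual features
    text_embeds:   (n_t, d) embeddings of the text prompt
    router_w:      (d, n_experts) router projection (assumed linear)
    expert_ws:     (n_experts, d, d) one linear map per expert (assumed)
    """
    d = visual_tokens.shape[-1]
    # Cross-attention: visual tokens are queries, text embeddings are keys/values.
    attn = softmax(visual_tokens @ text_embeds.T / np.sqrt(d), axis=-1)  # (n_v, n_t)
    guided = attn @ text_embeds                                          # (n_v, d)
    # Assumption: pool the text-guided features into one vector and route on it.
    gate = softmax(guided.mean(axis=0) @ router_w)                       # (n_experts,)
    # Apply every expert to the visual tokens, then mix by the gate weights.
    expert_out = np.einsum("vd,edk->evk", visual_tokens, expert_ws)      # (n_experts, n_v, d)
    mixed = np.einsum("e,evk->vk", gate, expert_out)                     # (n_v, d)
    # Assumption: a residual connection keeps the original visual content.
    return visual_tokens + mixed, gate

rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 8))            # 4 visual tokens, dimension 8
text = rng.normal(size=(3, 8))              # 3 text tokens, dimension 8
router_w = rng.normal(size=(8, 6))          # 6 experts (arbitrary choice)
expert_ws = rng.normal(size=(6, 8, 8)) * 0.1

out, gate = language_guided_moe(visual, text, router_w, expert_ws)
print(out.shape, gate.shape)
```

In this sketch the gate depends on the text prompt, so the same image is mapped to different visual representations for prompts in different languages. The gate here is a dense softmax over all experts; a top-k selection would be an equally plausible reading of "select the most relevant experts".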

Hai-Long Sun, Da-Wei Zhou, Yang Li, Shiyin Lu, Chao Yi, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye • 2024

Related benchmarks

Task	Dataset	Result
Multimodal Multilingual Reasoning	MMMB	English Accuracy: 80.1	39
Multilingual text-centric visual question answering	MTVQA	--	37
Multilingual Multimodal Reasoning	MMMB, Multilingual MMBench, and MTVQA Combined	Overall Accuracy: 59.7	18
Multilingual Vision-Language Reasoning	MMBench Multilingual	Accuracy (en): 78	18

