TowerVision: Understanding and Improving Multilinguality in Vision-Language Models

About

Despite significant advances in vision-language models (VLMs), most existing work follows an English-centric design process, limiting their effectiveness in multilingual settings. In this work, we provide a comprehensive empirical study analyzing the impact of several multilingual design choices, such as training data composition, encoder selection, and text backbones. The result is TowerVision, a family of open multilingual VLMs for both image-text and video-text tasks, built upon the multilingual text-only model Tower+. TowerVision achieves competitive performance on multiple multimodal multilingual benchmarks and shows particular strength in culturally grounded tasks and multimodal translation. By incorporating visual and cultural context during fine-tuning, our models surpass existing approaches trained on substantially larger datasets, as demonstrated on ALM-Bench and Multi30K (image tasks) and ViMUL-Bench (video tasks). Alongside the models, we release VisionBlocks, a high-quality, curated vision-language dataset. Our findings highlight that multilingual vision-language training data substantially improves cross-lingual generalization -- both from high-resource to underrepresented languages and vice versa -- and that instruction-tuned LLMs are not always the optimal initialization point. To support further research, we publicly release all models, data, and training recipes.

André G. Viveiros, Patrick Fernandes, Saul Santos, Sonal Sannigrahi, Emmanouil Zaranis, Nuno M. Guerreiro, Amin Farajian, Pierre Colombo, Graham Neubig, André F. T. Martins • 2025

Related benchmarks

Task	Dataset	Metric	Result
Captioning	COCO pt-PT	Accuracy	24.4	16
Mathematics	MVision pt-PT	Accuracy	14.6	16
Spatial Understanding	RefRec pt-PT	Accuracy	8.3	16
General VQA	POPE pt-PT	Accuracy	85.1	16
Captioning	RefCap pt-PT	Accuracy	4.9	16
General VQA	SEED pt-PT	Accuracy	70.8	16
General VQA	MMMU pt-PT	Accuracy	39.1	16
OCR & Document Understanding	TxtVQA pt-PT	Accuracy	58.3	16
OCR & Document Understanding	InfoVQA pt-PT	Accuracy	38.3	16
Chart & Diagram Understanding	AI2D pt-PT	Accuracy	65	16

Showing 10 of 19 rows
