POINTS: Improving Your Vision-language Model with Affordable Strategies

About

In recent years, vision-language models have made significant strides, excelling in tasks like optical character recognition and geometric problem-solving. However, several critical issues remain: 1) Proprietary models often lack transparency about their architectures, while open-source models need more detailed ablations of their training strategies. 2) Pre-training data in open-source works is under-explored, with datasets added empirically, making the process cumbersome. 3) Fine-tuning often focuses on adding datasets, leading to diminishing returns. To address these issues, we propose the following contributions: 1) We trained a robust baseline model using the latest advancements in vision-language models, introducing effective improvements and conducting comprehensive ablation and validation for each technique. 2) Inspired by recent work on large language models, we filtered pre-training data using perplexity, selecting the lowest perplexity data for training. This approach allowed us to train on a curated 1M dataset, achieving competitive performance. 3) During visual instruction tuning, we used model soup on different datasets when adding more datasets yielded marginal improvements. These innovations resulted in a 9B parameter model that performs competitively with state-of-the-art models. Our strategies are efficient and lightweight, making them easily adoptable by the community.
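The two data and weight strategies named in the abstract, perplexity-based filtering and model soup, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes per-token log-probabilities have already been computed by some reference model, that a fixed fraction of the lowest-perplexity samples is kept, and that the soup is a plain uniform average of checkpoints fine-tuned on different datasets. All names here (`perplexity`, `select_lowest_perplexity`, `uniform_soup`, `keep_fraction`) are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of one sample from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_lowest_perplexity(samples, keep_fraction):
    """Return the ids of the keep_fraction of samples with the lowest perplexity.

    samples: list of (sample_id, token_logprobs) pairs (assumed format).
    """
    ranked = sorted(samples, key=lambda s: perplexity(s[1]))
    keep = max(1, int(len(ranked) * keep_fraction))
    return [sample_id for sample_id, _ in ranked[:keep]]


def uniform_soup(checkpoints):
    """Average parameters elementwise across checkpoints that share the same keys.

    checkpoints: list of dicts mapping parameter name -> list of floats
    (a stand-in for real model state dicts).
    """
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in checkpoints[0]
    }


# Toy pre-training pool: lower (less negative) log-probs mean lower perplexity.
samples = [
    ("a", [-0.1, -0.2]),
    ("b", [-2.0, -1.5]),
    ("c", [-0.5, -0.4]),
    ("d", [-3.0, -2.5]),
]
kept = select_lowest_perplexity(samples, keep_fraction=0.5)

# Toy checkpoints, each imagined as fine-tuned on a different dataset mix.
soup = uniform_soup([
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
])
```

In practice the checkpoints would be framework tensors, not Python lists, and other soup variants (for example, greedily adding checkpoints only when validation scores improve) follow the same averaging idea.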

Yuan Liu, Zhongyin Zhao, Ziyuan Zhuang, Le Tian, Xiao Zhou, Jie Zhou • 2024

Related benchmarks

Task	Dataset	Result
Science Question Answering	ScienceQA	--	916
Multimodal Evaluation	MME	--	902
Visual Mathematical Reasoning	MathVista	Accuracy 63.1	448
Multi-discipline Multimodal Understanding	MMMU	--	422
Diagram Understanding	AI2D	Accuracy 80.9	377
OCR Evaluation	OCRBench	Score 72	350
Visual Understanding	MM-Vet	MM-Vet Score 52.3	190
Hallucination Evaluation	HallusionBench	--	153
Multimodal Conversation	LLaVA-Bench Wild	Score 71.1	78
Visual Reasoning	MMStar	Accuracy 60.9	51

Showing 10 of 20 rows
