
POINTS: Improving Your Vision-language Model with Affordable Strategies

About

In recent years, vision-language models have made significant strides, excelling in tasks like optical character recognition and geometric problem-solving. However, several critical issues remain: 1) Proprietary models often lack transparency about their architectures, while open-source models need more detailed ablations of their training strategies. 2) Pre-training data in open-source works is under-explored, with datasets added empirically, making the process cumbersome. 3) Fine-tuning often focuses on adding datasets, leading to diminishing returns. To address these issues, we propose the following contributions: 1) We trained a robust baseline model using the latest advancements in vision-language models, introducing effective improvements and conducting comprehensive ablation and validation for each technique. 2) Inspired by recent work on large language models, we filtered pre-training data using perplexity, selecting the lowest perplexity data for training. This approach allowed us to train on a curated 1M dataset, achieving competitive performance. 3) During visual instruction tuning, we used model soup on different datasets when adding more datasets yielded marginal improvements. These innovations resulted in a 9B parameter model that performs competitively with state-of-the-art models. Our strategies are efficient and lightweight, making them easily adoptable by the community.
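The two data and training strategies named in the abstract, perplexity-based filtering and model soup, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `keep_fraction` parameter, the `log_probs` field, and the choice of a plain uniform average for the soup are all assumptions made for illustration. Per-token log-probabilities are assumed to come from some reference model that is not shown here.

```python
import math
from typing import Dict, List, Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity of one sample: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def select_lowest_perplexity(samples: List[dict], keep_fraction: float) -> List[dict]:
    """Keep the fraction of samples with the lowest perplexity.

    Each sample is a dict with a hypothetical "log_probs" field holding
    per-token log-probabilities from a reference model (an assumption).
    """
    ranked = sorted(samples, key=lambda s: perplexity(s["log_probs"]))
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]


def uniform_soup(checkpoints: List[Dict[str, List[float]]]) -> Dict[str, List[float]]:
    """Average same-named parameters element-wise across checkpoints.

    Uniform averaging is only one possible soup variant; it is used here
    as an assumption, with parameters represented as flat lists of floats.
    """
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in checkpoints[0]
    }


if __name__ == "__main__":
    data = [
        {"id": "a", "log_probs": [-0.1, -0.2]},
        {"id": "b", "log_probs": [-2.0, -3.0]},
        {"id": "c", "log_probs": [-0.5, -0.5]},
        {"id": "d", "log_probs": [-1.0, -1.0]},
    ]
    kept = select_lowest_perplexity(data, keep_fraction=0.5)
    print([s["id"] for s in kept])

    soup = uniform_soup([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}])
    print(soup)
```

In practice the checkpoints would be models fine-tuned on different instruction datasets from the same initialization, so that their parameters share names and shapes and can be averaged directly.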

Yuan Liu, Zhongyin Zhao, Ziyuan Zhuang, Le Tian, Xiao Zhou, Jie Zhou • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Multimodal Evaluation | MME | -- | 557 |
| OCR Evaluation | OCRBench | Score 72 | 296 |
| Multi-discipline Multimodal Understanding | MMMU | -- | 266 |
| Science Question Answering | ScienceQA | -- | 229 |
| Visual Mathematical Reasoning | MathVista | Accuracy 63.1 | 189 |
| Diagram Understanding | AI2D | Accuracy 80.9 | 167 |
| Visual Understanding | MM-Vet | MM-Vet Score 52.3 | 102 |
| Hallucination Evaluation | HallusionBench | Average Score 48 | 93 |
| Multimodal Conversation | LLaVA-Bench Wild | Score 71.1 | 52 |
| Multi-modal Visual Capability | MMStar | Score 61 | 20 |
Showing 10 of 13 rows
