Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping

About

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across multimodal tasks such as visual perception and reasoning, achieving strong performance on various multimodal evaluation benchmarks. However, these benchmarks are static and overlap with pre-training data, which fixes their complexity and exposes them to data contamination. This raises concerns about the validity of the evaluation. To address these two challenges, we introduce a dynamic multimodal evaluation protocol called Vision-Language Bootstrapping (VLB). VLB provides a robust and comprehensive assessment of LVLMs with reduced data contamination and flexible complexity. To this end, VLB dynamically generates new visual question-answering samples through a multimodal bootstrapping module that modifies both images and language, while a judge module ensures that the newly generated samples remain consistent with the original ones. By composing various bootstrapping strategies, VLB offers dynamic variants of existing benchmarks with diverse complexities, enabling the evaluation to co-evolve with the ever-evolving capabilities of LVLMs. Extensive experimental results across multiple benchmarks, including SEEDBench, MMBench, and MME, show that VLB significantly reduces data contamination and exposes performance limitations of LVLMs.

Yue Yang, Shuibai Zhang, Wenqi Shao, Kaipeng Zhang, Yi Bin, Yu Wang, Ping Luo • 2024
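
The bootstrap-then-judge loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `VQASample`, `bootstrap`, `add_distractor_text`, `edit_image`, `corrupt_answer`, and `toy_judge` are hypothetical, images are represented by plain strings, and the rule of discarding a rejected candidate while keeping the last accepted sample is one possible policy, not necessarily the one used by VLB. In the actual protocol the strategies and the judge would presumably be model-based.

```python
from dataclasses import dataclass, replace
from typing import Callable, Sequence


@dataclass(frozen=True)
class VQASample:
    """A visual question-answering sample (hypothetical structure)."""
    image: str      # placeholder for image data, e.g. a file path
    question: str
    answer: str


# A strategy maps a sample to a modified sample; a judge compares the
# original sample with a candidate and accepts or rejects the candidate.
Strategy = Callable[[VQASample], VQASample]
Judge = Callable[[VQASample, VQASample], bool]


def bootstrap(sample: VQASample,
              strategies: Sequence[Strategy],
              judge: Judge) -> VQASample:
    """Apply strategies in order, keeping only candidates the judge accepts.

    Assumption: a rejected candidate is dropped and the chain continues
    from the last accepted sample.
    """
    current = sample
    for strategy in strategies:
        candidate = strategy(current)
        if judge(sample, candidate):
            current = candidate
    return current


# Toy stand-ins for language and image bootstrapping strategies.
def add_distractor_text(s: VQASample) -> VQASample:
    # Language-side change: prepend context that does not affect the answer.
    return replace(s, question="The photo was taken on a sunny day. " + s.question)


def edit_image(s: VQASample) -> VQASample:
    # Image-side change: a string tag stands in for a real image edit.
    return replace(s, image=s.image + "#object_removed")


def corrupt_answer(s: VQASample) -> VQASample:
    # A deliberately inconsistent change that the judge should reject.
    return replace(s, answer="blue")


def toy_judge(original: VQASample, candidate: VQASample) -> bool:
    # Toy consistency check: same answer, original question still present.
    return (candidate.answer == original.answer
            and original.question in candidate.question)


seed = VQASample("img_001.png", "What color is the car?", "red")
variant = bootstrap(seed, [add_distractor_text, edit_image, corrupt_answer], toy_judge)
```

Composing more or fewer strategies in the list passed to `bootstrap` is how this sketch mimics the idea of variants with different complexity; the judge keeps each accepted variant answerable with the original ground truth.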

Related benchmarks

Task	Dataset	Result
Hallucination Evaluation	POPE	--	281
Hallucination assessment	HallusionBench	Answer Accuracy (aAcc) 68.35	39
Multi-modal Visual Capability	MMStar	Score 60.87	29
Multi-image visual perception	BLINK	Accuracy 54.13	26
Multidisciplinary knowledge and reasoning	MMMU (dev)	Score 22.67	9
Real-world Understanding	RealworldQA	Score 68.24	9
Perceptual Robustness	HRBench 4K	Overall Score 64.88	9
Perceptual Robustness	VSTAR	Overall Accuracy 70.17	9
Perceptual Robustness	HRBench-8K	Overall Score 64.88	9
Benchmark Quality Evaluation	VLSafetyBencher	MAD 7.1	6
