VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

About

We present VLMEvalKit, an open-source, PyTorch-based toolkit for evaluating large multi-modality models. The toolkit aims to provide a user-friendly and comprehensive framework for researchers and developers to evaluate existing multi-modality models and publish reproducible evaluation results. VLMEvalKit implements more than 450 large multi-modality model configurations, covering both proprietary APIs and open-source models, and supports more than 330 multi-modal benchmarks. A new model can be added by implementing a single interface; the toolkit then handles the remaining workload automatically, including data preparation, distributed inference, prediction post-processing, and metric calculation. VLMEvalKit has also evolved into a broader evaluation suite spanning video/audio, document understanding, GUI grounding, spatial reasoning, safety, scientific reasoning, and multi-turn dialogue. Based on the evaluation results obtained with the toolkit, we host the OpenVLM Leaderboard, a comprehensive leaderboard that tracks the progress of multi-modality learning research. The toolkit is released at https://github.com/open-compass/VLMEvalKit and is actively maintained.
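The single-interface design described above can be illustrated with a minimal sketch. This is not VLMEvalKit's actual API: the class `MultiModalModel`, its `generate` method, the helper `postprocess`, the `evaluate` loop, and the message format are all hypothetical names chosen for illustration. Distributed inference is left out, and the metric is plain exact-match accuracy on option letters.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class MultiModalModel(ABC):
    """Hypothetical single interface: one method maps a mixed message to text."""

    @abstractmethod
    def generate(self, message: List[Dict[str, str]]) -> str:
        """message is a list of {"type": "image" | "text", "value": ...} items."""


class AlwaysAModel(MultiModalModel):
    """Toy stand-in for a real model: gives the same answer to every question."""

    def generate(self, message: List[Dict[str, str]]) -> str:
        return "The answer is A."


def postprocess(prediction: str) -> str:
    """Toy post-processing: return the first standalone option letter A-D."""
    for token in prediction.replace(".", " ").split():
        if token in {"A", "B", "C", "D"}:
            return token
    return ""


def evaluate(model: MultiModalModel, samples: List[Dict[str, str]]) -> float:
    """Shared pipeline: build inputs, run the model, post-process, score."""
    correct = 0
    for sample in samples:
        # Data preparation: wrap the image path and question in one message.
        message = [
            {"type": "image", "value": sample["image"]},
            {"type": "text", "value": sample["question"]},
        ]
        prediction = postprocess(model.generate(message))
        correct += int(prediction == sample["answer"])
    return correct / len(samples)


# Made-up samples for illustration only; the image paths are never opened.
samples = [
    {"image": "img_0.jpg", "question": "Which option is a cat?", "answer": "A"},
    {"image": "img_1.jpg", "question": "Which option is a dog?", "answer": "B"},
]
accuracy = evaluate(AlwaysAModel(), samples)
print(f"accuracy = {accuracy:.2f}")
```

In this sketch, supporting another model only requires a new subclass with its own `generate`; the evaluation loop, post-processing, and metric stay unchanged.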

Haodong Duan, Xinyu Fang, Junming Yang, Xiangyu Zhao, Zerun Ma, Yuxuan Qiao, Mo Li, Tianhao Liang, Lin Zhu, Amit Agarwal, Xiaozhe Li, Shengyuan Ding, Jiazi Bu, Ziyu Liu, Zhangyang Qi, Yifei Li, Yuhang Zang, Zhe Chen, Lin Chen, Yuan Liu, Yubo Ma, Hailong Sun, Yifan Zhang, Shiyin Lu, Tack Hwa Wong, Weiyun Wang, Peiheng Zhou, Chaoyou Fu, Junbo Cui, Jixuan Chen, Enxin Song, Song Mao, Junming Lin, Xilin Wei, Jinsong Li, Zeyi Sun, Zhaowei Wang, Zicheng Zhang, Xiaoyi Dong, Junjun He, Pan Zhang, Jiaqi Wang, Dahua Lin, Kai Chen • 2024

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Hallucination and Visual Reasoning Evaluation	HallusionBench	Accuracy (aACC)	68.5	61
Multimodal Capability Evaluation	MMStar	Overall Score	60.7	31
Object Hallucination Detection	POPE	Accuracy	89.1	11
General Multimodal Performance	POPE, HallusionBench, MMStar Average	Overall Score	66.5	11
Open-ended Question Answering	OKVQA	LVM Evaluation Score	70.8	6

