Benchmarking Large Multimodal Models against Common Corruptions

About

This technical report addresses a gap in the assessment of large multimodal models (LMMs) by examining the self-consistency of their outputs under common corruptions. We investigate the cross-modal interactions between text, image, and speech, covering four essential generation tasks: text-to-image, image-to-text, text-to-speech, and speech-to-text. We create a comprehensive benchmark, named MMCBench, that covers more than 100 popular LMMs (over 150 model checkpoints in total). A thorough evaluation under common corruptions is critical for practical deployment and facilitates a better understanding of the reliability of cutting-edge LMMs. The benchmarking code is available at https://github.com/sail-sg/MMCBench.

Jiawei Zhang, Tianyu Pang, Chao Du, Yi Ren, Bo Li, Min Lin • 2024
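
The idea of measuring self-consistency under corruption can be illustrated with a minimal sketch. It is not taken from the MMCBench code: the function names `corrupt_text` and `self_consistency`, the character-replacement corruption, and the use of a plain string-similarity ratio are all assumptions made for illustration, and the toy model stands in for a real LMM. The sketch corrupts a text input, runs the same model on the clean and corrupted versions, and scores how similar the two outputs are.

```python
import random
from difflib import SequenceMatcher
from typing import Callable


def corrupt_text(text: str, rate: float, seed: int = 0) -> str:
    """Hypothetical corruption: replace a fraction of letters with random ones."""
    rng = random.Random(seed)
    chars = list(text)
    for i, ch in enumerate(chars):
        if ch.isalpha() and rng.random() < rate:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)


def self_consistency(
    model: Callable[[str], str], text: str, rate: float, seed: int = 0
) -> float:
    """Similarity in [0, 1] between outputs for the clean and corrupted input.

    String similarity is an assumed stand-in for whatever metric a real
    benchmark would use to compare outputs in a given modality.
    """
    clean_output = model(text)
    corrupted_output = model(corrupt_text(text, rate, seed))
    return SequenceMatcher(None, clean_output, corrupted_output).ratio()


def toy_model(prompt: str) -> str:
    """Toy stand-in for an LMM: it simply upper-cases its input."""
    return prompt.upper()


if __name__ == "__main__":
    prompt = "a photo of a cat sitting on a sofa"
    for rate in (0.0, 0.1, 0.3):
        score = self_consistency(toy_model, prompt, rate)
        print(f"corruption rate {rate:.1f}: consistency {score:.3f}")
```

A score of 1.0 means the corruption left the output unchanged; lower scores indicate that the output drifted away from the clean-input output. For image or speech outputs, the string comparison would need to be replaced by a similarity measure suited to that modality.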

Related benchmarks

Task	Dataset	Result
Multimodal Reward Modeling	VL-RewardBench	Accuracy 19.04	102
Multimodal Reward Modeling	Multimodal RewardBench	Accuracy 42	50
Multimodal Reward Modeling	RewardBench Multimodal	Safety Score 19.1	44
Reward Modeling	VLRewardBench (test)	General 7.2	39
Multimodal Reward Modeling	MM-RLHF-RewardBench	--	18
Multimodal Reward Modeling	VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench Aggregate	Accuracy 26.05	13
