
Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification

About

Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.

Qihao Liu, Chengzhi Mao, Yaojie Liu, Alan Yuille, Wen-Sheng Chu • 2025
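
The abstract describes an auditor trained to maximize disagreement among target models. As a rough illustration only, one way to score disagreement on a single generated question is the fraction of model pairs whose answers differ. This is a minimal sketch under stated assumptions, not the paper's reward: the function names `normalize` and `pairwise_disagreement`, the string-matching comparison, and the pairwise formulation are all hypothetical choices made for this example.

```python
from collections import Counter
from itertools import combinations


def normalize(answer: str) -> str:
    """Lowercase and trim an answer so trivial formatting differences are ignored.

    Hypothetical helper: a real system would likely need a more careful
    answer-matching step than plain string normalization.
    """
    return answer.strip().lower().rstrip(".")


def pairwise_disagreement(answers: list[str]) -> float:
    """Return the fraction of model pairs whose normalized answers differ.

    0.0 means every target model gave the same answer; 1.0 means no two
    models agreed. Assumed stand-in for a disagreement signal, not the
    objective used by AuditDM.
    """
    normalized = [normalize(a) for a in answers]
    pairs = list(combinations(normalized, 2))
    if not pairs:  # fewer than two models: nothing to compare
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def majority_answer(answers: list[str]) -> str:
    """Return the most common normalized answer (ties broken by first seen)."""
    return Counter(normalize(a) for a in answers).most_common(1)[0][0]
```

A question on which the target models split, such as answers `["cat", "dog"]`, scores higher than one on which they agree, so an auditor rewarded with a signal of this kind would be pushed toward inputs that separate the models.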

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | Accuracy 85.5 | 2019 |
| Visual Question Answering | VQA v2 | Accuracy 86.7 | 1429 |
| Visual Question Answering | GQA | Accuracy 71.1 | 1425 |
| Multimodal Understanding | MMStar | Accuracy 52.4 | 407 |
| Diagram Question Answering | AI2D | AI2D Accuracy 85.3 | 387 |
| Chart Question Answering | ChartQA | Accuracy 63.8 | 371 |
| Multimodal Perception and Cognition | MME | -- | 270 |
| Massive Multi-discipline Multimodal Understanding | MMMU | Accuracy 45.2 | 216 |
| Document Visual Question Answering | DocVQA | Accuracy 77.5 | 203 |
| Multimodal Understanding | SEED-Bench Image | Accuracy 72.9 | 143 |

Showing 10 of 16 rows
