Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Language Models

About

Vision-language models (VLMs) have shown impressive abilities across a range of multi-modal tasks. However, existing metrics for evaluating the quality of text generated by VLMs typically focus on an overall evaluation for a specific task, such as image captioning. While the overall evaluation is essential for any task, the criteria prioritized can differ depending on the task, making it challenging for current metrics to adapt to multi-task scenarios. To address this limitation, we propose HarmonicEval, a reference-free comprehensive evaluation metric that aggregates criterion-wise scores to produce the overall score in a bottom-up manner. Furthermore, to assess the generalizability of automatic evaluation metrics in multi-task scenarios, we construct the Multi-task Multi-criteria Human Evaluation (MMHE) benchmark, which comprises 18,000 expert human judgments across four multi-modal tasks. Our experiments demonstrate that HarmonicEval achieves higher correlations with human judgments than conventional metrics while providing numerical scores for each criterion. Project page: https://stjohn2007.github.io/MMHE_project/
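The abstract describes a bottom-up aggregation: per-criterion scores are combined into a single overall score. As a minimal sketch of that idea, the snippet below aggregates criterion scores with a harmonic mean, which the metric's name suggests but which is an assumption here; the criterion names and the 1-5 score scale are illustrative, not taken from the paper.

```python
def harmonic_aggregate(criterion_scores: dict[str, float]) -> float:
    """Combine per-criterion scores (assumed 1-5 scale) into one
    overall score, bottom-up. Uses a harmonic mean: each criterion
    is weighted inversely to its score, so one weak criterion pulls
    the overall score down more than an arithmetic mean would.
    This weighting is an illustrative assumption, not necessarily
    the paper's exact formulation."""
    weights = {c: 1.0 / s for c, s in criterion_scores.items()}
    total = sum(weights.values())
    return sum(w * criterion_scores[c] for c, w in weights.items()) / total


# Hypothetical criteria for an image-captioning output:
scores = {"fluency": 4.0, "relevance": 4.0, "hallucination": 1.0}
overall = harmonic_aggregate(scores)  # dragged down by the weak criterion
```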

Masanari Ohi, Masahiro Kaneko, Naoaki Okazaki, Nakamasa Inoue • 2024

Related benchmarks

Task                             Dataset           Metric                    Result   Rank
Image Captioning Evaluation      Composite         Kendall tau-c             66.2     131
Image Captioning Evaluation      Flickr8K-CF       Kendall tau-b             39.2     99
Image Captioning Evaluation      Pascal-50S        Accuracy                  82.4     44
Image Captioning                 Flickr8k-EX       Tau-c                     0.531    22
Hallucination Evaluation         MMHE              REG                       66.6     11
Image Captioning Evaluation      FOIL              Accuracy                  97.8     10
Image Captioning                 MMHE User Study   Human Preference Count    19       2
Referring Expression Generation  MMHE User Study   Human Preference Count    19       2
Visual Document Understanding    MMHE User Study   Human Preference Count    21       2
Visual Question Answering        MMHE User Study   Human Preference Count    12       2
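Several rows above report Kendall rank correlations between the automatic metric's scores and human judgments. As a minimal sketch, the function below computes Kendall's tau-b, the tie-adjusted variant; the score lists in the usage example are made up for illustration, not benchmark data.

```python
import math
from itertools import combinations

def kendall_tau_b(x: list[float], y: list[float]) -> float:
    """Kendall's tau-b rank correlation between two score lists
    (e.g. metric scores vs. human judgments), adjusted for ties."""
    concordant = discordant = ties_x = ties_y = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0 and dy == 0:
            continue                 # tied in both lists: excluded
        elif dx == 0:
            ties_x += 1              # tied in x only
        elif dy == 0:
            ties_y += 1              # tied in y only
        elif dx * dy > 0:
            concordant += 1          # pair ranked the same way
        else:
            discordant += 1          # pair ranked oppositely
    denom = math.sqrt((concordant + discordant + ties_x)
                      * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom


# Hypothetical per-caption scores from a metric and from human raters:
metric_scores = [0.9, 0.4, 0.7, 0.2]
human_scores = [5.0, 2.0, 4.0, 1.0]
tau = kendall_tau_b(metric_scores, human_scores)  # 1.0: identical ranking
```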
