
VC-Inspector: Advancing Reference-free Evaluation of Video Captions with Factual Analysis

About

We propose VC-Inspector, a lightweight, open-source large multimodal model (LMM) for reference-free evaluation of video captions, with a focus on factual accuracy. Unlike existing metrics that suffer from limited context handling, weak factuality assessment, or reliance on proprietary services, VC-Inspector offers a reproducible and fact-aware alternative that aligns closely with human judgments. To enable robust training and interpretable evaluation, we introduce a systematic framework for generating captions with controllable factual errors, paired with graded quality scores and explanatory annotations. Experiments demonstrate that VC-Inspector achieves state-of-the-art correlation with human judgments, generalizing across diverse domains (e.g., the VATEX-Eval, Flickr8K-Expert, and Flickr8K-CF benchmarks) and revealing the potential for caption improvement. The project page is available at https://dipta007.github.io/VC-Inspector

Shubhashis Roy Dipta, Tz-Ying Wu, Subarna Tripathi • 2025

Related benchmarks

Task | Dataset | Result | Rank
Correlation with human judgment | Flickr8K-CF | Tau B: 46 | 48
Video Captioning Evaluation Correlation | VATEX Eval | Kendall's Tau-b: 42.58 | 40
Object Hallucination Detection | FOIL-COCO (test) | Accuracy: 99.6 | 25
Correlation with Human Judgments | Flickr8k Expert | Tau-b Correlation: 63.4 | 19
Hallucination Detection | ActivityNet-FOIL (test) | Accuracy: 99.3 | 5
Quality Estimation | ActivityNet FG (eval) | Kendall's Tau (b): 49.53 | 4
Quality Estimation | YouCook2 FG Eval | Kendall's Tau-b: 44.29 | 4
Video Caption Evaluation | YouCook2 Eval (val) | Tau_b Score: 72.8 | 4
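
Most of the results above are Kendall's tau-b correlations between a metric's scores and human ratings. As a rough illustration of what that statistic measures, here is a minimal pure-Python sketch of tau-b. The function name `kendall_tau_b` and the toy score lists are assumptions made for this example and are not taken from the VC-Inspector code; the listed values also appear to be scaled by 100, which is likewise an assumption.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length score lists (simple O(n^2) version).

    Illustrative helper only; not part of any VC-Inspector release.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: excluded from both denominator terms
            if dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1  # both lists order this pair the same way
            else:
                discordant += 1  # the lists disagree on this pair

    # tau-b normalizes by the geometric mean of the pair counts that are
    # untied in x and untied in y, which corrects for ties.
    untied_in_x = concordant + discordant + ties_y_only
    untied_in_y = concordant + discordant + ties_x_only
    denom = sqrt(untied_in_x * untied_in_y)
    if denom == 0:
        return float("nan")  # undefined when one list is constant
    return (concordant - discordant) / denom


# Toy data (hypothetical): human ratings vs. scores from some caption metric.
human_ratings = [1, 2, 3, 4, 5]
metric_scores = [0.10, 0.40, 0.35, 0.80, 0.90]
tau = kendall_tau_b(human_ratings, metric_scores)
```

In this toy case one of the ten caption pairs is ordered differently by the metric and by the humans, so the statistic falls below 1. For real evaluations a library routine such as `scipy.stats.kendalltau`, which computes tau-b by default, is the more practical choice.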
