VC-Inspector: Advancing Reference-free Evaluation of Video Captions with Factual Analysis
About
We propose VC-Inspector, a lightweight, open-source large multimodal model (LMM) for reference-free evaluation of video captions, with a focus on factual accuracy. Unlike existing metrics that suffer from limited context handling, weak factuality assessment, or reliance on proprietary services, VC-Inspector offers a reproducible and fact-aware alternative that aligns closely with human judgments. To enable robust training and interpretable evaluation, we introduce a systematic framework for generating captions with controllable factual errors, paired with graded quality scores and explanatory annotations. Experiments demonstrate that VC-Inspector achieves state-of-the-art correlation with human judgments, generalizing across diverse domains (e.g., the VATEX-Eval, Flickr8K-Expert, and Flickr8K-CF benchmarks) and revealing the potential for caption improvement. The project page is available at https://dipta007.github.io/VC-Inspector
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Correlation with human judgment | Flickr8K-CF | Tau B | 46 | 48 |
| Video Captioning Evaluation Correlation | VATEX Eval | Kendall's Tau-b | 42.58 | 40 |
| Object Hallucination Detection | FOIL-COCO (test) | Accuracy | 99.6 | 25 |
| Correlation with Human Judgments | Flickr8k Expert | Tau-b Correlation | 63.4 | 19 |
| Hallucination Detection | ActivityNet-FOIL (test) | Accuracy | 99.3 | 5 |
| Quality Estimation | ActivityNet FG (eval) | Kendall's Tau (b) | 49.53 | 4 |
| Quality Estimation | YouCook2 FG Eval | Kendall's Tau-b | 44.29 | 4 |
| Video Caption Evaluation | YouCook2 Eval (val) | Tau_b Score | 72.8 | 4 |
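
Most rows above report Kendall's tau-b between metric scores and human ratings, presumably scaled by 100. As a rough illustration of what that statistic measures, the sketch below computes tau-b from scratch with the standard tie correction. The function name `kendall_tau_b` and the toy scores are hypothetical and chosen only for this example; they are not taken from the paper or its benchmarks.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists (tie-corrected)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    # Compare every unordered pair of items once.
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominator terms
        if dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1  # both lists order the pair the same way
        else:
            discordant += 1  # the lists disagree on the pair's order
    denom = sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    if denom == 0:
        raise ValueError("tau-b is undefined when either input is constant")
    return (concordant - discordant) / denom


# Toy data (made up): human ratings and metric scores for five captions.
human = [1, 2, 3, 4, 5]
metric = [0.1, 0.4, 0.3, 0.8, 0.9]
print(round(100 * kendall_tau_b(human, metric), 2))  # tau-b on a 0-100 scale
```

In this toy case the metric misorders one of the ten caption pairs, so nine pairs are concordant and one is discordant. The quadratic pairwise loop is kept for readability; for large evaluation sets a library routine such as `scipy.stats.kendalltau` would be the more practical choice.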