ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs
About
Reinforcement learning (RL) has shown great effectiveness for fine-tuning large language models (LLMs) using tasks that are challenging yet easily verifiable, such as math reasoning or code generation. However, extending this success to visual perception in vision-language models (VLMs) has been impeded by the scarcity of vision-centric tasks that are simultaneously challenging and unambiguously verifiable. To this end, we introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Starting from a 200-word caption, we inject a single, subtle visual description error (altering a few words on objects, attributes, counts, or spatial relations) and task the model with pinpointing the corrupted span given the image and the modified caption. This formulation preserves the full perceptual difficulty while providing a binary, exact-match reward that is easy to compute and unambiguous. Models trained with the ViCrit task exhibit substantial gains across a variety of VL benchmarks. Crucially, the improvements transfer beyond natural-image training data to abstract image reasoning and visual math, showing promise of learning to perceive rather than merely memorizing seen objects. To facilitate evaluation, we further introduce ViCrit-Bench, a category-balanced diagnostic benchmark that systematically probes perception errors across diverse image domains and error types. Together, our results demonstrate that fine-grained hallucination criticism is an effective and generalizable objective for enhancing visual perception in VLMs.
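The binary, exact-match reward described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`inject_error`, `normalize`, `exact_match_reward`), the example caption, and the normalization rules (lowercasing, whitespace collapsing, stripping surrounding punctuation) are all assumptions made for illustration.

```python
import re


def inject_error(caption: str, original: str, replacement: str) -> str:
    """Swap one span of the caption for a corrupted span (hypothetical helper).

    Requires the original span to occur exactly once, so the corrupted
    span is the single unambiguous target.
    """
    if caption.count(original) != 1:
        raise ValueError("original span must occur exactly once in the caption")
    return caption.replace(original, replacement)


def normalize(span: str) -> str:
    """Assumed normalization: lowercase, collapse whitespace, trim punctuation."""
    span = re.sub(r"\s+", " ", span.lower().strip())
    return span.strip(" .,;:!?\"'")


def exact_match_reward(predicted: str, corrupted: str) -> float:
    """Return 1.0 if the predicted span matches the injected span, else 0.0."""
    return 1.0 if normalize(predicted) == normalize(corrupted) else 0.0


# Illustrative caption; real captions in the task are paragraph-length.
caption = "Two dogs sit on a red couch next to a window."
modified = inject_error(caption, "red couch", "blue couch")

reward_hit = exact_match_reward("Blue couch.", "blue couch")
reward_miss = exact_match_reward("red couch", "blue couch")
```

Because the reward depends only on string equality with the known injected span, it needs no learned judge or human rater, which is what makes the task cheaply verifiable.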
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Hallucination Evaluation | POPE | -- | 935 |
| Multimodal Evaluation | MME | Score 2.34e+3 | 557 |
| Multimodal Evaluation | SEED-Bench | Accuracy 76.64 | 80 |
| Multimodal Evaluation | MMStar | Accuracy 62.27 | 46 |
| Vision Understanding | CVBench 2D | Accuracy 70.84 | 22 |
| Color Understanding | ColorBench | Accuracy 39.06 | 18 |
| Visual Grounding | Lisa Grounding | Accuracy 63.2 | 18 |
| Multimodal Visual Pattern Understanding | MMVP | Accuracy 75.33 | 16 |
| Multimodal Evaluation | MMT-Bench | Accuracy 59.83 | 13 |
| Multimodal Understanding | SEED-Bench (cleaned) | Overall Score 88.3 | 10 |