
No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers

About

Visual reasoning is challenging, requiring both precise object grounding and an understanding of complex spatial relationships. Existing methods fall into two camps: language-only chain-of-thought approaches, which demand large-scale (image, query, answer) supervision, and program-synthesis approaches, which use pre-trained models and avoid training but suffer from flawed logic and erroneous grounding. We propose an annotation-free training framework that improves both reasoning and grounding. Our framework uses AI-powered verifiers: an LLM verifier refines LLM reasoning via reinforcement learning, while a VLM verifier strengthens visual grounding through automated hard-negative mining, eliminating the need for ground-truth labels. This design combines the strengths of modern AI systems: advanced language-only reasoning models for decomposing spatial queries into simpler subtasks, and strong vision specialist models improved via performant VLM critics. We evaluate our approach across diverse spatial reasoning tasks and show that it improves visual reasoning and surpasses open-source and proprietary models; with our improved visual grounding model, it further outperforms recent text-only visual reasoning methods. Project webpage: https://glab-caltech.github.io/valor/

Damiano Marsili, Georgia Gkioxari • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Visual Question Answering | GQA | Accuracy 64.4 | 963 |
| Visual Question Answering | RealworldQA | Accuracy 57.3 | 98 |
| Visual Reasoning | BLINK | Accuracy 69.2 | 50 |
| Spatial Reasoning | Visual Spatial Reasoning (VSR) | Accuracy 75.6 | 48 |
| Spatial Reasoning | RealworldQA | Accuracy 57.3 | 32 |
| Visual Question Answering | TallyQA | Accuracy 51 | 29 |
| Counting | countbenchqa | Accuracy 75.9 | 28 |
| Counting | TallyQA | Accuracy 51 | 28 |
| Visual Question Answering | VSR | -- | 26 |
| Visual Question Answering | countbenchqa | Accuracy 75.9 | 20 |
Showing 10 of 20 rows
