Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification

About

Verifiers--functions assigning rewards to agent behavior--have been key to AI progress in math, code, and games. However, extending gains to domains without clear-cut success criteria remains a challenge: while humans can recognize desired outcomes, translating this intuition into scalable rules is nontrivial. Multimodal LLMs (MLLMs) offer a promising solution, given their world knowledge, human-preference alignment, and reasoning capabilities. We evaluate MLLM verifiers across web navigation, computer use, and robotics, spanning 13+ models, 28+ designs, and thousands of trajectories from diverse agents. We identify a critical limitation: a strong tendency for MLLMs to over-validate agent behavior--a phenomenon we term agreement bias. This bias is pervasive, resilient to test-time scaling, and can harm applications relying on MLLM judgments/rewards (e.g., self-improvement, steering, online supervision). We discuss several considerations for evaluating and designing MLLM verifiers, and introduce SGV, a lightweight method that better leverages their capabilities by modulating (un)conditional generation. First, an MLLM is elicited to generate broad priors about desired behavior, independent of the data under evaluation. Then, conditioned on self-generated priors, it reasons over and evaluates a candidate trajectory. Our methods yield more human-aligned verifiers, improving failure detection by 25pp and accuracy by 14pp. In self-improvement and online supervision, they boost task completion of a GUI specialist in OSWorld, a diffusion policy in robomimic, and a ReAct agent in VisualWebArena--surpassing the previous state of the art by 20pp. As a byproduct, we release an update of VisualWebArena featuring strong agent baselines, more human-aligned oracles, container parallelism with high fidelity and proper resets, >10x speedups, and VWA-Lite, a 1/3 subset with comparable evaluation fidelity.
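The two-step procedure described above (priors first, then conditioned evaluation) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `generate` callable, the prompt wording, the `VERDICT:` output convention, and the text-only trajectory format are all assumptions made for the example. A real MLLM verifier would typically also receive screenshots or other observations.

```python
from typing import Callable, Sequence

# Hypothetical interface: any function that maps a prompt string to model text.
Generate = Callable[[str], str]


def sgv_verify(generate: Generate, task: str, trajectory: Sequence[str]) -> bool:
    """Two-step verification sketch: elicit priors, then evaluate conditioned on them."""
    # Step 1: ask for broad priors about desired behavior.
    # The candidate trajectory is deliberately NOT part of this prompt.
    prior_prompt = (
        f"Task: {task}\n"
        "Describe what a successful completion of this task generally looks like."
    )
    priors = generate(prior_prompt)

    # Step 2: evaluate the candidate trajectory, conditioned on the self-generated priors.
    steps = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(trajectory))
    eval_prompt = (
        f"Task: {task}\n"
        f"Expected behavior:\n{priors}\n"
        f"Candidate trajectory:\n{steps}\n"
        "Reason step by step, then end with 'VERDICT: SUCCESS' or 'VERDICT: FAILURE'."
    )
    response = generate(eval_prompt)

    # Assumed convention: anything other than an explicit success verdict counts as failure.
    return response.strip().upper().endswith("VERDICT: SUCCESS")
```

The point of the split is that the first call cannot be influenced by the trajectory under evaluation, so the expectations it produces are formed before the model sees what the agent actually did.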

Moises Andrade, Joonhyuk Cha, Brandon Ho, Vriksha Srihari, Karmesh Yadav, Zsolt Kira • 2025

Related benchmarks

Task	Dataset	Result
Verification of Digital Agent Trajectories	VisualWebArena (VWA) and OSWorld Trajectories	Accuracy: 86	28
Web task automation	VisualWebArena full	SR: 54	21
Web navigation	VisualWebArena	Published VWA Success Rate: 54	13
Web Navigation and Task Automation	VisualWebArena 910 tasks (Full)	Success Rate (%): 54	9
Reward Verification	AgentRewardBench VisualWebArena	Precision: 100	7
Web navigation	VisualWebArena All Shopping Classifieds Reddit	Success Rate (All): 54	6
Visual web navigation / Agent interaction	VisualWebArena	Success Rate: 0.54	5
Visual Web Navigation	VWA-910	Success Rate (%): 54	4
Web navigation	OSWorld (All)	Success Rate (All): 27	3
Computer Use	VWA-910	Headline SR: 54	1

