MJ1: Multimodal Judgment via Grounded Verification

About

Multimodal judges struggle to ground decisions in visual evidence. We present MJ1, a multimodal judge trained with reinforcement learning that enforces visual grounding through a structured grounded verification chain (observations $\rightarrow$ claims $\rightarrow$ verification $\rightarrow$ evaluation $\rightarrow$ scoring) and a counterfactual consistency reward that penalizes position bias. Even without training, our mechanism improves base-model accuracy on MMRB2 by +3.8 points on Image Editing and +1.7 on Multimodal Reasoning. After training, MJ1, with only 3B active parameters, achieves 77.0% accuracy on MMRB2 and surpasses orders-of-magnitude larger models like Gemini-3-Pro. These results show that grounded verification and consistency-based training substantially improve multimodal judgment without increasing model scale.
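
The abstract does not spell out how the counterfactual consistency reward is computed. As a rough illustration only, one way such a reward could work is to query the judge twice, once per presentation order, and penalize verdicts that flip when the candidates are swapped. The `Judge` interface, the function names, and the `inconsistency_penalty` value below are all assumptions made for this sketch, not details taken from MJ1.

```python
from typing import Callable, Literal

Verdict = Literal["A", "B"]

# Hypothetical judge interface (an assumption, not MJ1's API):
# takes (prompt, first_response, second_response) and returns which
# slot it prefers, "A" for the first slot or "B" for the second.
Judge = Callable[[str, str, str], Verdict]


def swap_label(verdict: Verdict) -> Verdict:
    """Map a verdict given in swapped order back to the original labels."""
    return "B" if verdict == "A" else "A"


def counterfactual_consistency_reward(
    judge: Judge,
    prompt: str,
    resp_a: str,
    resp_b: str,
    gold: Verdict,
    inconsistency_penalty: float = 0.5,  # assumed value, for illustration
) -> float:
    """Reward correctness in the original order, minus a penalty when the
    verdict changes after the two candidates swap positions."""
    original = judge(prompt, resp_a, resp_b)
    # Counterfactual query: same candidates, opposite presentation order.
    swapped = swap_label(judge(prompt, resp_b, resp_a))

    correct = 1.0 if original == gold else 0.0
    if original != swapped:
        # The verdict depended on position rather than content.
        return correct - inconsistency_penalty
    return correct


def always_first(prompt: str, first: str, second: str) -> Verdict:
    """A position-biased toy judge that always prefers the first slot."""
    return "A"


def prefers_longer(prompt: str, first: str, second: str) -> Verdict:
    """A content-based toy judge that prefers the longer response."""
    return "A" if len(first) >= len(second) else "B"
```

Under this sketch, a judge that always picks the first slot is penalized even when it happens to be right in the original order, while a judge whose verdict follows the content keeps the full reward.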

Bhavesh Kumar, Dylan Feng, Leonard Tang • 2026

Related benchmarks

Task	Dataset	Result	Rank
Multimodal Preference Evaluation	MMRB2	T2I Accuracy: 80.2	16

