Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators

About

Human-generated reward signals are critical for aligning generative models with human preferences, guiding both training and inference-time evaluations. While large language models (LLMs) employed as proxy evaluators, i.e., LLM-as-a-Judge, significantly reduce the costs associated with manual annotations, they typically require extensive modality-specific training data and fail to generalize well across diverse multimodal tasks. In this paper, we propose Flex-Judge, a reasoning-guided multimodal judge model that leverages minimal textual reasoning data to robustly generalize across multiple modalities and evaluation formats. Our core intuition is that structured textual reasoning explanations inherently encode generalizable decision-making patterns, enabling an effective transfer to multimodal judgments, e.g., with images or videos. Empirical results demonstrate that Flex-Judge, despite being trained on significantly less text data, achieves competitive or superior performance compared to state-of-the-art commercial APIs and extensively trained multimodal evaluators. Notably, Flex-Judge has broad impact in modalities such as molecules, where comprehensive evaluation benchmarks are scarce, underscoring its practical value in resource-constrained domains. Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable multimodal model-as-a-judge.

Jongwoo Ko, Sungnyun Kim, Sungwoo Cho, Se-Young Yun • 2025

Related benchmarks

Task	Dataset	Metric	Result
Video Understanding Reward Modeling	VURB	General Video Understanding	40.6	18
Multimodal Judgment	MM-Vet	Overall Score	33	16
Multimodal Judgment	C.C.	Score	33.7	16
Multimodal Judgment	COCO	Score	23.4	16
Video Reward Modeling	VideoRewardBench	Perception (long)	35	16
Multimodal Judgment	DiFF	Score	0.215	15
LLM-as-a-Judge	PandaLM Human Annotations (test)	Agreement	0.7134	13
LLM-as-a-Judge	JudgeLM (test)	Agreement	77.44	13
LLM-as-a-Judge	FairJudge Benchmark 1K (test)	Agreement	59.58	13
Reward Modeling Evaluation	Reward-Bench	Agreement	75.55	12
