
CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards

About

Recently, the role of LLM-as-judge in evaluating large language models has gained prominence. However, current judge models suffer from narrow specialization and limited robustness, undermining their capacity for comprehensive evaluations. In this work, we present CompassJudger-2, a novel generalist judge model that overcomes these limitations via a task-driven, multi-domain data curation strategy. Central to our approach is supervising judgment tasks with verifiable rewards, guiding intrinsic critical reasoning through rejection sampling to foster robust, generalizable judgment capabilities. We introduce a refined learning objective with margin policy gradient loss to enhance performance. Empirically, CompassJudger-2 achieves superior results across multiple judge and reward benchmarks, and our 7B model demonstrates competitive judgment accuracy with significantly larger models like DeepSeek-V3 and Qwen3-235B-A22B. Additionally, we propose JudgerBenchV2, a comprehensive benchmark evaluating cross-domain judgment accuracy and rank consistency to standardize judge model evaluation. These contributions advance robust, scalable LLM judgment and establish new performance and evaluation standards.
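The abstract names verifiable rewards and rejection sampling but does not spell out how they are implemented. What follows is a minimal sketch of the general idea, not the authors' method: a judge's free-text output is scored against a known preference label, and only outputs with a verifiably correct verdict are kept as training data. The `Verdict: A` output format and the function names `parse_verdict`, `verifiable_reward`, and `rejection_sample` are assumptions made for illustration.

```python
import re
from typing import List, Optional


def parse_verdict(judgment: str) -> Optional[str]:
    """Extract the last 'A' or 'B' verdict from a judge's free-text output.

    Assumes a hypothetical output format containing e.g. 'Verdict: A'.
    Returns None when no verdict can be found.
    """
    matches = re.findall(r"Verdict:\s*([AB])\b", judgment)
    return matches[-1] if matches else None


def verifiable_reward(judgment: str, gold: str) -> float:
    """Return 1.0 if the parsed verdict equals the gold label, else 0.0."""
    return 1.0 if parse_verdict(judgment) == gold else 0.0


def rejection_sample(candidates: List[str], gold: str) -> List[str]:
    """Keep only the candidate judgments whose verdict is verifiably correct."""
    return [c for c in candidates if verifiable_reward(c, gold) == 1.0]


# Toy example: three sampled judgments for one pair whose gold label is 'A'.
candidates = [
    "Response A cites the right formula. Verdict: A",
    "Response B is longer, so it is better. Verdict: B",
    "Both are similar, no clear winner.",
]
kept = rejection_sample(candidates, gold="A")
print(kept)
```

In this sketch a judgment with no parseable verdict is treated the same as a wrong one, so it is discarded. That is a design choice of the example and may differ from the paper.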

Taolin Zhang, Maosong Cao, Alexander Lam, Songyang Zhang, Kai Chen • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Reward Modeling | JudgeBench (test) | Overall: 65.5 | 40 |
| Reward Modeling | RM-Bench (test) | Overall Score: 73.2 | 39 |
| Reward Modeling | PPE Correctness (test) | PPE Corr: 60.2 | 26 |
| Reward Modeling | RewardBench (test) | RWBench: 0.926 | 25 |
