
CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution

About

Efficient and accurate evaluation is crucial for the continuous improvement of large language models (LLMs). Among various assessment methods, subjective evaluation has garnered significant attention due to its superior alignment with real-world usage scenarios and human preferences. However, human-based evaluations are costly and lack reproducibility, making precise automated evaluators (judgers) vital in this process. In this report, we introduce CompassJudger-1, the first open-source all-in-one judge LLM. CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility. It is capable of: 1. Performing unitary scoring and two-model comparisons as a reward model; 2. Conducting evaluations according to specified formats; 3. Generating critiques; 4. Executing diverse tasks like a general LLM. To assess the evaluation capabilities of different judge models under a unified setting, we have also established JudgerBench, a new benchmark that encompasses various subjective evaluation tasks and covers a wide range of topics. CompassJudger-1 offers a comprehensive solution for various evaluation tasks while maintaining the flexibility to adapt to diverse requirements. Both CompassJudger and JudgerBench are released and available to the research community at https://github.com/open-compass/CompassJudger. We believe that by open-sourcing these tools, we can foster collaboration and accelerate progress in LLM evaluation methodologies.

Maosong Cao, Alexander Lam, Haodong Duan, Hongwei Liu, Songyang Zhang, Kai Chen • 2024
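
The abstract above describes CompassJudger-1's unitary (pointwise) scoring mode. As a rough illustration, here is a minimal sketch of invoking it as a judge through Hugging Face transformers; the checkpoint id and the judge prompt below are assumptions for illustration, not the official release names or evaluation templates (see the GitHub repository for the actual checkpoints and prompts).

```python
# Minimal sketch: using CompassJudger-1 as a pointwise judge via Hugging Face
# transformers. The model id is an assumption; consult the GitHub repo for the
# exact released checkpoints.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "opencompass/CompassJudger-1-7B-Instruct"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Pointwise-scoring prompt: ask the judge to rate one response.
# This template is illustrative, not the official one.
prompt = (
    "Please evaluate the following response to the question on a 1-10 scale "
    "and briefly justify your score.\n\n"
    "Question: What causes tides on Earth?\n\n"
    "Response: Tides are caused mainly by the gravitational pull of the Moon "
    "and, to a lesser extent, the Sun, acting on Earth's oceans."
)
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens (the judge's critique and score).
print(tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=True))
```

In practice the numeric score would be parsed out of the generated critique; two-model (pairwise) comparison works the same way, with both candidate responses placed in the prompt and the judge asked to pick a winner.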

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Reward Modeling | JudgeBench (test) | Overall | 42.3 | 40 |
| Reward Modeling | RM-Bench (test) | Overall Score | 54.4 | 39 |
| Reward Modeling | PPE Correctness (test) | PPE Corr | 48 | 26 |
| Creative Writing | Arena-Hard Creative Writing v2 | Score | 39.4 | 25 |
| Reward Modeling | RewardBench (test) | RWBench | 0.812 | 25 |
| Text Quality Meta-evaluation | Topical-Chat (Local) | Understandability | 0.795 | 16 |
| Text Summarization | SummEval Global | Coherence | 84.4 | 16 |
| Text Quality Meta-evaluation | SummEval (Local) | Coherence | 0.665 | 16 |
| Text Quality Meta-evaluation | SummEval & Topical-Chat Combined (Overall) | Overall Score | 66.7 | 16 |
| Dialogue Response Generation | Topical-Chat Global | Understandability | 94.6 | 16 |

Showing 10 of 15 rows.
