| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| MTbench (test) | | StdDev | 2.24 | 45 | 4d ago |
| PreferenceBench | CalibraEval | Rstd | 0.69 | 36 | 4d ago |
| RewardBench 1.0 (test) | CC | Rstd | 0.54 | 36 | 4d ago |
| PRISM (test) | SynthesizeMe | Accuracy | 58.9 | 14 | 4d ago |
| Chatbot Arena (test) | Gemini-2.5-Pro | Accuracy | 68.13 | 14 | 4d ago |
| FairJudge Benchmark 1K (test) | FairJudge-8B | Agreement | 71.5 | 13 | 4d ago |
| JudgeLM (test) | | Agreement | 79.59 | 13 | 4d ago |
| PandaLM Human Annotations (test) | FairJudge-8B | Agreement | 0.7683 | 13 | 4d ago |
| Preference Bench (test) | CalibraEval | Std Dev | 2.82 | 9 | 4d ago |
| RewardBench (test) | CalibraEval | Std Dev (Reward) | 2.72 | 9 | 4d ago |
| JudgeBench | | Accuracy | 84.19 | 8 | 4d ago |
| RewardBench | Qwen3-Next-80B-A3B-Thinking | Accuracy | 92.9 | 8 | 4d ago |
| KD-DTI (test) | GPT-4o-Mini | EM Change | 53.41 | 8 | 4d ago |
| DDI (test) | GPT-4o-Mini | EM (Δ) | 59.03 | 8 | 4d ago |
| BC5CDR (test) | GPT-4o-Mini | EM | 48.35 | 8 | 4d ago |