| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| PreferenceBench | CalibraEval | Accuracy | 90.71 | 59 | 1mo ago |
| MTbench (test) | | StdDev | 2.24 | 45 | 3mo ago |
| MT-Bench | PA-GRPO | Accuracy | 81.4 | 44 | 1mo ago |
| RewardBench 1.0 (test) | CC | Rstd | 0.54 | 36 | 3mo ago |
| RewardBench | Qwen3-Next-80B-A3B-Thinking | Accuracy | 92.9 | 31 | 1mo ago |
| JudgeBench | | Accuracy | 84.19 | 29 | 2mo ago |
| PreferenceBench | PA-GRPO | Accuracy | 90.2 | 21 | 2mo ago |
| High-contrast response pairs | LongCat-Flash-Chat | Discriminability (πi) | 0.87 | 20 | 1mo ago |
| ARENA | EpiPersona-A | Accuracy | 66.07 | 20 | 2mo ago |
| PRISM | EpiPersona-A | Accuracy | 59.38 | 20 | 2mo ago |
| PRISM (test) | SynthesizeMe | Accuracy | 58.9 | 14 | 3mo ago |
| Chatbot Arena (test) | Gemini-2.5-Pro | Accuracy | 68.13 | 14 | 3mo ago |
| FairJudge Benchmark 1K (test) | FairJudge-8B | Agreement | 71.5 | 13 | 3mo ago |
| JudgeLM (test) | | Agreement | 79.59 | 13 | 3mo ago |
| PandaLM Human Annotations (test) | FairJudge-8B | Agreement | 0.7683 | 13 | 3mo ago |
| TL;DR | | Coverage | 82.6 | 12 | 16d ago |
| Chatbot Arena | | Coverage | 94.3 | 12 | 16d ago |
| HH-RLHF | | Coverage | 81.3 | 12 | 16d ago |
| AlpacaEval | | Coverage | 78.3 | 12 | 16d ago |
| Preference Bench (test) | CalibraEval | Std Dev | 2.82 | 9 | 3mo ago |
| RewardBench (test) | CalibraEval | Std Dev (Reward) | 2.72 | 9 | 3mo ago |
| JudgeBench (Merged GPT Claude) | | Direct Baseline Score | 87.38 | 8 | 1mo ago |
| KD-DTI (test) | GPT-4o-Mini | EM Change | 53.41 | 8 | 3mo ago |
| DDI (test) | GPT-4o-Mini | EM (Δ) | 59.03 | 8 | 3mo ago |
| BC5CDR (test) | GPT-4o-Mini | EM | 48.35 | 8 | 3mo ago |