| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| RM-Bench | Flexible Principles ScalarRM | Chat Score | 85.3 | 69 | 8d ago |
| Reward Bench Factuality 2 | Distribution-Calibrated Aggregation | Pairwise Accuracy | 56.6 | 64 | 3mo ago |
| RewardBench2 (test) | | Accuracy | 82.9 | 20 | 1mo ago |
| Reward Bench Ties 2 | Distribution-Calibrated Aggregation | Pairwise Accuracy | 91.8 | 12 | 3mo ago |
| Reward Bench Safety 2 | Distribution-Calibrated Aggregation | Pairwise Accuracy | 72.3 | 12 | 3mo ago |
| Reward Bench Math 2 | Distribution-Calibrated Aggregation | Pairwise Accuracy | 72.3 | 12 | 3mo ago |
| Reward-Bench | FairJudge-8B | Agreement | 84.79 | 12 | 3mo ago |
| HelpSteer3 (test) | DPO+Filter | Score | -5.89 | 5 | 22d ago |
| UltraFeedback (test) | DPO+Filter | Score | -3.12 | 5 | 22d ago |
| RewardBench | SFT + TTL + DPO + TPO | R-Bench Score | 88.1 | 3 | 23d ago |