| Dataset Name | SOTA Method | Metric | Trend | | |
|---|---|---|---|---|---|
| Reward Bench Factuality 2 | Distribution-Calibrated Aggregation | Pairwise Accuracy 56.6 | 64 | 1mo ago | |
| RM-Bench | Qwen-Instruct-32B-Ours | Chat Score 75.6 | 55 | 5d ago | |
| RewardBench2 (test) | | Accuracy 82.9 | 20 | 12d ago | |
| Reward Bench Ties 2 | Distribution-Calibrated Aggregation | Pairwise Accuracy 91.8 | 12 | 1mo ago | |
| Reward Bench Safety 2 | Distribution-Calibrated Aggregation | Pairwise Accuracy 72.3 | 12 | 1mo ago | |
| Reward Bench Math 2 | Distribution-Calibrated Aggregation | Pairwise Accuracy 72.3 | 12 | 1mo ago | |
| Reward-Bench | FairJudge-8B | Agreement 84.79 | 12 | 1mo ago | |