| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| HH-RLHF | LLM-Judge | Accuracy | 61.4 | 30 | 7d ago |
| Math Reasoning | BTPO | Accuracy | 87.6 | 20 | 2mo ago |
| Instruction Following | BTPO | Accuracy | 65.2 | 20 | 2mo ago |
| Helpfulness & Harmlessness | BTPO | Accuracy | 72.2 | 20 | 2mo ago |
| RewardBench 2 | HRC | Factuality | 68.42 | 10 | 15d ago |
| Arena-Hard V2 | Nanbeige4.1-3B | Win Rate | 73.2 | 9 | 3mo ago |
| HelpSteer2 held-out (test) | Mean-Var | Preference Accuracy | 68.4 | 7 | 3mo ago |
| MultiPref held-out (test) | Mean-Var | Preference Accuracy | 66.4 | 6 | 3mo ago |
| GAIP 1.0 (test) | JAC | NDCG@5 | 0.3193 | 4 | 3mo ago |