| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Reward Modeling | RewardBench | Avg Score | 95.1 | 118 |
| Reward Modeling | RewardBench Focus 2 | Accuracy | 90.3 | 82 |
| Reward Modeling | RewardBench Precise IF 2 | Accuracy | 57.5 | 70 |
| Reward Modeling | RewardBench | Accuracy | 95.1 | 70 |
| Reward Modeling | RewardBench Average 2 | Accuracy | 39.7 | 52 |
| Reward Modeling | RewardBench Math 2 | Accuracy | 35.7 | 52 |
| LLM-as-a-Judge | RewardBench 1.0 (test) | Rstd | 0.54 | 36 |
| LLM-as-a-Judge Evaluation Consistency | RewardBench | Kappa | 83.25 | 36 |
| Reward Modeling | RewardBench v1.0 (test) | Chat Score | 0.9777 | 27 |
| Reward Modeling | RewardBench (test) | RWBench | 0.933 | 25 |
| Uncertainty Calibration | RewardBench | Kuiper | 0.009 | 24 |
| Reward Modeling | RewardBench 2 | L-Acc | 93.4 | 20 |
| Reward Modeling | RewardBench unified-feedback (test) | Average Score | 84 | 20 |
| Multi-modal Preference Evaluation | MM-RewardBench | Accuracy | 72.9 | 19 |
| Reward Modeling | RewardBench Chat | Accuracy | 96.4 | 18 |
| Multimodal Reward Modeling | Multimodal RewardBench | Accuracy | 85.4 | 17 |
| Pairwise LLM Judging | RewardBench | Coverage | 100 | 16 |
| Pairwise Comparison | RewardBench | Accuracy | 93.7 | 16 |
| Reward Modeling | RewardBench v2 | Accuracy | 90.7 | 14 |
| Reward Modeling | RewardBench latest (full) | Average Score | 93.6 | 11 |
| Listwise Judging | RewardBench listwise 2 | IF Score | 58.1 | 10 |
| Reward Modeling | RewardBench 2 (test) | RWBench2 Score | 76.3 | 9 |
| Multimodal Reward Modeling | VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench Aggregate | Accuracy | 82.44 | 9 |
| Multimodal Reward Modeling | MM-RLHF-RewardBench | Accuracy | 85.88 | 9 |
| LLM-as-a-Judge | RewardBench (test) | Std Dev (Reward) | 2.72 | 9 |
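Most of the "Accuracy" figures above come from pairwise preference evaluation: the reward model scores a chosen and a rejected response to the same prompt, and accuracy is the fraction of pairs where the chosen response gets the higher score. A minimal sketch, assuming a generic scoring function (the `score` callable and the toy data below are hypothetical placeholders, not RewardBench's actual API):

```python
# Sketch of pairwise reward-model accuracy, the metric behind most
# RewardBench-style "Accuracy" entries. The scoring function and example
# pairs are illustrative stand-ins only.

def pairwise_accuracy(pairs, score):
    """Fraction of (chosen, rejected) pairs where the reward model
    assigns a strictly higher score to the chosen response."""
    correct = sum(1 for chosen, rejected in pairs if score(chosen) > score(rejected))
    return correct / len(pairs)

# Toy stand-in for a reward model: longer responses score higher.
toy_score = len

pairs = [
    ("a detailed, correct answer", "wrong"),  # chosen scores higher -> counted correct
    ("ok", "a long but rejected answer"),     # chosen scores lower -> counted incorrect
]
print(pairwise_accuracy(pairs, toy_score))  # → 0.5
```

Chance performance on this metric is 50%, which is why scores in the 90s (e.g. 95.1 on the original RewardBench) sit well above baseline while the harder RewardBench 2 subsets (e.g. Math 2 at 35.7, which uses best-of-four selection rather than a single pair) remain far lower.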