| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Reward Modeling | RewardBench | Chat Score | 99.4 | 216 |
| Reward Modeling | RewardBench | Accuracy | 97.8 | 166 |
| Reward Modeling | RewardBench v1.0 (test) | Average Score | 0.978 | 89 |
| Reward Modeling | RewardBench Focus 2 | Accuracy | 90.3 | 82 |
| Reward Modeling | RewardBench v2 | Accuracy | 92.1 | 72 |
| Reward Modeling | RewardBench Precise IF 2 | Accuracy | 57.5 | 70 |
| Reward Modeling | RewardBench v2 (test) | Average Score | 86.5 | 67 |
| Reward Modeling | RewardBench Average 2 | Accuracy | 39.7 | 52 |
| Reward Modeling | RewardBench Math 2 | Accuracy | 35.7 | 52 |
| Multimodal Reward Modeling | Multimodal RewardBench | Accuracy | 60.7 | 50 |
| Multimodal Reward Modeling | RewardBench Multimodal | Safety Score | 99.6 | 44 |
| MLLM-as-a-judge evaluation | VL RewardBench | Accuracy | 80.75 | 42 |
| Reward Modeling | RewardBench Chat | Accuracy | 96.4 | 42 |
| Reward Modeling | RewardBench 2 | Precise IF Score | 71 | 41 |
| Reward Modeling | RewardBench (full) | Chat Score | 99.2 | 41 |
| Reward Modeling | RewardBench | Accuracy | 88.8 | 36 |
| LLM-as-a-Judge | RewardBench 1.0 (test) | Rstd | 0.54 | 36 |
| LLM-as-a-Judge Evaluation Consistency | RewardBench | Kappa | 83.25 | 36 |
| LLM-as-a-Judge | RewardBench | Accuracy | 92.9 | 31 |
| Reward Modeling | RewardBench 2 | Accuracy | 89.5 | 30 |
| Pair-wise comparison | RewardBench | Accuracy | 93.7 | 29 |
| Reward Modeling | RewardBench v1 | Accuracy | 95.5 | 28 |
| Reward Modeling | RewardBench (test) | RWBench | 0.933 | 25 |
| Reward Modeling | RewardBench latest (test) | Accuracy | 74.9 | 24 |
| Uncertainty Calibration | RewardBench | Kuiper | 0.009 | 24 |