| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Reward Modeling | RewardBench | Accuracy | 97.8 | 166 |
| Reward Modeling | RewardBench | Chat Score | 99.4 | 146 |
| Reward Modeling | RewardBench v1.0 (test) | Average Score | 0.978 | 89 |
| Reward Modeling | RewardBench Focus 2 | Accuracy | 90.3 | 82 |
| Reward Modeling | RewardBench v2 | Accuracy | 92.1 | 72 |
| Reward Modeling | RewardBench Precise IF 2 | Accuracy | 57.5 | 70 |
| Reward Modeling | RewardBench Average 2 | Accuracy | 39.7 | 52 |
| Reward Modeling | RewardBench Math 2 | Accuracy | 35.7 | 52 |
| Reward Modeling | RewardBench v2 (test) | Average Score | 86.5 | 42 |
| LLM-as-a-Judge | RewardBench 1.0 (test) | Rstd | 0.54 | 36 |
| LLM-as-a-Judge Evaluation Consistency | RewardBench | Kappa | 83.25 | 36 |
| Multimodal Reward Modeling | RewardBench Multimodal | Safety Score | 99.6 | 31 |
| Reward Modeling | RewardBench 2 | Accuracy | 89.5 | 30 |
| Multimodal Reward Modeling | Multimodal RewardBench | Accuracy | 88.79 | 30 |
| Pair-wise Comparison | RewardBench | Accuracy | 93.7 | 29 |
| Reward Modeling | RewardBench v1 | Accuracy | 95.5 | 28 |
| Reward Modeling | RewardBench (test) | RWBench | 0.933 | 25 |
| Uncertainty Calibration | RewardBench | Kuiper | 0.009 | 24 |
| Reward Modeling Evaluation | RewardBench2 (test) | Accuracy | 82.9 | 20 |
| Reward Modeling | RewardBench 2 | L-Acc | 93.4 | 20 |
| Reward Modeling | RewardBench unified-feedback (test) | Average Score | 84 | 20 |
| Multi-modal Preference Evaluation | MM-RewardBench | Accuracy | 72.9 | 19 |
| Reward Modeling | RewardBench Chat | Accuracy | 96.4 | 18 |
| Multimodal Reward Modeling | MM-RLHF-RewardBench | Pairwise Accuracy | 92.4 | 18 |
| Reward Modeling | RewardBench 1k | Positional Consistency | 84.9 | 16 |