| Dataset Name | SOTA Method | Metric | Score | Trend | Updated |
|---|---|---|---|---|---|
| VL-RewardBench | MSRL + voting@16 | Accuracy | 77.5 | 76 | 4d ago |
| RewardBench Multimodal | R1-Reward | Safety Score | 99.6 | 31 | 4d ago |
| Multimodal RewardBench | Gemini 3.1 Pro | Accuracy | 88.79 | 30 | 3d ago |
| MR2Bench Video | Molmo2-4B Multi-response RM | Best-of-4 Accuracy | 50.7 | 18 | 4d ago |
| VideoRewardBench | GPT-5 | Macro Pairwise Accuracy | 68.2 | 18 | 4d ago |
| MR2Bench Image | GPT-5 | Best-of-4 Accuracy | 87.1 | 18 | 4d ago |
| MM-RLHF-RewardBench | Molmo2-4B Multi-response RM | Pairwise Accuracy | 92.4 | 18 | 4d ago |
| MM-RLHF-Reward Bench | Proxy-GRM-RL | Accuracy | 82.94 | 14 | 1mo ago |
| Multimodal Reward Bench | Proxy-GRM-RL | Reward Bench Score | 85.62 | 12 | 1mo ago |
| VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench Aggregate | EGT | Accuracy | 82.44 | 9 | 1mo ago |
| PhyCritic-Bench | Gemini-2.5-Pro | Overall Score | 78.2 | 8 | 1mo ago |
| RewardBench 2 | SW-RM-V2-LLaMA3.1-8B | Safety Score | 96.7 | 5 | 1mo ago |
| UniReward In-Domain (test) | UniRM | Quality Score | 99.3 | 5 | 1mo ago |