Multimodal Reward Modeling on MR2Bench Image
[Chart: Best-of-4 Accuracy over time. Top result: GPT-5, 87.1 (Apr 13, 2026)]
Evaluation Results
Method                          Size   Date      Best-of-4 Accuracy
GPT-5                           -      2026.04   87.1
Claude-Sonnet-4.5               -      2026.04   72.9
Gemini-2.5-Pro                  -      2026.04   71.2
InternVL3-78B                   78B    2026.04   65
Molmo2-4B Multi-response RM     4B     2026.04   62.5
Molmo2-4B                       4B     2026.04   61.7
Qwen3-VL-4B                     4B     2026.04   60.8
Qwen3-VL-32B                    32B    2026.04   60.8
Qwen3-VL-8B                     8B     2026.04   60.4
Molmo2-8B                       8B     2026.04   60
R1-Reward                       7B     2026.04   58.8
Qwen3-VL-4B Multi-response RM   4B     2026.04   58.8
LLaVA-Critic                    7B     2026.04   56.3
InternVL3-8B                    8B     2026.04   55.4
IXC-2.5-Reward                  7B     2026.04   55
Skywork-VL-Reward               7B     2026.04   52.9
Qwen2.5-VL-7B                   7B     2026.04   52.5
MM-RLHF-Reward                  7B     2026.04   45