Preference evaluation on ImageReward
[Chart: F1 Score and Accuracy of submitted methods. Best F1 Score: 34 (BLPO), Feb 11, 2026.]
Updated 4d ago
Evaluation Results
| Method | Judge Model | Date | F1 Score | Accuracy |
|---|---|---|---|---|
| BLPO | Llama-4-Sc... | 2026.02 | 34 | 36 |
| OPRO | Llama-4-Sc... | 2026.02 | 32 | 34 |
| BLPO | Llama-4-Ma... | 2026.02 | 32 | 35 |
| APO-image | Llama-4-Sc... | 2026.02 | 31 | 37 |
| OPRO | Llama-4-Ma... | 2026.02 | 31 | 33 |
| BLPO | Qwen2.5-VL... | 2026.02 | 29 | 34 |
| TextGrad | Llama-4-Sc... | 2026.02 | 27 | 24 |
| TextGrad | Llama-4-Ma... | 2026.02 | 27 | 29 |
| TextGrad | Qwen2.5-VL... | 2026.02 | 26 | 28 |
| APO-image | Llama-4-Ma... | 2026.02 | 25 | 26 |
| OPRO | Qwen2.5-VL... | 2026.02 | 24 | 27 |
| APO-image | Qwen2.5-VL... | 2026.02 | 23 | 27 |
| No Optim. | Llama-4-Sc... | 2026.02 | 21 | 29 |
| No Optim. | Qwen2.5-VL... | 2026.02 | 19 | 25 |
| No Optim. | Llama-4-Ma... | 2026.02 | 14 | 22 |