AI-generated content evaluation on AGIN
[Chart: F1 Score and Accuracy by method. Top F1 Score: 33 (BLPO), Feb 11, 2026.]
Updated 1mo ago
Evaluation Results
Method      Configuration                Date      F1 Score   Accuracy
BLPO        Judge Model=Llama-4-Ma...    2026.02   33         38
APO-image   Judge Model=Llama-4-Ma...    2026.02   32         37
TextGrad    Judge Model=Llama-4-Ma...    2026.02   30         38
No Optim.   Judge Model=Llama-4-Ma...    2026.02   26         30
OPRO        Judge Model=Llama-4-Ma...    2026.02   26         30
BLPO        Judge Model=Llama-4-Sc...    2026.02   23         28
TextGrad    Judge Model=Llama-4-Sc...    2026.02   21         25
OPRO        Judge Model=Llama-4-Sc...    2026.02   19         26
TextGrad    Judge Model=Qwen2.5-VL...    2026.02   17         22
BLPO        Judge Model=Qwen2.5-VL...    2026.02   17         14
No Optim.   Judge Model=Llama-4-Sc...    2026.02   16         21
OPRO        Judge Model=Qwen2.5-VL...    2026.02   11         9
No Optim.   Judge Model=Qwen2.5-VL...    2026.02   9          8
APO-image   Judge Model=Qwen2.5-VL...    2026.02   9          6
APO-image   Judge Model=Llama-4-Sc...    2026.02   6          8