Multi-agent policy synthesis on Cleanup
[Chart: U Score, E Score, and S Score by date. Top U Score: 2.75 (Gemini 3.1 Pro, Mar 19, 2026).]
Updated 2mo ago
Evaluation Results
| Method                | Feedback      | Date    | U Score | E Score | S Score |
|-----------------------|---------------|---------|---------|---------|---------|
| Gemini 3.1 Pro        | reward+social | 2026.03 | 2.75    | 0.54    | 432.6   |
| Gemini 3.1 Pro        | reward-only   | 2026.03 | 1.79    | 0.13    | 386     |
| Claude Sonnet 4.6     | reward+social | 2026.03 | 1.37    | 0.09    | 294.6   |
| Claude Sonnet 4.6     | reward-only   | 2026.03 | 1.14    | 0.47    | 233     |
| Claude Sonnet 4.6     | zero-shot     | 2026.03 | 1.01    | 3.06    | 137     |
| GEPA (Gemini 3.1 Pro) | -             | 2026.03 | 0.77    | 1.75    | 209.5   |
| Gemini 3.1 Pro        | zero-shot     | 2026.03 | 0.45    | 0.45    | 274.1   |
| Q-learner             | -             | 2026.03 | 0.16    | 0.2     | 208.6   |
| BFS Collector         | -             | 2026.03 | 0.1     | 0.61    | 16.4    |