Web-retrieval Task Completion on FutureX
Top result: Multi agent, 49.5 Pass@1.
[Figure: Pass@1 chart; date shown on axis: Jun 1, 2026.]
Evaluation Results
Method          System Category   Date      Pass@1
Multi agent     Ours              2026.06   49.5
A-Evolve        Auto-h...         2026.06   47.5
Full System     Ours              2026.06   47.3
Adaptive        Ours              2026.06   44.1
Cont. Harness   Auto-h...         2026.06   31.8
DeepSeek        No evo...         2026.06   31.2
Sonnet          No evo...         2026.06   31
Haiku           No evo...         2026.06   31
GLM             No evo...         2026.06   30.8
SkillOS         Auto-h...         2026.06   29.8
Meta Harness    Auto-h...         2026.06   29.4
GEPA            Auto-h...         2026.06   28.2
Kimi            No evo...         2026.06   27.8
OctoTools       Human             2026.06   25.6