Tool-use Agent Evaluation on τ-bench retail domain (First 10 tasks)
[Chart: Pass@1 Success Rate. Best result: POLCA, 57.5 (Mar 16, 2026)]
Evaluation Results
Method        Date      Pass@1 Success Rate
POLCA         2026.03   57.5
GEPA          2026.03   55.7
OpenEvolve    2026.03   37.3
Base Prompt   2026.03   34.8