Tool-use Agent Evaluation on τ-bench retail domain (All 115 tasks)
Best Pass@1: 43.9 (POLCA), Mar 16, 2026
Evaluation Results

Method        Date      Pass@1
POLCA         2026.03   43.9
GEPA          2026.03   42.9
OpenEvolve    2026.03   41.8
Base Prompt   2026.03   38.9