Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

Agentic Task Evaluation on τ2-Bench Retail (avg@4, pass@4)

69.7Avg@4

Snapshot

63.730465.280266.8368.3798May 12, 2026
Updated 21d ago

Evaluation Results

MethodLinks
2026.05
69.792.1
2026.05
67.8292.1
2026.05
66.2389.47
2026.05
65.887.7
2026.05
65.7290.35
2026.05
65.4389.47
2026.05
64.486.84
2026.05
63.9688.6