
Agent Performance on tau-bench

78.3 Retail Accuracy

GPT-5-think

[Chart: y-axis ticks 10.388, 28.019, 45.65, 63.281; x-axis Dec 28, 2025 to Feb 10, 2026]
Updated 4d ago

Evaluation Results

Method (date)  Retail Accuracy  Metric 2  Metric 3  Metric 4
-              78.3             44        67.8      -
2025.12        78.3             44        -         -
-              73.9             40        63.6      -
2026.02        73.9             51.2      67        -
2025.12        73.9             40        -         -
2025.12        73.9             51.2      -         -
2025.12        73.1             56.5      -         -
2026.02        73               54        67.2      -
2025.12        72.6             46.5      -         -
2026.02        70.4             52        64.8      -
-              70.4             46        63        -
2026.02        70.4             46        63        -
2026.02        70.4             54        65.4      -
2025.12        70.4             52        -         -
2025.12        70.4             46        -         -
2025.12        70.4             46        -         -
2025.12        70.4             54        -         -
-              68.7             44        61.2      -
2026.02        68.7             48        62.4      -
2025.12        68.7             44        -         -
2025.12        68.7             48        -         -
2026.02        67.8             49.2      62.1      -
2026.02        67.8             46        61.2      -
2026.02        67.8             48        61.8      -
2025.12        67.8             49.2      -         -
2025.12        67.8             46        -         -
2025.12        67.8             48        -         -
2026.02        67.1             45.2      60.4      -
2025.12        67.1             45.2      -         -
2026.02        66.1             40        58.1      -
2025.12        66.1             40        -         -
2026.02        64.3             45        58.4      -
2026.02        64.3             54        61.2      -
2025.12        64.3             45        -         -
2026.02        62.6             54        60        -
2026.02        60               46        55.7      -
2026.02        59.1             52.5      57.1      -
2026.02        59.1             36        52.1      -
2026.02        58.2             35.2      51.2      -
2026.02        57.4             36        50.9      -
2026.02        55.7             26        46.6      -
2026.02        53.9             32        47.2      -
2026.02        53               28        45.4      -
2026.02        50.4             42        47.8      -
2026.02        48.7             34        44.2      -
2026.02        48.7             36        44.8      -
2026.02        47               24        40        -
2026.02        45.7             31        41.2      -
2026.02        45.2             25        39        -
2026.02        44.4             40        43.1      -
2026.02        43.5             26        38.2      -
2026.02        37.4             22        32.7      -
2026.02        27               10        21.8      -
2026.02        14.8             18        15.8      -
2026.02        13               14        13.3      -
2026.01        -                -         -         80.3
2026.01        -                -         -         74.3
2026.01        -                -         -         80.3
2026.01        -                -         -         85.4
2026.01        -                -         -         84.7
2026.01        -                -         -         80.2