
Agentic Task Performance on τ2-Bench Airline 1.0 (test)

[Chart: CAP over time. Top CAP score: 96.4. Metrics shown: CAP, NOD. Date shown: May 12, 2026.]
Updated 21d ago

Evaluation Results

Date      CAP    NOD
2026.05   96.4   47.3
2026.05   96     48.7
2026.05   96     52
2026.05   95.3   38
2026.05   95.3   43.3
2026.05   94.7   46
2026.05   93.7   39.3
2026.05   92.6   37.3
2026.05   91.9   43.3
2026.05   91.7   43.3
2026.05   91.4   50.7
2026.05   91.2   42.7
2026.05   90.7   43.3
2026.05   90.3   42
2026.05   90.3   42.7
2026.05   90     40
2026.05   89.7   48
2026.05   89.3   40.7
2026.05   89.1   40.7
2026.05   89     36.7
2026.05   88.8   44.7
2026.05   88.8   42
2026.05   88.7   35.3
2026.05   88.3   40.7
2026.05   87.2   42.7
2026.05   87     48
2026.05   87     35.3
2026.05   86.9   42.7
2026.05   86     36
2026.05   85.9   44
2026.05   85.4   37.3
2026.05   84.7   36
2026.05   84.7   30.7
2026.05   84.3   46
2026.05   84.3   41.3
2026.05   84.3   39.3
2026.05   83.8   36.7
2026.05   83.6   34
2026.05   83.4   37.3
2026.05   82.3   31.3
2026.05   78.3   32.7
2026.05   78.3   34
2026.05   77.2   29.3
2026.05   76.7   30
2026.05   76.2   31.3
2026.05   64.2   22
2026.05   63.9   18.7
2026.05   63.5   21.3