LLM Agent Evaluation on tau-bench Airline

Accuracy: 42

Rule-Based (Medium)

Mar 9, 2026
Updated 1mo ago

Evaluation Results

Method  Links
2026.03
42-------------------613.2985801,96111,616148952
2026.03
38-------------------210.687319517,4723,8951,654554
2026.03
36-------------------012.62394394,7798,798379721
2026.03
36--------------------12.3678-13,577-1,100-
2026.03
34-------------------213.552115710,4373,140770330
2026.03
34-------------------212.43962827,9275,650641459
2026.03
32-------------------410.81266625013,327231,077
2026.03
-604423.1146.78.767945.2713197968488939930111--------
2026.03
-482823.250.2773.1764.586391927781969170111--------
2026.03
-401823.225.67.669.3895.769174993649794528111--------