Agent Execution on tau-Bench (test)

Best Execution Accuracy: 59 (Gemini-2.5 Pro)

Updated 4mo ago

Evaluation Results

Method	Links	Execution Accuracy
Gemini-2.5 Pro 2026.03		59
Claude-3.5-Sonnet 2026.03		56
GPT-4o 2026.03		54
Qwen3-8B Agentic GRPO 2026.03		42
Qwen3-8B SFT 2026.03		36
Qwen3-8B Base 2026.03		33
xLAM-2-70B 2026.03		17
ToolAce 2026.03		15