Enterprise Task Completion on EnterpriseBench
[Chart: Execution Score. Leading entry: GPT-4o, 0.47 (Mar 23, 2026)]
Evaluation Results
Method                    Details                     Date      Execution Score
GPT-4o                    Evaluator=GPT-4o            2026.03   0.47
GPT-4o                    Evaluator model=Claude...   2026.03   0.44
ToolAce                   Evaluator=GPT-4o            2026.03   0.41
XLAM-2-70B                Evaluator=GPT-4o            2026.03   0.40
ToolAce                   Evaluator model=Claude...   2026.03   0.39
XLAM-2-70B                Evaluator model=Claude...   2026.03   0.39
Qwen3-4B (Agentic GRPO)   Evaluator=GPT-4o, Vari...   2026.03   0.38
Qwen3-4B (Agentic GRPO)   Evaluator model=Claude...   2026.03   0.36
Qwen3-4B (SFT)            Evaluator=GPT-4o, Vari...   2026.03   0.32
Qwen3-4B (SFT)            Evaluator model=Claude...   2026.03   0.31
Qwen3-4B (Base)           Evaluator=GPT-4o, Vari...   2026.03   0.27
Qwen3-4B (Base)           Evaluator model=Claude...   2026.03   0.25