
ACEBench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Agent Performance | ACEBench Agent | Agent Score: 78 | 36 |
| Tool-calling | ACEBench Extended Setting | Overall Score: 65.17 | 18 |
| Tool-calling | ACEBench Standard Setting | Overall Score: 68.92 | 18 |
| Tool Use | ACEBench Parallel | Accuracy: 81 | 15 |
| Tool Use | ACEBench Single | Accuracy: 90 | 15 |
| Multi-turn agent task | ACEBench multi-turn (test) | Process Accuracy: 76.5 | 15 |
| Agentic Performance | ACEBench Agent | End-to-End Accuracy: 60 | 15 |
| Cross-Lingual Planning | ACEBench | Score (En): 78.3 | 14 |
| Agent Capability Evaluation | ACEBench Agent | Multi-Step Reasoning Score: 95 | 13 |
| Agentic Tool-use | ACEBench (agent-task) | Multi Turn Success Rate: 97.5 | 13 |
| Function Calling | ACEBench Normal | Accuracy: 75.6 | 13 |
| Function Calling | ACEBench Normal (test) | Summary Score: 53 | 11 |
| Tool-use | ACEBench | Accuracy: 61.8 | 8 |
| Tool Use | ACEBench-en (out-of-distribution) | Normal Score: 77.9 | 8 |
| Multi-turn Dialogue | ACEBench En | MT Accuracy: 68 | 7 |
| Agentic Performance | ACEBench-en | End-to-End Accuracy: 56 | 7 |
| Agentic Performance | ACEBench-zh | Accuracy: 89.6 | 5 |