Tool Use

Benchmarks

Dataset Name	SOTA Method	Metric	Score	Count	Last Updated
ToolBench	Llama 3.3-70B	Average Success Rate (ASR)	99.61	62	1mo ago
ToolBench	Qwen2.5-7B-Instruct-CAST	Average Pass Rate	80.67	53	2mo ago
τ-Bench	ProPlay	Average Pass@1	85.5	45	1mo ago
BFCL		Accuracy	94	45	1mo ago
RoTBench Multi-turn	PA-Tool	Tool Selection Accuracy	72.9	35	3mo ago
RoTBench Single-turn	PA-Tool	Tool Selection	84.8	35	3mo ago
BFCL V4		Accuracy	76.5	33	23d ago
MCPMark		Total Success Rate	57.5	31	23d ago
ToolBench (test)	AgentHER-MJ	Pass@1	83.7	28	4mo ago
StableToolBench	ReAct+PLAY2PROMPT	I2 Category Success	72.8	28	1mo ago
ToolAlpaca	EMA	Tool Use Success Rate	77.9	26	2mo ago
Multi-Calculator	MedCalc-Pro Agent Framework	R-F1	59.97	24	18d ago
Single-Calculator	MedCalc-Pro Agent Framework	R-F1	99.62	24	18d ago
tool-use (test)	PBSD	Accuracy	72	24	2mo ago
Synthetic Data (test)	RISE	Task Accuracy	92.29	24	4mo ago
BFCL Multi-turn		Accuracy	54.75	24	4mo ago
RobustBench-TC Perturbed 1.0 (test)	Qwen3-14B	Accuracy (Perturbed)	52.9	21	2mo ago
RobustBench-TC Clean 1.0 (test)	LoopTool-32B	Clean Accuracy	77.9	21	2mo ago
Aggregate Performance	NRE	Avg1 Score	76.8	20	18d ago
LiveMCPBench	GEPA	LiveMCP Score	77	20	18d ago
BFCL		BFCL v4 Score	77	20	18d ago
AgentHarm (public test)		TSR	50	20	1mo ago
Evaluation dataset	PORTool	Accuracy	51.98	20	2mo ago
BFCL v3 (test)	PRS	Base Score	31	19	1mo ago
MCP-Atlas		Pass Rate	65.2	19	1mo ago

Showing 25 of 138 rows