SOTA Tool-augmented Reasoning benchmarks and papers with code

Benchmarks

Dataset Name	SOTA Method	Metric	Value	Count	Updated
PYMATH (test)	GPT-5-Thinking	Final Accuracy	71.9	14	3mo ago
BFCL Multi-Turn v3	APIGen-MT	Overall Score	69.1	14	4mo ago
API-Bank	GenEnv	Success Rate	79.1	12	4mo ago
MINT-Bench	LLAMA PRO - INSTRUCT	Success Rate (Turn 1)	9.85	5	4mo ago
General Tool-Augmented LLM Capabilities Qualitative Comparison Survey	-	-	-	0	4mo ago
