| Dataset Name | SOTA Method | Metric | Value | Trend | |
|---|---|---|---|---|---|
| PYMATH (test) | GPT-5-Thinking | Final Accuracy | 71.9 | 14 | 1mo ago |
| BFCL Multi-Turn v3 | APIGen-MT | Overall Score | 69.1 | 14 | 3mo ago |
| API-Bank | GenEnv | Success Rate | 79.1 | 12 | 3mo ago |
| MINT-Bench | LLAMA PRO - INSTRUCT | Success Rate (Turn 1) | 9.85 | 5 | 3mo ago |
| General Tool-Augmented LLM Capabilities Qualitative Comparison Survey | - | - | - | 0 | 3mo ago |