| ToolBench | Attention Buckets + DFSDT-Retriever | Average Pass Rate 71.3 | | 29 | 4d ago |
| StableToolBench | ReAct+PLAY2PROMPT | I2 Category Success 72.8 | | 28 | 4d ago |
| Synthetic Data (test) | RISE | Task Accuracy 92.29 | | 24 | 4d ago |
| BFCL Multi-turn | | Accuracy 54.75 | | 24 | 4d ago |
| ToolBench | | Energy (Wh) 5.6 | | 18 | 4d ago |
| ToolBench | Mistral Large | Average Token Length 127 | | 18 | 4d ago |
| Tool-use domain Aggregate | Agent-Dice | AvgZ Score 0.79 | | 18 | 4d ago |
| Tool-use domain Subset 3 | | Functionality Score 100 | | 18 | 4d ago |
| Tool-use domain Subset 2 | | Func Success Rate 99.63 | | 18 | 4d ago |
| Tool-use domain Subset 1 | | Func 99.63 | | 18 | 4d ago |
| Tool-use domain Subset 0 | | Func Success Rate 99.64 | | 18 | 4d ago |
| Tool Use Non-Live | DARE+TA | Para 0.925 | | 15 | 4d ago |
| Tool Use Live | Base | Para Score 56.25 | | 15 | 4d ago |
| Tool use | SDPO (on-policy) | Avg@16 68.5 | | 14 | 4d ago |
| Tau-Bench | | TAU-AIR Score 67.5 | | 14 | 4d ago |
| Task-Bench | Reasoningbank | Task Completion Rate 58.2 | | 14 | 4d ago |
| StableToolBench cost-augmented | INTENT | PR 76 | | 14 | 4d ago |
| ToolBench 50 APIs v1 (test) | Llama-2-Chat-7B | Wellformedness 99.2 | | 14 | 4d ago |
| StableToolBench G1 Category | Trace-Based | SL 76.8 | | 12 | 4d ago |
| BFCL | Qwen 3 VL 32B Instruct | Accuracy 66.3 | | 12 | 4d ago |
| LitQA 2 | Olmo 3.1 32B Instruct | Accuracy 55.6 | | 12 | 4d ago |
| SimpleQA | Qwen 3 VL 32B Instruct | Accuracy 91.5 | | 12 | 4d ago |
| BrowseComp Domains (Domain-specific (9) + Full Search) | | Accuracy 27.8 | | 10 | 4d ago |
| BrowseComp Domain-specific (9) Search | | Accuracy 22.5 | | 10 | 4d ago |
| BFCL Single-Turn | AWPO | OA 84.11 | | 10 | 4d ago |