| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Tool-calling | When2Call | F1 Score: 76.8 | 42 |
| Tool-use gating | When2Call | TC Accuracy: 99.23 | 30 |
| Multiple Choice Classification | When2Call | Accuracy: 78.63 | 24 |
| Temporal Reasoning | When2Call | Performance Score: 100 | 8 |
| Social Reasoning | When2Call | Accuracy: 54.5 | 5 |
| Decision Making Reasoning | When2Call | Cumulative Score (CS): 79 | 4 |