| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Macro-average Exact Match | ShellOps-Pro | Macro-average Exact Match Accuracy | 53.3 | 36 |
| HYBRID | ShellOps | Combined Score | 52 | 9 |
| FILES | ShellOps | Diff Recall | 58.3 | 9 |
| STRING | ShellOps | LLM Judge Accuracy | 49.1 | 9 |
| Agentic Task Solving | ShellOps | Pass@3 | 0.462 | 9 |
| Hybrid Operations | ShellOps | Exact Match | 24.6 | 9 |
| File Editing | ShellOps | Exact Match | 26.5 | 9 |
| String Extraction | ShellOps | Exact Match | 48.5 | 9 |