| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Terminal task completion | Terminal-bench 2.0 | Pass@1 | 64.7 | 63 |
| Terminal-based task execution | Terminal-Bench 2.0 | Accuracy | 64.7 | 19 |
| Agentic Coding | Terminal Bench 2.0 | Pass@1 | 59.1 | 18 |
| Terminal task completion | Terminal-bench 1.0 | Pass@1 | 51 | 17 |
| Software Engineering Reasoning | Terminal-Bench (TB2) 2.0 | Resolution Rate | 36 | 16 |
| End-to-end terminal tasks | Terminal-Bench 2 | Score | 49.6 | 13 |
| Terminal-based problem solving | Terminal-Bench 2 (out-of-distribution) | Task Success Rate | 34.12 | 12 |
| Terminal Task Execution | Terminal-Bench 1.0 | Accuracy | 51 | 12 |
| Terminal Capability Evaluation | Terminal-Bench 2.0 | Accuracy | 27.4 | 12 |
| Agent | Terminal-Bench | Accuracy | 45 | 12 |
| Code Agent | Terminal-Bench Hard | Score | 57.6 | 12 |
| Terminal task | Terminal Bench Pro | Pass@1 | 37 | 11 |
| Terminal task | Terminal Bench 1.0 | Pass@1 | 33.44 | 11 |
| Coding | Terminal-Bench 2.0 | Score | 59.3 | 11 |
| Agentic Terminal Tasks | Terminal-Bench (TB) (test) | Success Rate | 48.75 | 10 |
| Coding | Terminal-Bench 1.1 | Resolved Rate | 43 | 9 |
| Terminal-based agentic execution | Terminal-Bench | Score | 53.8 | 7 |
| Agentic Task Completion | Terminal-Bench Hard 2 (30 tasks) | Pass@1 | 56.7 | 7 |
| Agentic Task Completion | Terminal-Bench Med. 2 (55 tasks) | Pass@1 | 88.2 | 7 |
| Agentic Task Completion | Terminal-Bench Easy 4 tasks 2 | Pass@1 | 100 | 7 |
| Agentic Task Completion | Terminal-Bench All 2 | Pass@1 | 77 | 7 |
| Agentic Capability | Terminal Bench | Pass@1 | 23 | 7 |
| Command-line Interface Tasks | Terminal-Bench 2.0 | Terminus2 JSON Score | 57.3 | 7 |
| Terminal task execution | Terminal-Bench 2.0 (full) | Overall avg@5 Accuracy | 58.9 | 6 |
| Agentic Terminal Tasks | Terminal-Bench out-of-distribution 2.0 (test) | Avg@5 | 39.4 | 6 |