| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Terminal task completion | Terminal-bench 2.0 | Pass@1: 64.7 | 43 |
| Terminal task completion | Terminal-bench 1.0 | Pass@1: 51 | 17 |
| End-to-end terminal tasks | Terminal-Bench 2 | Score: 49.6 | 13 |
| Terminal Capability Evaluation | Terminal-Bench 2.0 | Accuracy: 27.4 | 12 |
| Coding | Terminal-Bench 2.0 | Score: 59.3 | 11 |
| Agentic Terminal Tasks | Terminal-Bench (TB) (test) | Success Rate: 48.75 | 10 |
| Agent | Terminal-Bench | Accuracy: 45 | 8 |
| Code Agent Simulation | Terminal Bench 2.0 | Accuracy: 54.2 | 6 |
| Code Agent | Terminal-Bench Hard | Score: 39 | 6 |
| Terminal-based task execution | Terminal-Bench 2.0 | Resolved %: 65.2 | 5 |
| Software Engineering Issue Resolution | Terminal Bench | Resolve Rate: 32.5 | 4 |
| Agent | Terminal Bench Hard English | Score: 9.9 | 3 |
| Agent | Terminal Bench English 1.0 | Score: 21.8 | 3 |
| Terminal Task Execution | Terminal-Bench 1.0 (test) | Avg Pass Rate: 18.9 | 2 |
| Terminal Capability Evaluation | Terminal-Bench 2.0 (test) | - | 0 |