| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Terminal task completion | Terminal-bench 2.0 | Pass@1 64.7 | 52 |
| Terminal task completion | Terminal-bench 1.0 | Pass@1 51 | 17 |
| End-to-end terminal tasks | Terminal-Bench 2 | Score 49.6 | 13 |
| Terminal Capability Evaluation | Terminal-Bench 2.0 | Accuracy 27.4 | 12 |
| Code Agent | Terminal-Bench Hard | Score 57.6 | 12 |
| Coding | Terminal-Bench 2.0 | Score 59.3 | 11 |
| Agentic Terminal Tasks | Terminal-Bench (TB) (test) | Success Rate 48.75 | 10 |
| Agent | Terminal-Bench | Accuracy 45 | 8 |
| Agentic Capability | Terminal Bench | Pass@1 23 | 7 |
| Command-line Interface Tasks | Terminal-Bench 2.0 | Terminus2 JSON Score 57.3 | 7 |
| Code Agent Simulation | Terminal Bench 2.0 | Accuracy 54.2 | 6 |
| Terminal Task Execution | Terminal-Bench 1.0 (test) | Avg Pass Rate 34.9 | 6 |
| Agentic Coding | Terminal Bench 2.0 | Pass@1 54.2 | 5 |
| Terminal-based task execution | Terminal-Bench 2.0 | Resolved % 65.2 | 5 |
| Agentic | Terminal Bench 2.0 | Pass@1 40.5 | 4 |
| Software Engineering Issue Resolution | Terminal Bench | Resolve Rate 32.5 | 4 |
| Agentic Reasoning | Terminal Bench Core 2.0 | Success Rate 37.5 | 3 |
| Agentic Reasoning | Terminal Bench hard | Success Rate 26.8 | 3 |
| Success Prediction | Terminal-Bench 2.0 (held-out agent data) | AUC-ROC 0.933 | 3 |
| Skill Retrieval | Terminal-Bench | Mean Skills Retrieved per Task 1.5 | 3 |
| Agent | Terminal Bench Hard English | Score 9.9 | 3 |
| Agent | Terminal Bench English 1.0 | Score 21.8 | 3 |
| Terminal Command Execution | Terminal-Bench Core 1.1 | Accuracy 34.17 | 2 |
| Terminal Capability Evaluation | Terminal-Bench 2.0 (test) | Metric - | 0 |