| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Terminal-related CLI agent task | TerminalBench 2.0 | Accuracy 57.8 | 29 |
| Terminal-related CLI agent task | TerminalBench 1.0 | Accuracy 54.38 | 29 |
| Terminal Agentic Trajectory Generation | TerminalBench 2.0 | Score 57.8 | 29 |
| Terminal Agentic Trajectory Generation | TerminalBench 1.0 | Score 56.25 | 23 |
| Agentic Coding | TerminalBench 2 | Pass Rate 81.8 | 17 |
| Prefix-risk ranking | TerminalBench (held-out) | AUPRC 0.557 | 11 |
| Vulnerability Discovery | TerminalBench 2 snapshot 2026-04-17 | Score (%) 84.3 | 11 |
| Multimodal Agentic Tool Use | TerminalBench-O | Pass Rate 24 | 9 |
| Terminal Agent Tasks | TerminalBench 2.0 | Pass@1 Rate 10.79 | 9 |
| Code Generation | TerminalBench 2 | Pass@3 39.3 | 9 |
| Agentic Coding | TerminalBench | Accuracy 0.3375 | 7 |
| Terminal Command Execution | TerminalBench | Success Rate 98.1 | 6 |
| Terminal-based Tool Use | TerminalBench (TBench) | Pass@1 57.2 | 5 |
| Ranking Preservation | TerminalBench (test) | Mean Spearman Rho 0.988 | 5 |
| Terminal Task Execution | TerminalBench 2.0 (test) | Test Accuracy 35.2 | 4 |
| Terminal Agentic Trajectory Generation | TerminalBench | Pass@8 45 | 4 |