Terminal-bench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Terminal task completion | Terminal-bench 2.0 | Pass@1: 64.7 | 52 |
| Terminal task completion | Terminal-bench 1.0 | Pass@1: 51 | 17 |
| End-to-end terminal tasks | Terminal-Bench 2 | Score: 49.6 | 13 |
| Terminal Capability Evaluation | Terminal-Bench 2.0 | Accuracy: 27.4 | 12 |
| Code Agent | Terminal-Bench Hard | Score: 57.6 | 12 |
| Coding | Terminal-Bench 2.0 | Score: 59.3 | 11 |
| Agentic Terminal Tasks | Terminal-Bench (TB) (test) | Success Rate: 48.75 | 10 |
| Agent | Terminal-Bench | Accuracy: 45 | 8 |
| Agentic Capability | Terminal Bench | Pass@1: 23 | 7 |
| Command-line Interface Tasks | Terminal-Bench 2.0 | Terminus2 JSON Score: 57.3 | 7 |
| Code Agent Simulation | Terminal Bench 2.0 | Accuracy: 54.2 | 6 |
| Terminal Task Execution | Terminal-Bench 1.0 (test) | Avg Pass Rate: 34.9 | 6 |
| Agentic Coding | Terminal Bench 2.0 | Pass@1: 54.2 | 5 |
| Terminal-based task execution | Terminal-Bench 2.0 | Resolved %: 65.2 | 5 |
| Agentic | Terminal Bench 2.0 | Pass@1: 40.5 | 4 |
| Software Engineering Issue Resolution | Terminal Bench | Resolve Rate: 32.5 | 4 |
| Agentic Reasoning | Terminal Bench Core 2.0 | Success Rate: 37.5 | 3 |
| Agentic Reasoning | Terminal Bench hard | Success Rate: 26.8 | 3 |
| Success Prediction | Terminal-Bench 2.0 (held-out agent data) | AUC-ROC: 0.933 | 3 |
| Skill Retrieval | Terminal-Bench | Mean Skills Retrieved per Task: 1.5 | 3 |
| Agent | Terminal Bench Hard English | Score: 9.9 | 3 |
| Agent | Terminal Bench English 1.0 | Score: 21.8 | 3 |
| Terminal Command Execution | Terminal-Bench Core 1.1 | Accuracy: 34.17 | 2 |
| Terminal Capability Evaluation | Terminal-Bench 2.0 (test) | Metric: - | 0 |