
Terminal-bench

Benchmarks

| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Terminal task completion | Terminal-bench 2.0 | Pass@1 | 64.7 | 63 |
| Terminal-based task execution | Terminal-Bench 2.0 | Accuracy | 64.7 | 19 |
| Agentic Coding | Terminal Bench 2.0 | Pass@1 | 59.1 | 18 |
| Terminal task completion | Terminal-bench 1.0 | Pass@1 | 51 | 17 |
| Software Engineering Reasoning | Terminal-Bench (TB2) 2.0 | Resolution Rate | 36 | 16 |
| End-to-end terminal tasks | Terminal-Bench 2 | Score | 49.6 | 13 |
| Terminal-based problem solving | Terminal-Bench 2 (out-of-distribution) | Task Success Rate | 34.12 | 12 |
| Terminal Task Execution | Terminal-Bench 1.0 | Accuracy | 51 | 12 |
| Terminal Capability Evaluation | Terminal-Bench 2.0 | Accuracy | 27.4 | 12 |
| Agent | Terminal-Bench | Accuracy | 45 | 12 |
| Code Agent | Terminal-Bench Hard | Score | 57.6 | 12 |
| Terminal task | Terminal Bench Pro | Pass@1 | 37 | 11 |
| Terminal task | Terminal Bench 1.0 | Pass@1 | 33.44 | 11 |
| Coding | Terminal-Bench 2.0 | Score | 59.3 | 11 |
| Agentic Terminal Tasks | Terminal-Bench (TB) (test) | Success Rate | 48.75 | 10 |
| Coding | Terminal-Bench 1.1 | Resolved Rate | 43 | 9 |
| Terminal-based agentic execution | Terminal-Bench | Score | 53.8 | 7 |
| Agentic Task Completion | Terminal-Bench Hard 2 (30 tasks) | Pass@1 | 56.7 | 7 |
| Agentic Task Completion | Terminal-Bench Med. 2 (55 tasks) | Pass@1 | 88.2 | 7 |
| Agentic Task Completion | Terminal-Bench Easy 4 tasks 2 | Pass@1 | 100 | 7 |
| Agentic Task Completion | Terminal-Bench All 2 | Pass@1 | 77 | 7 |
| Agentic Capability | Terminal Bench | Pass@1 | 23 | 7 |
| Command-line Interface Tasks | Terminal-Bench 2.0 | Terminus2 JSON Score | 57.3 | 7 |
| Terminal task execution | Terminal-Bench 2.0 (full) | Overall avg@5 Accuracy | 58.9 | 6 |
| Agentic Terminal Tasks | Terminal-Bench out-of-distribution 2.0 (test) | Avg@5 | 39.4 | 6 |
Showing 25 of 40 rows