TerminalBench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Terminal-related CLI agent task | TerminalBench 2.0 | Accuracy: 57.8 | 29 |
| Terminal-related CLI agent task | TerminalBench 1.0 | Accuracy: 54.38 | 29 |
| Terminal Agentic Trajectory Generation | TerminalBench 2.0 | Score: 57.8 | 29 |
| Terminal Agentic Trajectory Generation | TerminalBench 1.0 | Score: 56.25 | 23 |
| Agentic Coding | TerminalBench 2 | Pass Rate: 81.8 | 17 |
| Prefix-risk ranking | TerminalBench (held-out) | AUPRC: 0.557 | 11 |
| Vulnerability Discovery | TerminalBench 2 snapshot 2026-04-17 | Score (%): 84.3 | 11 |
| Multimodal Agentic Tool Use | TerminalBench-O | Pass Rate: 24 | 9 |
| Terminal Agent Tasks | TerminalBench 2.0 | Pass@1 Rate: 10.79 | 9 |
| Code Generation | TerminalBench 2 | Pass@3: 39.3 | 9 |
| Agentic Coding | TerminalBench | Accuracy: 0.3375 | 7 |
| Terminal Command Execution | TerminalBench | Success Rate: 98.1 | 6 |
| Terminal-based Tool Use | TerminalBench (TBench) | Pass@1: 57.2 | 5 |
| Ranking Preservation | TerminalBench (test) | Mean Spearman Rho: 0.988 | 5 |
| Terminal Task Execution | TerminalBench 2.0 (test) | Test Accuracy: 35.2 | 4 |
| Terminal Agentic Trajectory Generation | TerminalBench | Pass@8: 45 | 4 |