
Terminal-based task execution on Terminal-Bench 2.0

Accuracy: 64.7

GPT-5.3-Codex

Apr 28, 2026
Updated 1mo ago

Evaluation Results

Method    Links
2026.04    64.7    -
2026.04    62.9    -
2026.04    57.8    -
2026.04    56.9    -
2026.04    54      -
2026.04    52.4    -
2026.04    47.6    -
2026.04    43.2    -
2026.04    42.8    -
2026.04    42.2    -
2026.04    39.6    -
2026.04    38      -
2026.04    35.2    -
2026.04    29.6    -
2026.04    28.3    -
2026.04    24      -
2026.04    23.9    -
2026.04    19.9    -
2026.04    13.5    -
2026.02    -       65.2
2026.02    -       64.7
2026.02    -       62.9
2026.02    -       52.1
2026.02    -       51.9