Agentic Task Completion on Terminal-Bench Hard 2 (30 tasks)
Best Pass@1: 56.7 (Codex, Apr 28, 2026)
Updated 1mo ago
[Chart: Pass@1 by date]
Evaluation Results
Method      Details                    Date     Pass@1
Codex       Base Model=GPT-5.4 (hi...  2026.04  56.7
TF-GRPO     Base Model=GPT-5.4 (hi...  2026.04  55.6
AHE         Base Model=GPT-5.4 (hi...  2026.04  53.3
NexAU0      Base Model=GPT-5.4 (hi...  2026.04  51.7
ACE         Base Model=GPT-5.4 (hi...  2026.04  48.9
terminus-2  Base Model=GPT-5.4 (hi...  2026.04  40
opencode    Base Model=GPT-5.4 (hi...  2026.04  33.3
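The Pass@1 figures are percentages over the 30 tasks. As a rough illustration of how such a score can be computed, here is a minimal sketch. It assumes the common definition of Pass@1 (the fraction of tasks solved on a single attempt, averaged over any repeated runs); the page does not state how many runs each method used, and the function name `pass_at_1` and the sample outcomes are hypothetical.

```python
def pass_at_1(outcomes):
    """Return Pass@1 as a percentage.

    outcomes: one list per task, holding a boolean for each
    independent attempt at that task (True = task solved).
    Assumption: Pass@1 is the mean per-task success rate.
    """
    per_task = [sum(attempts) / len(attempts) for attempts in outcomes]
    return 100.0 * sum(per_task) / len(per_task)


# Hypothetical single run over 30 tasks in which 17 are solved.
single_run = [[True]] * 17 + [[False]] * 13
print(round(pass_at_1(single_run), 1))  # 56.7
```

Under this assumption, a score such as 55.6, which is not a multiple of 1/30, would be consistent with averaging over more than one run per task, though the page does not confirm this.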