End-to-end terminal tasks on Terminal-Bench 2

Top score: 49.6 (GPT-5)

Updated 4mo ago

Evaluation Results

Method	Links	Score
GPT-5 2025.12		49.6
Claude-Sonnet-4.5 2025.12		42.8
Kimi-K2-thinking 2025.12		35.7
Gemini-2.5-pro 2025.12		32.6
DeepSeek-V3.1-Nex-N1 2025.12		31.8
Minimax-M2 2025.12		30
GLM-4.6 2025.12		24.5
DeepSeek-V3.1 2025.12		22.2
Qwen3-32B-Nex-N1 2025.12		16.7
Qwen3-30B-A3B-Nex-N1 2025.12		8.3
Qwen3-32B 2025.12		7.9
Qwen3-30B-A3B 2025.12		6
InternLM3-8B-Nex-N1 2025.12		0