Agent Task Completion on PinchBench
Best Pass@1: 88.7 (Gemini-3-Flash)
Apr 29, 2026
Evaluation Results
Method              Model Group     Date      Pass@1
Gemini-3-Flash      Proprietar...   2026.04   88.7
ClawGym-30A3B       ClawGym-Ag...   2026.04   86
MiniMax-M2.7        Open-Weigh...   2026.04   80.9
Claude-4.7-Opus     Proprietar...   2026.04   79.4
Claude-4.6-Sonnet   Proprietar...   2026.04   79.3
Qwen3.5-Plus        Open-Weigh...   2026.04   78.7
GLM-5.1             Open-Weigh...   2026.04   76.4
ClawGym-4B          ClawGym-Ag...   2026.04   76.4
ClawGym-8B          ClawGym-Ag...   2026.04   75.7
Claude-4.6-Opus     Proprietar...   2026.04   75.3
GPT-5.4             Proprietar...   2026.04   68.3
Kimi-K2.6           Open-Weigh...   2026.04   66.5
DeepSeek-V3.2       Open-Weigh...   2026.04   60.8
Qwen3-235A23B       Compact Op...   2026.04   60.6
Qwen3-30A3B         Compact Op...   2026.04   55.6
Qwen3-8B            Compact Op...   2026.04   54.5
Qwen3-32B           Compact Op...   2026.04   49.4