Early-stopping budget estimation on Search-R1, Sokoban, SWE-bench, and Warehouse Aggregate (test)
[Chart: False-Abort Rate. Data point shown: Claude Opus 4.7, 2.2, May 29, 2026. Selectable metrics: False-Abort Rate, Saved Tokens, False-Abort Count, Stopped Failed Rollouts Count.]
Evaluation Results
Method             Date     False-Abort Rate  Saved Tokens  False-Abort Count  Stopped Failed Rollouts Count
Claude Opus 4.7    2026.05  2.2               28.2          50                 62
Gemini 3.1 Pro     2026.05  2.8               55.7          63                 123
Claude Sonnet 4.6  2026.05  3.3               49.6          76                 101
Qwen3 235B         2026.05  4.9               38.8          190                140
GPT-5.2 Instant    2026.05  6.6               64.1          183                124