Long-Horizon Evaluation with Simulator Feedback on 10 manifest-defined shared-core domains (de novo)
[Chart: Success Rate (1-turn). Highlighted entry: Sonnet 4.5, 29, Mar 13, 2026]
Updated 1mo ago
Evaluation Results
| Method | Protocol | Date | Success Rate (1-turn) | Success Rate (5-turn) | Success Rate (20-turn) | Success Rate Delta (1-turn to 5-turn) | Success Rate Delta (5-turn to 20-turn) |
|---|---|---|---|---|---|---|---|
| Sonnet 4.5 | 5-turn (3 att... | 2026.03 | 29 | 66.5 | 69.5 | 37.5 | 3 |
| GPT-5.2 | 5-turn (3 att... | 2026.03 | 25.6 | 62.5 | 67 | 36.9 | 4.5 |
| Gemini 3.1 Pro | 5-turn (3 att... | 2026.03 | 24 | 57.3 | 60.5 | 33.4 | 3.2 |
| Opus 4.6 | 5-turn (3 att... | 2026.03 | 23.8 | 65.5 | 76 | 41.7 | 10.5 |
| Sonnet 4.6 | 5-turn (3 att... | 2026.03 | 22.9 | 66.5 | 71.5 | 43.6 | 5 |
| Gemini 2.0 Flash | 5-turn (3 att... | 2026.03 | 19.5 | 43.2 | 60.5 | 23.7 | 17.3 |
| GPT-4o | 5-turn (3 att... | 2026.03 | 12.8 | 53 | 55 | 40.2 | 2 |
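The two delta columns appear to be plain differences between adjacent success-rate columns. The sketch below recomputes them from the results listed above, assuming (this is not stated on the page) that each delta is the later-turn rate minus the earlier-turn rate. The names `rows`, `deltas`, and `mismatches` are hypothetical helpers for illustration, not part of the benchmark.

```python
# Values copied from the results listed above, in the order:
# (1-turn, 5-turn, 20-turn, listed delta 1->5, listed delta 5->20)
rows = {
    "Sonnet 4.5":       (29.0, 66.5, 69.5, 37.5, 3.0),
    "GPT-5.2":          (25.6, 62.5, 67.0, 36.9, 4.5),
    "Gemini 3.1 Pro":   (24.0, 57.3, 60.5, 33.4, 3.2),
    "Opus 4.6":         (23.8, 65.5, 76.0, 41.7, 10.5),
    "Sonnet 4.6":       (22.9, 66.5, 71.5, 43.6, 5.0),
    "Gemini 2.0 Flash": (19.5, 43.2, 60.5, 23.7, 17.3),
    "GPT-4o":           (12.8, 53.0, 55.0, 40.2, 2.0),
}


def deltas(one_turn, five_turn, twenty_turn):
    """Assumed definition: delta = later-turn rate minus earlier-turn rate."""
    return round(five_turn - one_turn, 1), round(twenty_turn - five_turn, 1)


def mismatches(table, tol=0.1):
    """Return models whose listed deltas differ from the recomputed ones by more than tol."""
    bad = {}
    for model, (one, five, twenty, listed_15, listed_520) in table.items():
        calc_15, calc_520 = deltas(one, five, twenty)
        # Small epsilon so a difference of exactly one rounding step still passes.
        if abs(calc_15 - listed_15) > tol + 1e-9 or abs(calc_520 - listed_520) > tol + 1e-9:
            bad[model] = (calc_15, calc_520)
    return bad


if __name__ == "__main__":
    for model, (one, five, twenty, _, _) in rows.items():
        print(model, deltas(one, five, twenty))
    print("mismatches:", mismatches(rows))
```

Under this assumption every listed delta is reproduced to within 0.1. The only visible difference is Gemini 3.1 Pro, where 57.3 minus 24 gives 33.3 against a listed 33.4, which would be consistent with the page computing deltas from unrounded rates.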