Scientific Discovery on AIME Agent
[Chart: Step 1 Mean by method; selectable series: Step 1 to Step 4, Mean and Best. Highlighted point: COAT, 37.78 (Mar 15, 2026).]
Evaluation Results
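
| Method               | Backbone    | Date    | Step 1 Mean | Step 1 Best | Step 2 Mean | Step 2 Best | Step 3 Mean | Step 3 Best | Step 4 Mean | Step 4 Best |
|----------------------|-------------|---------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|
| COAT                 | Grok-4.1-FR | 2026.03 | 37.78       | 43.33       | 37.78       | 43.33       | 37.78       | 43.33       | 38.89       | 43.33       |
| CausalPlanner (Meta) | Grok-4.1-FR | 2026.03 | 34.44       | 36.67       | 35.56       | 36.67       | 36.67       | 40.00       | 36.67       | 40.00       |
| ShinkaEvolve         | Grok-4.1-FR | 2026.03 | 33.33       | 33.33       | 34.44       | 36.67       | 34.44       | 36.67       | 34.44       | 36.67       |
| CausalEvolve         | Grok-4.1-FR | 2026.03 | 33.33       | 36.67       | 38.89       | 40.00       | 38.89       | 40.00       | 38.89       | 40.00       |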
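The page does not define how "Step k Mean" and "Step k Best" are computed; a plausible reading, which is an assumption and not stated here, is that Mean averages several runs at step k and Best is the top run. The minimal Python sketch below only re-reads the transcribed table values to show how the ranking shifts across steps: the per-method change in Step Mean from Step 1 to Step 4, and which method leads each column. The names `RESULTS`, `mean_gain`, and `leader` are hypothetical and introduced purely for illustration.

```python
# Hypothetical helper: scores transcribed from the Evaluation Results table
# (all rows use the Grok-4.1-FR backbone). Each list holds (mean, best) pairs
# for Steps 1-4. The meaning of "Mean"/"Best" is assumed, not documented.
RESULTS = {
    "COAT": [(37.78, 43.33), (37.78, 43.33), (37.78, 43.33), (38.89, 43.33)],
    "CausalPlanner (Meta)": [(34.44, 36.67), (35.56, 36.67), (36.67, 40.00), (36.67, 40.00)],
    "ShinkaEvolve": [(33.33, 33.33), (34.44, 36.67), (34.44, 36.67), (34.44, 36.67)],
    "CausalEvolve": [(33.33, 36.67), (38.89, 40.00), (38.89, 40.00), (38.89, 40.00)],
}


def mean_gain(method: str) -> float:
    """Change in Step Mean score from Step 1 to the last step (Step 4)."""
    steps = RESULTS[method]
    # Round to 2 decimals to match the table's precision and hide float noise.
    return round(steps[-1][0] - steps[0][0], 2)


def leader(step: int, metric: str = "mean") -> list[str]:
    """Methods with the top score at a 1-based step; all tied methods are returned."""
    idx = 0 if metric == "mean" else 1
    scores = {m: s[step - 1][idx] for m, s in RESULTS.items()}
    top = max(scores.values())
    return [m for m, v in scores.items() if v == top]


if __name__ == "__main__":
    for method in RESULTS:
        print(f"{method}: Step 1 -> Step 4 mean gain {mean_gain(method):+.2f}")
    for step in range(1, 5):
        print(f"Step {step} mean leader(s): {', '.join(leader(step))}")
```

Read this way, CausalEvolve shows the largest Step 1 to Step 4 mean gain (+5.56) and ties COAT on Step 4 Mean (38.89), while COAT has the highest Step Best score (43.33) at every step.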