Machine Learning Engineering on MLE-bench (held-out task instances)
[Chart: Accuracy (%) by date. Plotted point: Full ExIt, 58.6, Sep 4, 2025. Available metrics: Accuracy (%), Net improvement (Delta_16).]
Updated 3mo ago
Evaluation Results
Method                    Configuration               Date     Accuracy (%)  Net improvement (Delta_16)
Full ExIt                 Backbone=DeepSeek-R1-D...   2025.09  58.6          8.4
Diverge (ExIt ablation)   Backbone=DeepSeek-R1-D...   2025.09  57.3          10.1
GRPO + curriculum         Backbone=DeepSeek-R1-D...   2025.09  53            11.9
GRPO                      Backbone=DeepSeek-R1-D...   2025.09  48            9.1
Improve (ExIt ablation)   Backbone=DeepSeek-R1-D...   2025.09  47.8          9.4
Base model                Backbone=DeepSeek-R1-D...   2025.09  4.2           2.4