
Machine Learning Engineering on MLE-bench (held-out task instances)

Top result: 58.6 Accuracy (%) (Full ExIt)

Updated 5mo ago

Evaluation Results

Method	Date	Links	Accuracy (%)	(unlabeled)
Full ExIt	2025.09		58.6	8.4
Diverge (ExIt ablation)	2025.09		57.3	10.1
GRPO + curriculum	2025.09		53	11.9
GRPO	2025.09		48	9.1
Improve (ExIt ablation)	2025.09		47.8	9.4
Base model	2025.09		4.2	2.4