AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench

About

AI research agents are demonstrating great potential to accelerate scientific progress by automating the design, implementation, and training of machine learning models. We focus on methods for improving agents' performance on MLE-bench, a challenging benchmark where agents compete in Kaggle competitions to solve real-world machine learning problems. We formalize AI research agents as search policies that navigate a space of candidate solutions, iteratively modifying them using operators. By designing and systematically varying different operator sets and search policies (Greedy, MCTS, Evolutionary), we show that their interplay is critical for achieving high performance. Our best pairing of search strategy and operator set achieves a state-of-the-art result on MLE-bench lite, increasing the success rate of achieving a Kaggle medal from 39.6% to 47.7%. Our investigation underscores the importance of jointly considering the search strategy, operator design, and evaluation methodology in advancing automated machine learning.
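The formalization above (a search policy that navigates candidate solutions and modifies them with operators) can be illustrated with a minimal sketch of the Greedy case. This is a toy under stated assumptions, not the authors' implementation: a candidate solution is stood in for by a single float, the operators are arbitrary functions on that float, and `Node`, `greedy_search`, and `evaluate` are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    """One candidate in the search space (hypothetical structure)."""
    solution: float                 # stand-in for a real artifact such as a training script
    score: float                    # validation score assigned by the evaluator
    parent: Optional["Node"] = None


# An operator maps one candidate solution to a modified one.
Operator = Callable[[float], float]


def greedy_search(
    root_solution: float,
    operators: List[Operator],
    evaluate: Callable[[float], float],
    steps: int,
) -> Node:
    """Greedy policy: always expand the best node found so far."""
    best = Node(root_solution, evaluate(root_solution))
    for _ in range(steps):
        # Apply every operator to the current best candidate.
        children = []
        for op in operators:
            child_solution = op(best.solution)
            children.append(Node(child_solution, evaluate(child_solution), parent=best))
        # Move only if some child strictly improves on the current best.
        candidate = max(children, key=lambda node: node.score)
        if candidate.score > best.score:
            best = candidate
    return best


if __name__ == "__main__":
    # Toy objective with its maximum at x = 3 (an assumption for illustration).
    toy_operators = [lambda x: x + 1.0, lambda x: x - 1.0, lambda x: x * 0.5]
    result = greedy_search(0.0, toy_operators, lambda x: -(x - 3.0) ** 2, steps=5)
    print(result.solution, result.score)
```

Swapping the selection rule inside the loop (for example, choosing which node to expand by a tree-search or population-based criterion) while holding the operator list fixed is one way to picture how a search policy and an operator set can be varied independently, which is the interplay the abstract describes.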

Edan Toledo, Karen Hambardzumyan, Martin Josifoski, Rishi Hazra, Nicolas Baldwin, Alexis Audran-Reiss, Michael Kuchnik, Despoina Magka, Minqi Jiang, Alisia Maria Lupidi, Andrei Lupu, Roberta Raileanu, Kelvin Niu, Tatiana Shavrina, Jean-Christophe Gagnon-Audet, Michael Shvartsman, Shagun Sodhani, Alexander H. Miller, Abhishek Charnalia, Derek Dunfield, Carole-Jean Wu, Pontus Stenetorp, Nicola Cancedda, Jakob Nicolaus Foerster, Yoram Bachrach • 2025

Related benchmarks

Task                                     Dataset                    Result                       Rank
Autonomous Machine Learning Engineering  MLE-Bench Lite             Any Medal Rate 68.18         57
Machine learning engineering             MLE-bench-30 (test)        Percentile Rank 39.5         22
ML Engineering                           MLE-Bench official (test)  Medal Rate (Low) 55          19
Automated AI Research                    MLE-Bench official (full)  Valid Submission Rate 98.2   13
Machine learning engineering             MLE-Bench full official    Medal Rate (Low) 55          11
Molecular property prediction            TDC ADMET (test)           Avg Rank (Abs) 7.83          11
Machine learning engineering             MLE-bench Medium           Medal Rate 21.97             5
Machine learning engineering             MLE-bench Low              Medal Rate 55                5
Machine learning engineering             MLE-bench (All)            Medal Rate 31.6              5
Machine learning engineering             MLE-bench Hard             Medal Rate 21.67             5
