DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively

About

While previous AI Scientist systems can generate novel findings, they often lack the focus to produce scientifically valuable contributions that address pressing human-defined challenges. We introduce DeepScientist, a system designed to overcome this by conducting goal-oriented, fully autonomous scientific discovery over month-long timelines. It formalizes discovery as a Bayesian Optimization problem, operationalized through a hierarchical evaluation process consisting of "hypothesize, verify, and analyze". Leveraging a cumulative Findings Memory, this loop intelligently balances the exploration of novel hypotheses with exploitation, selectively promoting the most promising findings to higher-fidelity levels of validation. Consuming over 20,000 GPU hours, the system generated about 5,000 unique scientific ideas and experimentally validated approximately 1,100 of them, ultimately surpassing human-designed state-of-the-art (SOTA) methods on three frontier AI tasks by 183.7%, 1.9%, and 7.9%. This work provides the first large-scale evidence of an AI achieving discoveries that progressively surpass human SOTA on scientific tasks, producing valuable findings that genuinely push the frontier of scientific discovery. To facilitate further research into this process, we will open-source all experimental logs and system code at https://github.com/ResearAI/DeepScientist/.
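The exploration/exploitation balance and selective promotion described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `Finding` record, the `ucb` score, the `promote_top` helper, and all numbers are hypothetical, and an upper-confidence-bound (UCB) rule is used only because it is a common acquisition function in Bayesian Optimization. The abstract does not say which rule the system uses.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """Hypothetical record in a findings memory (not the paper's schema)."""
    name: str
    mean: float      # estimated value of the idea (assumed surrogate output)
    std: float       # uncertainty of that estimate
    level: int = 0   # 0 = hypothesize, 1 = verify, 2 = analyze


def ucb(finding: Finding, kappa: float = 1.0) -> float:
    """Upper confidence bound: exploitation (mean) plus exploration (kappa * std)."""
    return finding.mean + kappa * finding.std


def promote_top(findings: list[Finding], level: int, budget: int,
                kappa: float = 1.0) -> list[Finding]:
    """Promote the `budget` highest-scoring findings at `level` to the next level."""
    candidates = [f for f in findings if f.level == level]
    ranked = sorted(candidates, key=lambda f: ucb(f, kappa), reverse=True)
    chosen = ranked[:budget]
    for f in chosen:
        f.level = level + 1
    return chosen


# Illustrative values only.
memory = [
    Finding("idea-a", mean=0.60, std=0.05),
    Finding("idea-b", mean=0.50, std=0.30),
    Finding("idea-c", mean=0.55, std=0.10),
]
promoted = promote_top(memory, level=0, budget=1, kappa=1.0)
print([f.name for f in promoted])
```

With `kappa=1.0` the uncertain `idea-b` is promoted, because its exploration bonus outweighs its lower mean; with `kappa=0.0` the same call would pick `idea-a`, which is pure exploitation. Only a small budget moves to the next, more expensive level, which mirrors the idea of reserving higher-fidelity validation for the most promising findings.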

Yixuan Weng, Minjun Zhu, Qiujie Xie, Qiyao Sun, Zhen Lin, Sifan Liu, Yue Zhang • 2025

Related benchmarks

Task	Dataset	Result
Scientific Paper Reviewing	Public paper outputs from agentic systems	Mean Rating Score: 4.13	8
Research Automation	Three real research tasks Human researcher evaluation	Alignment: 6.333	7
Automated Research	20 LLM-simulated scientists	Alignment Score: 4.504	7
Fine-grained Recognition	MLE-Bench iMet 2020 FGVC7	Score: 68.04	2
3D Object Detection	MLE-Bench 3D Object Detection	Score: 0.00e+0	2
Code Understanding	MLE-Bench AI4Code	Score: 69.64	2
Fine-grained Recognition	MLE-Bench iNaturalist 2019 FGVC6	Score: 21.58	2
Medical Image Classification	MLE-Bench RSNA Brain Tumor	Score: 0.6377	2
