ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

About

Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures undetectable by surface-level evaluation: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this through three contributions. First, Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence source. Second, ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. Third, CoE Audit, a post-hoc audit whose four integrity checks -- score verification, specification violation, reference verification, and method-code alignment -- apply uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, every baseline exhibits at least one systematic failure mode: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15), while matching or exceeding human expert performance on all five tasks. ScientistOne further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art on Parameter Golf and gold medals on MLE-Bench tasks where baselines fail entirely.
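As a rough illustration of the idea that every claim should be traceable to an evidence source, and of how the four audit checks might be tallied across papers, here is a minimal Python sketch. It is not the authors' implementation: the `Claim` and `PaperAudit` schemas, the `evidence_source` field, and the `pass_rate` helper are all hypothetical names introduced for this example. Only the four check names come from the abstract.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# The four integrity checks named in the abstract.
CHECKS = (
    "score_verification",
    "specification_violation",
    "reference_verification",
    "method_code_alignment",
)


@dataclass
class Claim:
    """One claim in a paper plus a pointer to its evidence (hypothetical schema)."""
    text: str
    # Assumed representation: a log path, citation key, or code file; None if missing.
    evidence_source: Optional[str] = None

    def is_traceable(self) -> bool:
        return bool(self.evidence_source)


@dataclass
class PaperAudit:
    """Audit record for one paper (hypothetical schema)."""
    paper_id: str
    claims: List[Claim] = field(default_factory=list)
    # check name -> True (pass), False (fail), None (not applicable)
    results: Dict[str, Optional[bool]] = field(default_factory=dict)


def untraceable_claims(audit: PaperAudit) -> List[Claim]:
    """Return the claims that have no recorded evidence source."""
    return [c for c in audit.claims if not c.is_traceable()]


def pass_rate(audits: List[PaperAudit], check: str) -> Tuple[int, int]:
    """Return (passed, applicable) for one check, skipping papers where it does not apply."""
    outcomes = [a.results.get(check) for a in audits]
    applicable = [o for o in outcomes if o is not None]
    return sum(applicable), len(applicable)
```

Returning a (passed, applicable) pair rather than a bare percentage mirrors the fraction style the abstract uses when reporting results such as 12/12 and 14/15. How the actual system decides which papers a check applies to is not stated here, so the `None` convention is only an assumption.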

Rui Meng, Bhavana Dalvi Mishra, Jiefeng Chen, Chun-Liang Li, Palash Goyal, Mihir Parmar, Yiwen Song, Yale Song, Rajarishi Sinha, Parthasarathy Ranganathan, Burak Gokturk, Jinsung Yoon, Tomas Pfister • 2026

Related benchmarks

Task	Dataset	Result
Scientific Paper Reviewing	Public paper outputs from agentic systems	Mean Rating Score: 4	8
3D Object Detection	MLE-Bench 3D Object Detection	Score: 17.63	2
Code Understanding	MLE-Bench AI4Code	Score: 83.56	2
Fine-grained Recognition	MLE-Bench iNaturalist 2019 FGVC6	Score: 24.45	2
Medical Image Classification	MLE-Bench RSNA Brain Tumor	Score: 0.6518	2
Fine-grained Recognition	MLE-Bench iMet 2020 FGVC7	Score: 67.91	2
Constrained Language Modeling	Parameter Golf OpenAI 2026	Score: 1.06	1

