
ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

About

Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures undetectable by surface-level evaluation: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this through three contributions. First, Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence source. Second, ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. Third, CoE Audit, a post-hoc audit whose four integrity checks -- score verification, specification violation, reference verification, and method-code alignment -- apply uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, every baseline exhibits at least one systematic failure mode: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15), while matching or exceeding human expert performance on all five tasks. ScientistOne further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art on Parameter Golf and gold medals on MLE-Bench tasks where baselines fail entirely.
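As an illustration of the traceability requirement described above, the following is a minimal sketch and not the authors' implementation. The names `Evidence`, `Claim`, `untraceable`, and `hallucination_rate`, the example claims, and the file path are all hypothetical, introduced here only to show what "every claim is traceable to its evidence source" could look like as a data structure. The only figure taken from the abstract is the 0/337 reference count.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Evidence:
    """A pointer to where a claim is supported (hypothetical schema)."""
    kind: str      # e.g. "citation", "experiment_log", "code"
    locator: str   # e.g. a file path or reference key (illustrative only)


@dataclass
class Claim:
    """A statement in a manuscript plus the evidence it is traced to."""
    text: str
    evidence: List[Evidence] = field(default_factory=list)


def untraceable(claims: List[Claim]) -> List[Claim]:
    """Return the claims that have no evidence attached."""
    return [c for c in claims if not c.evidence]


def hallucination_rate(unverified: int, total: int) -> float:
    """Fraction of references that could not be verified."""
    if total <= 0:
        raise ValueError("total must be positive")
    return unverified / total


# Made-up example claims, for illustration only.
claims = [
    Claim(
        "Model A scores 0.91 on the validation split.",
        [Evidence("experiment_log", "runs/example/metrics.json")],
    ),
    Claim("Prior work reports a similar trend."),  # no evidence attached
]

flagged = untraceable(claims)
print([c.text for c in flagged])

# 0 unverified references out of 337, as reported in the abstract.
print(hallucination_rate(0, 337))
```

In this sketch a claim with an empty evidence list is simply flagged. An actual audit would also need to check that each evidence pointer resolves and supports the claim, which the sketch does not attempt.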

Rui Meng, Bhavana Dalvi Mishra, Jiefeng Chen, Chun-Liang Li, Palash Goyal, Mihir Parmar, Yiwen Song, Yale Song, Rajarishi Sinha, Parthasarathy Ranganathan, Burak Gokturk, Jinsung Yoon, Tomas Pfister (2026)

Related benchmarks

Task | Dataset | Result | Rank
3D Object Detection | MLE-Bench 3D Object Detection | Score 17.63 | 2
Code Understanding | MLE-Bench AI4Code | Score 83.56 | 2
Fine-grained Recognition | MLE-Bench iNaturalist 2019 FGVC6 | Score 24.45 | 2
Medical Image Classification | MLE-Bench RSNA Brain Tumor | Score 0.6518 | 2
Fine-grained Recognition | MLE-Bench iMet 2020 FGVC7 | Score 67.91 | 2
Constrained Language Modeling | Parameter Golf OpenAI 2026 | Score 1.06 | 1
