BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

About

Reproducing and comparing deep research agents today is hard: the same backbone evaluated on the same benchmark can report different accuracies across papers because the harness and tool registry differ, and integrating a new model into a comparable evaluation surface costs weeks of model-specific engineering. These are symptoms of a broader reproducibility problem in research on deep research agents. Here, we introduce BioMedArena, an open-source toolkit that addresses this reproducibility gap and provides an arena for comparing deep research agents under a shared evaluation environment. BioMedArena decouples six layers of biomedical agent evaluation -- benchmark loading, tool exposure, tool selection, harness mode, context management, and scoring -- and exposes 166 biomedical benchmarks and 75 biomedical tools across 9 functional families. Adding a new model, benchmark, or tool requires only a few-line provider adapter. Beyond evaluation infrastructure, BioMedArena ships a library of high-quality reference components: 6 agent harnesses (including our proposed Mutual-Evolve) and 6 context-management strategies, any of which can be paired with any backbone. Equipping these components substantially improves all 12 backbones; on each of 8 representative biomedical benchmarks, the best-equipped backbone surpasses the prior state of the art (SOTA) by 15.01 percentage points on average. The toolkit, configurations, and per-task traces are available at https://github.com/AI-in-Health/BioMedArena.
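To make the idea of decoupled layers and a few-line provider adapter concrete, here is a minimal sketch in Python. It is not taken from the BioMedArena codebase: the names `PROVIDERS`, `register_provider`, `EvalConfig`, `SCORERS`, and `evaluate`, as well as the toy model and tasks, are all hypothetical and chosen only for illustration. The sketch assumes that each of the six layers can be selected by a plain configuration value and that a model is plugged in as a function from prompt to answer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: model name -> function mapping a prompt to an answer.
# These names are illustrative assumptions, not the toolkit's real API.
PROVIDERS: Dict[str, Callable[[str], str]] = {}


def register_provider(name: str):
    """Decorator that adds a model adapter to the registry under `name`."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        PROVIDERS[name] = fn
        return fn
    return decorator


@dataclass(frozen=True)
class EvalConfig:
    """One field per layer named in the abstract; values are placeholders."""
    benchmark: str
    tool_exposure: Tuple[str, ...]
    tool_selection: str
    harness: str
    context_management: str
    scoring: str


def exact_match(prediction: str, gold: str) -> float:
    """Case-insensitive exact match, returned as 1.0 or 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())


# Hypothetical scorer table so the scoring layer is swappable by name.
SCORERS: Dict[str, Callable[[str, str], float]] = {"exact_match": exact_match}


# The "few-line adapter": a stand-in backbone that returns the last word
# of the prompt. A real adapter would call a model API here instead.
@register_provider("echo-model")
def echo_model(prompt: str) -> str:
    return prompt.split()[-1]


def evaluate(model_name: str, config: EvalConfig,
             tasks: List[Tuple[str, str]]) -> float:
    """Run every (question, gold) pair through the model and average the scores."""
    model = PROVIDERS[model_name]
    score = SCORERS[config.scoring]
    results = [score(model(question), gold) for question, gold in tasks]
    return sum(results) / len(results)


config = EvalConfig(
    benchmark="toy-benchmark",
    tool_exposure=(),
    tool_selection="none",
    harness="single-turn",
    context_management="none",
    scoring="exact_match",
)
tasks = [("Name the gene: TP53", "tp53"), ("Name the gene: BRCA1", "BRCA2")]
accuracy = evaluate("echo-model", config, tasks)
print(accuracy)
```

In this sketch, swapping the backbone means registering one more function, and swapping the scorer means changing one configuration string; the rest of the evaluation loop is untouched. Only the provider and scoring layers are wired into `evaluate` here; the remaining fields are carried in the configuration but left unused to keep the example short.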

Jinge Wu, Hongjian Zhou, Mingde Zeng, Jiayuan Zhu, Junde Wu, Jiazhen Pan, Ayush Noori, Sean Wu, Honghan Wu, Fenglin Liu, David A. Clifton • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Biomedical Intelligence Evaluation	BixBench 205 (Evaluation)	Accuracy	85.9	25
Laboratory Science Knowledge Evaluation	LAB-Bench 2 821 (Evaluation)	Accuracy	82.3	25
High-Level Expert Knowledge Evaluation	HLE Gold 149	Accuracy (Bio)	55.1	25
Chemical Reasoning	SuperChem 500 (Total)	Accuracy (Text)	72.8	15
Medical Knowledge Evaluation	Medbullets op4 4-option 308	Accuracy	92.2	13
Medical Question Answering	MedXpertQA Expert 2450	Accuracy	72	13
Protein Language Modeling Evaluation	ProteinLM Bench 944	Accuracy	77	13
Medical Question Answering	HealthBench Hard 1000	Accuracy	86	12

