BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

About

Building a deep research agent today is an exercise in glue code: the same backbone evaluated on the same benchmark can yield different reported accuracies in different papers because the harness and tool registry differ, and integrating a new foundation model into a comparable evaluation surface costs weeks of model-specific engineering. We call this the per-paper engineering tax and release BioMedArena, an open-source toolkit that both alleviates it and provides an arena for fairly comparing foundation models as deep-research agents. BioMedArena decouples six layers of biomedical agent evaluation (benchmark loading, tool exposure, tool selection, execution mode, context management, and scoring) and exposes 147 biomedical benchmarks and 75 biomedical tools across 9 functional families. Adding a new model, benchmark, or tool reduces to registering a few-line provider adapter. We further provide 6 agent harnesses with 6 context-management strategies, which equip 12 backbones with competitive research capabilities and significantly improved performance, achieving state-of-the-art (SOTA) results on 8 representative biomedical benchmarks, with an average lift of +15.03 percentage points over prior SOTA. The toolkit, configurations, and per-task traces are available at https://github.com/AI-in-Health/BioMedArena
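
To make the "few-line provider adapter" idea concrete, here is a minimal sketch of a registry pattern that keeps model adapters separate from scoring. It is an illustration only: `PROVIDERS`, `register_provider`, `echo_model`, `Task`, and `evaluate` are hypothetical names invented for this example and are not taken from the BioMedArena codebase, whose actual interfaces may look quite different.

```python
# Hypothetical sketch: none of these names come from BioMedArena itself.
from dataclasses import dataclass
from typing import Callable, Dict, List

# In this sketch, a provider adapter is a function from a prompt to a reply.
Provider = Callable[[str], str]

# Registry mapping a provider name to its adapter function.
PROVIDERS: Dict[str, Provider] = {}


def register_provider(name: str) -> Callable[[Provider], Provider]:
    """Decorator that adds an adapter to the registry under `name`."""
    def decorator(fn: Provider) -> Provider:
        if name in PROVIDERS:
            raise ValueError(f"provider {name!r} already registered")
        PROVIDERS[name] = fn
        return fn
    return decorator


@register_provider("echo-model")
def echo_model(prompt: str) -> str:
    # Stand-in for a real model call: returns the last word of the prompt.
    return prompt.split()[-1]


@dataclass
class Task:
    question: str
    answer: str


def evaluate(provider_name: str, tasks: List[Task]) -> float:
    """Exact-match accuracy, in percent, for one registered provider."""
    model = PROVIDERS[provider_name]
    correct = sum(model(t.question).strip() == t.answer for t in tasks)
    return 100.0 * correct / len(tasks)
```

Under these assumptions, adding another model means writing one more decorated function, while `evaluate` stays unchanged. That is the kind of decoupling between model integration and scoring that the abstract describes.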

Jinge Wu, Hongjian Zhou, Mingde Zeng, Jiayuan Zhu, Junde Wu, Jiazhen Pan, Sean Wu, Honghan Wu, Fenglin Liu, David A. Clifton • 2026

Related benchmarks

Task | Dataset | Result | Rank
Biomedical Intelligence Evaluation | BixBench 205 (Evaluation) | Accuracy 85.9 | 25
Laboratory Science Knowledge Evaluation | LAB-Bench 2 821 (Evaluation) | Accuracy 82.3 | 25
High-Level Expert Knowledge Evaluation | HLE Gold 149 | Accuracy (Bio) 55.1 | 25
Chemical Reasoning | SuperChem 500 (Total) | Accuracy (Text) 72.8 | 15
Medical Knowledge Evaluation | Medbullets op4 4-option 308 | Accuracy 92.2 | 13
Medical Question Answering | MedXpertQA Expert 2450 | Accuracy 72 | 13
Protein Language Modeling Evaluation | ProteinLM Bench 944 | Accuracy 77 | 13
Medical Question Answering | HealthBench Hard 1000 | Accuracy 86 | 12