BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents
About
Building a deep research agent today is an exercise in glue code: the same backbone evaluated on the same benchmark can report different accuracies in different papers because the harness and tool registry differ, and integrating a new foundation model into a comparable evaluation surface costs weeks of model-specific engineering. We call this the per-paper engineering tax and release BioMedArena, an open-source toolkit that both alleviates it and provides an arena for fair comparison of foundation models evaluated as deep-research agents. BioMedArena decouples six layers of biomedical agent evaluation -- benchmark loading, tool exposure, tool selection, execution mode, context management, and scoring -- and exposes 147 biomedical benchmarks and 75 biomedical tools across 9 functional families. Adding a new model, benchmark, or tool reduces to registering a few-line provider adapter. We further provide 6 agent harnesses with 6 context-management strategies. These give 12 backbones competitive research capabilities and achieve state-of-the-art (SOTA) results on 8 representative biomedical benchmarks, with an average lift of +15.03 percentage points over prior SOTA. The toolkit, configurations, and per-task traces are available at https://github.com/AI-in-Health/BioMedArena
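The "few-line provider adapter" idea can be illustrated with a minimal sketch of a registry that keeps the model layer separate from the scoring layer. Everything here is hypothetical and chosen for illustration: `Registry`, `MODELS`, `echo_model`, `exact_match`, and `evaluate` are not BioMedArena's API, and the stand-in model simply uppercases its prompt.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical names throughout; this is NOT the BioMedArena API.
ProviderFn = Callable[[str], str]


@dataclass
class Registry:
    """Maps a provider name to a callable adapter."""
    providers: Dict[str, ProviderFn] = field(default_factory=dict)

    def register(self, name: str) -> Callable[[ProviderFn], ProviderFn]:
        def decorator(fn: ProviderFn) -> ProviderFn:
            if name in self.providers:
                raise ValueError(f"provider {name!r} already registered")
            self.providers[name] = fn
            return fn
        return decorator


MODELS = Registry()


@MODELS.register("echo-model")
def echo_model(prompt: str) -> str:
    # Stand-in backbone: a real adapter would call a model endpoint here.
    return prompt.strip().upper()


def exact_match(prediction: str, reference: str) -> float:
    # Scoring layer, independent of which model produced the prediction.
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


def evaluate(model_name: str, tasks: List[Tuple[str, str]]) -> float:
    # Looks the model up by name, so swapping backbones needs no harness changes.
    model = MODELS.providers[model_name]
    scores = [exact_match(model(question), answer) for question, answer in tasks]
    return sum(scores) / len(scores)
```

Under these assumptions, adding a backbone means writing one decorated function, and the same `evaluate` call then scores it on the same tasks as every other registered model.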
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Biomedical Intelligence Evaluation | BixBench 205 (Evaluation) | Accuracy | 85.9 | 25 |
| Laboratory Science Knowledge Evaluation | LAB-Bench 2 821 (Evaluation) | Accuracy | 82.3 | 25 |
| High-Level Expert Knowledge Evaluation | HLE Gold 149 | Accuracy (Bio) | 55.1 | 25 |
| Chemical Reasoning | SuperChem 500 (Total) | Accuracy (Text) | 72.8 | 15 |
| Medical Knowledge Evaluation | Medbullets op4 4-option 308 | Accuracy | 92.2 | 13 |
| Medical Question Answering | MedXpertQA Expert 2450 | Accuracy | 72 | 13 |
| Protein Language Modeling Evaluation | ProteinLM Bench 944 | Accuracy | 77 | 13 |
| Medical Question Answering | HealthBench Hard 1000 | Accuracy | 86 | 12 |