Baichuan-M2: Scaling Medical Capability with Large Verifier System
About
As large language models (LLMs) advance in conversational and reasoning capabilities, their practical application in healthcare has become a critical research focus. However, there is a notable gap between the performance of medical LLMs on static benchmarks such as USMLE and their utility in real-world clinical decision-making. This discrepancy arises because traditional exams fail to capture the dynamic, interactive nature of medical consultations. To address this challenge, we introduce a novel dynamic verification framework that moves beyond static answer verifiers, establishing a large-scale, high-fidelity interactive reinforcement learning system. Our framework comprises two key components: a Patient Simulator that creates realistic clinical environments using de-identified medical records, and a Clinical Rubrics Generator that dynamically produces multi-dimensional evaluation metrics. Building on this foundation, we develop Baichuan-M2, a 32B-parameter medically augmented reasoning model trained through a multi-stage reinforcement learning strategy with an improved Group Relative Policy Optimization (GRPO) algorithm. Evaluated on HealthBench, Baichuan-M2 outperforms all other open-source models and most advanced closed-source counterparts, achieving a score above 32 on the challenging HealthBench Hard benchmark, a threshold previously exceeded only by GPT-5. Our work demonstrates that a robust dynamic verifier system is essential for aligning LLM capabilities with practical clinical applications, establishing a new Pareto front in the performance-parameter trade-off for medical AI deployment.
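The training loop described above combines rubric-based rewards with GRPO, which scores each sampled response relative to its sampling group rather than against a learned value function. The following is a minimal sketch of those two pieces; the rubric dimension names, weights, and scores are hypothetical illustrations, not the paper's actual rubrics.

```python
from statistics import mean, stdev

def rubric_reward(scores, weights):
    """Aggregate multi-dimensional rubric scores into a scalar
    reward via a weighted mean (hypothetical aggregation scheme)."""
    total_w = sum(weights.values())
    return sum(scores[k] * weights[k] for k in weights) / total_w

def group_relative_advantages(rewards):
    """GRPO-style advantages: each response in a sampled group is
    scored relative to the group mean, normalized by the group's
    standard deviation."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 1.0
    sigma = sigma or 1.0  # guard against identical rewards
    return [(r - mu) / sigma for r in rewards]

# Example: four sampled consultations scored on hypothetical rubric axes.
weights = {"accuracy": 0.5, "safety": 0.3, "communication": 0.2}
groups = [
    {"accuracy": 0.9, "safety": 1.0, "communication": 0.7},
    {"accuracy": 0.6, "safety": 0.8, "communication": 0.9},
    {"accuracy": 0.4, "safety": 0.5, "communication": 0.6},
    {"accuracy": 0.8, "safety": 0.9, "communication": 0.8},
]
rewards = [rubric_reward(s, weights) for s in groups]
advantages = group_relative_advantages(rewards)
```

Responses with above-average rubric rewards receive positive advantages and are reinforced; the normalization keeps gradient scale comparable across groups with very different score spreads.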
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Medical Question Answering | MedMCQA (test) | Accuracy | 68.49 | 134 |
| Multiple-choice Question Answering | MMLU-Pro | Overall Accuracy | 50.56 | 116 |
| Question Answering | MedQA-USMLE (test) | Accuracy | 84.68 | 101 |
| Natural Language Inference | MedNLI (test) | Accuracy | 60.6 | 89 |
| Question Answering | PubMedQA (test) | Accuracy | 79.2 | 81 |
| Medical Question Answering | MedExpQA | Accuracy (English) | 83.04 | 61 |
| Natural Language Inference | BioNLI | Accuracy (Chinese) | 62.71 | 56 |
| Multilingual Multiple-Choice Question Answering | HeadQA 1.0 (test) | Chinese Accuracy | 84.76 | 56 |
| Creative Writing | Arena-Hard Creative Writing v2 | Score | 69.2 | 25 |
| Medical Calculation and Tool Use | MedMCP-Calc Neurology & Psychiatry | CS | 36.02 | 25 |