The Forgotten Shield: Safety Grafting in Parameter-Space for Medical MLLMs
About
Medical Multimodal Large Language Models (Medical MLLMs) have made remarkable progress on specialized medical tasks, but research into their safety has lagged behind, posing risks for real-world deployment. In this paper, we first establish a multidimensional evaluation framework to systematically benchmark the safety of current state-of-the-art (SOTA) Medical MLLMs. Our empirical analysis reveals pervasive vulnerabilities in existing models across both general and medical-specific safety dimensions, most notably their fragility against cross-modality jailbreak attacks. We further find that medical fine-tuning frequently induces catastrophic forgetting of the model's original safety alignment. To address this, we propose a novel "Parameter-Space Intervention" approach for efficient safety re-alignment. The method extracts intrinsic safety knowledge representations from the original base models and injects them into the target model concurrently with the construction of its medical capabilities. We also design a fine-grained parameter search algorithm to achieve an optimal trade-off between safety and medical performance. Experimental results show that our approach significantly strengthens the safety guardrails of Medical MLLMs without relying on additional domain-specific safety data, while minimizing degradation of core medical performance.
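The abstract does not say exactly how safety knowledge is extracted from the base model or how it is injected. One common way to make a parameter-space intervention concrete is task arithmetic: take the weight difference between a safety-aligned checkpoint and its unaligned counterpart as a "safety vector", then add a scaled copy of it to the medical model. The sketch below assumes that reading. The names `safety_vector`, `graft`, and `search_alpha`, the single global coefficient `alpha`, the medical-score floor, and the toy evaluators are all hypothetical illustrations, not the authors' method. It also simplifies the paper's setup in two ways: it grafts after medical fine-tuning rather than concurrently, and it replaces the fine-grained parameter search with a coarse grid over one scalar.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical representation: parameter name -> flattened list of weights.
Params = Dict[str, List[float]]


def safety_vector(aligned: Params, unaligned: Params) -> Params:
    """Weight delta between a safety-aligned checkpoint and its unaligned counterpart.

    Assumption: this delta approximates the safety knowledge that alignment
    added to the general-purpose base model.
    """
    return {k: [a - u for a, u in zip(aligned[k], unaligned[k])] for k in aligned}


def graft(target: Params, vec: Params, alpha: float) -> Params:
    """Return target + alpha * vec (applied post hoc in this simplified sketch)."""
    return {k: [t + alpha * v for t, v in zip(target[k], vec[k])] for k in target}


def search_alpha(
    target: Params,
    vec: Params,
    alphas: Sequence[float],
    safety_fn: Callable[[Params], float],
    medical_fn: Callable[[Params], float],
    min_medical: float,
) -> Optional[Tuple[float, float, float]]:
    """Coarse grid search: the safest alpha whose medical score stays at or above a floor.

    Returns (alpha, safety, medical), or None if no alpha meets the floor.
    """
    best: Optional[Tuple[float, float, float]] = None
    for alpha in alphas:
        candidate = graft(target, vec, alpha)
        medical = medical_fn(candidate)
        if medical < min_medical:
            continue  # too much medical degradation; reject this alpha
        safety = safety_fn(candidate)
        if best is None or safety > best[1]:
            best = (alpha, safety, medical)
    return best


if __name__ == "__main__":
    # Toy stand-ins for real checkpoints and benchmark evaluators (illustrative only).
    aligned = {"w": [1.0, 2.0]}
    unaligned = {"w": [0.5, 1.0]}
    medical_model = {"w": [0.0, 0.0]}
    vec = safety_vector(aligned, unaligned)
    result = search_alpha(
        medical_model,
        vec,
        alphas=[0.0, 0.5, 1.0, 2.0],
        safety_fn=lambda p: sum(p["w"]),         # pretend: a larger graft is safer
        medical_fn=lambda p: 3.0 - sum(p["w"]),  # pretend: a larger graft costs medical skill
        min_medical=1.5,
    )
    print(result)  # (1.0, 1.5, 1.5)
```

In practice the evaluators would be runs over safety and medical benchmarks, and a finer-grained search might assign a separate coefficient per layer or module; the single-scalar grid here only illustrates the safety-versus-utility trade-off.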
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Question Answering | PubMedQA | Accuracy | 75.6 | 145 |
| Medical Visual Question Answering | VQA-RAD | Accuracy | 66.3 | 106 |
| Question Answering | MedQA USMLE | Accuracy | 62.18 | 18 |
| Question Answering | Medbullets-4 | Accuracy | 60.06 | 15 |
| Safety Evaluation | CATQA Direct | Safety Score (1-ASR) | 1 | 8 |
| Safety Evaluation | CATQA FigStep | Safety Score | 100 | 8 |
| Safety Evaluation | HEx-PHI-FigStep | Safety Score (1-ASR) | 1 | 8 |
| Medical Question Answering | Medbullets op5 | Accuracy | 52.27 | 8 |
| Medical Safety Evaluation | MedSafetyBench Direct | Safety Score | 98 | 8 |
| Medical Safety Evaluation | MedSafetyBench FigStep | Safety Score (1-ASR) | 0.9978 | 8 |
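Several rows report a Safety Score defined as 1 - ASR, where ASR is the attack success rate: the fraction of adversarial prompts for which the model produced a harmful response. The rows mix a 0-1 scale (1, 0.9978) with what appears to be a 0-100 scale (100, 98), so the minimal sketch below offers an optional percentage rescaling. It assumes each prompt has already received a binary "attack succeeded" judgment; how that judgment is produced (for example, by a judge model) is not stated here, and the function names are hypothetical.

```python
from typing import Sequence


def attack_success_rate(jailbroken: Sequence[bool]) -> float:
    """Fraction of attack prompts judged to have elicited a harmful response."""
    if len(jailbroken) == 0:
        raise ValueError("need at least one judged response")
    return sum(jailbroken) / len(jailbroken)


def safety_score(jailbroken: Sequence[bool], percent: bool = False) -> float:
    """Safety Score = 1 - ASR, optionally rescaled to the 0-100 range."""
    score = 1.0 - attack_success_rate(jailbroken)
    return 100.0 * score if percent else score


if __name__ == "__main__":
    # Hypothetical judgments: True means the attack succeeded on that prompt.
    judgments = [False] * 9 + [True]
    print(safety_score(judgments))
    print(safety_score(judgments, percent=True))
```

Under this definition, a Safety Score of 1 (or 100) means no attack prompt in the benchmark succeeded.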