CLIN-LLM: A Safety-Constrained Hybrid Framework for Clinical Diagnosis and Treatment Generation

About

Accurate symptom-to-disease classification and clinically grounded treatment recommendations remain challenging, particularly in heterogeneous patient settings with high diagnostic risk. Existing large language model (LLM)-based systems often lack medical grounding and fail to quantify uncertainty, resulting in unsafe outputs. We propose CLIN-LLM, a safety-constrained hybrid pipeline that integrates multimodal patient encoding, uncertainty-calibrated disease classification, and retrieval-augmented treatment generation. The framework fine-tunes BioBERT on 1,200 clinical cases from the Symptom2Disease dataset and incorporates Focal Loss with Monte Carlo Dropout to enable confidence-aware predictions from free-text symptoms and structured vitals. Low-certainty cases (18%) are automatically flagged for expert review, ensuring human oversight. For treatment generation, CLIN-LLM employs Biomedical Sentence-BERT to retrieve top-k relevant dialogues from the 260,000-sample MedDialog corpus. The retrieved evidence and patient context are fed into a fine-tuned FLAN-T5 model for personalized treatment generation, followed by post-processing with RxNorm for antibiotic stewardship and drug-drug interaction (DDI) screening. CLIN-LLM achieves 98% accuracy and F1 score, outperforming ClinicalBERT by 7.1% (p < 0.001), with 78% top-5 retrieval precision and a clinician-rated validity of 4.2 out of 5. Unsafe antibiotic suggestions are reduced by 67% compared to GPT-5. These results demonstrate CLIN-LLM's robustness, interpretability, and clinical safety alignment. The proposed system provides a deployable, human-in-the-loop decision support framework for resource-limited healthcare environments. Future work includes integrating imaging and lab data, multilingual extensions, and clinical trial validation.
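The confidence-aware step described in the abstract (Focal Loss during training, Monte Carlo Dropout at inference, and flagging of low-certainty cases for expert review) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the `gamma` and `alpha` defaults, and the `min_confidence` threshold of 0.8 are assumptions made for illustration. The abstract reports only that 18% of cases were flagged, not the rule used to flag them. The inputs are softmax outputs from several stochastic forward passes of any classifier with dropout left active.

```python
import math
from statistics import fmean
from typing import Sequence, Tuple


def focal_loss(p_true: float, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss for one example: -alpha * (1 - p_t)^gamma * log(p_t).

    p_true is the predicted probability of the correct class.
    gamma and alpha are illustrative defaults, not values from the paper.
    """
    p = min(max(p_true, 1e-12), 1.0)  # clamp to avoid log(0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)


def mc_dropout_summary(
    samples: Sequence[Sequence[float]],
) -> Tuple[int, float, float]:
    """Average T stochastic softmax outputs from dropout-enabled passes.

    Returns (predicted label, mean probability of that label,
    entropy of the mean distribution).
    """
    n_classes = len(samples[0])
    mean = [fmean(s[c] for s in samples) for c in range(n_classes)]
    label = max(range(n_classes), key=mean.__getitem__)
    entropy = -sum(p * math.log(p) for p in mean if p > 0.0)
    return label, mean[label], entropy


def needs_expert_review(
    samples: Sequence[Sequence[float]], min_confidence: float = 0.8
) -> bool:
    """Flag a case when the averaged confidence is below a threshold.

    The threshold is a hypothetical choice for this sketch.
    """
    _, confidence, _ = mc_dropout_summary(samples)
    return confidence < min_confidence


# Three stochastic passes that agree, and three that disagree.
confident = [[0.95, 0.03, 0.02], [0.90, 0.06, 0.04], [0.97, 0.02, 0.01]]
uncertain = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.4, 0.5, 0.1]]
print(needs_expert_review(confident), needs_expert_review(uncertain))
```

In this sketch, passes that disagree lower the averaged confidence and send the case to a reviewer, while the focal term `(1 - p_t)^gamma` down-weights examples the classifier already handles well.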

Md. Mehedi Hasan, Md. Abir Hossain, Farman Hossain Sayem, Bikash Kumar Paul, Ziaur Rahman, Mohammad Shorif Uddin, Rafid Mostafiz • 2025

Related benchmarks

Task	Dataset	Result
Medical Text Classification	Symptom2Disease	Accuracy: 98	14
Diagnosis	Symptom2Disease (test)	Diagnosis Accuracy: 98	5
Treatment Recommendation	Symptom2Disease (test)	Top-5 Treatment Precision: 78	5
Clinical diagnosis and treatment reasoning	Symptom2Disease, MedDialog	F1 Score: 98	1
Adverse reaction detection	Dataset-1, Dataset-2 ADR Twitter	--	1
Clinical text classification	Custom clinical dataset	--	1
Disease Classification	Symptom2Disease	--	1
Symptom-based prediction	Custom Symptom Dataset	--	1

