CLIN-LLM: A Safety-Constrained Hybrid Framework for Clinical Diagnosis and Treatment Generation

About

Accurate symptom-to-disease classification and clinically grounded treatment recommendations remain challenging, particularly in heterogeneous patient settings with high diagnostic risk. Existing large language model (LLM)-based systems often lack medical grounding and fail to quantify uncertainty, resulting in unsafe outputs. We propose CLIN-LLM, a safety-constrained hybrid pipeline that integrates multimodal patient encoding, uncertainty-calibrated disease classification, and retrieval-augmented treatment generation. The framework fine-tunes BioBERT on 1,200 clinical cases from the Symptom2Disease dataset and incorporates Focal Loss with Monte Carlo Dropout to enable confidence-aware predictions from free-text symptoms and structured vitals. Low-certainty cases (18%) are automatically flagged for expert review, ensuring human oversight. For treatment generation, CLIN-LLM employs Biomedical Sentence-BERT to retrieve top-k relevant dialogues from the 260,000-sample MedDialog corpus. The retrieved evidence and patient context are fed into a fine-tuned FLAN-T5 model for personalized treatment generation, followed by post-processing with RxNorm for antibiotic stewardship and drug-drug interaction (DDI) screening. CLIN-LLM achieves 98% accuracy and F1 score, outperforming ClinicalBERT by 7.1% (p < 0.001), with 78% top-5 retrieval precision and a clinician-rated validity of 4.2 out of 5. Unsafe antibiotic suggestions are reduced by 67% compared to GPT-5. These results demonstrate CLIN-LLM's robustness, interpretability, and clinical safety alignment. The proposed system provides a deployable, human-in-the-loop decision support framework for resource-limited healthcare environments. Future work includes integrating imaging and lab data, multilingual extensions, and clinical trial validation.
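The confidence-aware classification step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `gamma` and `alpha` defaults, and the 0.8 confidence threshold are all assumptions made for illustration, and the per-pass probability vectors stand in for the outputs of a BioBERT classifier run several times with dropout left active. The sketch shows the standard focal loss for a single example, and how averaging several stochastic forward passes yields a confidence score that can route a case to expert review.

```python
import math
from statistics import mean


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one example: -alpha * (1 - p_t)**gamma * log(p_t).

    probs  -- predicted class probabilities for the example
    target -- index of the true class
    gamma, alpha -- illustrative defaults, not values from the paper
    """
    p_t = min(max(probs[target], 1e-12), 1.0)  # clamp to avoid log(0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


def mc_dropout_summary(passes):
    """Summarize T stochastic forward passes (dropout active at inference).

    passes -- list of T probability vectors for the same input
    Returns (predicted_label, mean_confidence, predictive_entropy).
    """
    n_classes = len(passes[0])
    mean_probs = [mean(p[c] for p in passes) for c in range(n_classes)]
    entropy = -sum(p * math.log(p) for p in mean_probs if p > 0)
    label = max(range(n_classes), key=mean_probs.__getitem__)
    return label, mean_probs[label], entropy


def needs_expert_review(passes, min_confidence=0.8):
    """Flag a case when the averaged confidence falls below a threshold.

    min_confidence is a hypothetical cut-off chosen for this sketch.
    """
    _, confidence, _ = mc_dropout_summary(passes)
    return confidence < min_confidence


# Hypothetical outputs of three dropout-enabled passes for two patients.
confident_case = [[0.95, 0.03, 0.02], [0.90, 0.06, 0.04], [0.97, 0.02, 0.01]]
uncertain_case = [[0.60, 0.30, 0.10], [0.30, 0.60, 0.10], [0.50, 0.40, 0.10]]

print(needs_expert_review(confident_case))  # low disagreement across passes
print(needs_expert_review(uncertain_case))  # passes disagree on the class
```

With `gamma=0` the focal loss reduces to ordinary cross-entropy; larger values down-weight examples the model already classifies confidently, which is why it is often paired with imbalanced disease labels.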

Md. Mehedi Hasan, Md. Abir Hossain, Farman Hossain Sayem, Bikash Kumar Paul, Ziaur Rahman, Mohammad Shorif Uddin, Rafid Mostafiz • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Medical Text Classification | Symptom2Disease | Accuracy: 98% | 14 |
| Diagnosis | Symptom2Disease (test) | Diagnosis Accuracy: 98% | 5 |
| Treatment Recommendation | Symptom2Disease (test) | Top-5 Treatment Precision: 78% | 5 |
| Clinical diagnosis and treatment reasoning | Symptom2Disease, MedDialog | F1 Score: 98% | 1 |
| Adverse reaction detection | Dataset-1, Dataset-2 ADR Twitter | -- | 1 |
| Clinical text classification | Custom clinical dataset | -- | 1 |
| Disease Classification | Symptom2Disease | -- | 1 |
| Symptom-based prediction | Custom Symptom Dataset | -- | 1 |
