Reliable Multilingual Orthopedic Decision Support from Clinical Narratives: Language-Aware Adaptation and Verification-Guided Deferral
About
Multilingual orthopedic decision support remains challenging in low-resource healthcare settings, where clinical narratives contain specialized terminology, mixed scripts, incomplete evidence, label imbalance and language-dependent documentation patterns. This article presents a reliability-oriented framework for classifying free-text orthopedic notes in English, Hindi and Punjabi. We compare task-aligned multilingual transformer encoders, a task-fine-tuned DistilBERT baseline, zero-shot instruction-tuned large language models (LLMs) and a domain-adaptive encoder, IndicBERT-HPA. IndicBERT-HPA augments IndicBERT with language-aware orthopedic adapter heads to support clinically relevant multilingual representation learning. Evaluation extends beyond aggregate accuracy to per-class performance, ROC-AUC, AUPRC, expected calibration error, cross-language stability and robustness under controlled balanced and natural-prevalence distributions. The evaluated zero-shot LLMs remain substantially less effective than task-adapted encoders for closed-set classification, with language-dependent instability. Under natural clinical prevalence, IndicBERT-HPA achieves the strongest overall performance, reaching an averaged Macro-F1 of 0.8792, Macro-AUROC of 0.894 and AUPRC of 0.902. We further implement a deterministic selective-verification layer combining confidence gating, evidence-consistency checking and language-risk screening. On a randomly selected held-out 5,000-record subset, it achieves 84.4% selective accuracy and 0.76 selective Macro-F1 at 72.3% coverage, compared with 71.5% accuracy and 0.65 Macro-F1 for accept-all prediction. These results support reliability-oriented multilingual clinical decision support with explicit deferral.
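The selective-verification metrics quoted above (coverage, selective accuracy, accept-all accuracy) can be illustrated with a minimal sketch. The abstract does not specify how the three checks are implemented, so everything below is an assumption for illustration only: the `Prediction` fields, the `accept` rule, the 0.8 confidence threshold, the class labels and the toy records are all hypothetical and are not the authors' code or data.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Prediction:
    label: str            # predicted class (hypothetical label names below)
    confidence: float     # top-class probability from the classifier
    evidence_ok: bool     # assumed output of an evidence-consistency check
    language_risk: bool   # assumed flag from a language-risk screen


def accept(pred: Prediction, threshold: float = 0.8) -> bool:
    """Deterministic gate: accept only if all three checks pass.

    The threshold value is an illustrative assumption, not from the paper.
    """
    return (
        pred.confidence >= threshold
        and pred.evidence_ok
        and not pred.language_risk
    )


def selective_metrics(
    preds: List[Prediction], gold: List[str], threshold: float = 0.8
) -> Tuple[float, Optional[float]]:
    """Return (coverage, selective accuracy) over the accepted predictions."""
    kept = [(p, g) for p, g in zip(preds, gold) if accept(p, threshold)]
    coverage = len(kept) / len(preds) if preds else 0.0
    if not kept:
        return coverage, None  # everything was deferred
    accuracy = sum(p.label == g for p, g in kept) / len(kept)
    return coverage, accuracy


# Toy records, invented purely to exercise the functions.
preds = [
    Prediction("fracture", 0.95, True, False),   # accepted
    Prediction("arthritis", 0.55, True, False),  # deferred: low confidence
    Prediction("arthritis", 0.90, False, False), # deferred: evidence mismatch
    Prediction("arthritis", 0.85, True, False),  # accepted
]
gold = ["fracture", "fracture", "arthritis", "arthritis"]

coverage, selective_acc = selective_metrics(preds, gold)
accept_all_acc = sum(p.label == g for p, g in zip(preds, gold)) / len(preds)
print(coverage, selective_acc, accept_all_acc)
```

On this toy input, half of the records are accepted and the accepted half is classified more accurately than the full set. That is the same trade-off the abstract reports at scale: accuracy on retained cases rises while coverage falls below 100%.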
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Multilingual Classification | Natural-prevalence English (test) | F1-Macro | 86.05 | 5 |
| Multilingual Classification | Natural-prevalence Hindi (test) | F1-Macro | 88.76 | 5 |
| Multilingual Classification | Natural-prevalence Punjabi (test) | F1-Macro | 88.94 | 5 |
| Multilingual diagnostic classification | Natural-prevalence clinical narrative Average of EN, HI, PA (test) | F1-Macro | 87.92 | 5 |
| Orthopedic Classification | Orthopedic diagnostic dataset Averaged across EN, HI, PA (test) | F1-Macro | 81.77 | 5 |
| Orthopedic Diagnostic Classification | Clinical Narratives English (controlled setting) | F1-Macro | 80.78 | 5 |
| Orthopedic Diagnostic Classification | Clinical Narratives Hindi (controlled setting) | F1-Macro | 84.41 | 5 |
| Orthopedic Diagnostic Classification | Clinical Narratives Punjabi (controlled setting) | F1-Macro | 80.11 | 5 |
| Selective Verification | Natural-Prevalence Verification 5,000-record held-out (English) | Selective Accuracy | 85.4 | 1 |
| Selective Verification | Natural-Prevalence Verification 5,000-record held-out (Hindi) | Selective Accuracy | 84.6 | 1 |
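
The two averaged rows in the table are consistent with an unweighted mean of the per-language F1-Macro scores listed alongside them. The short check below uses only values from the table. Treating the average as an unweighted mean is an inference from the numbers, not something the source states, and the variable names are illustrative.

```python
from statistics import mean

# Per-language F1-Macro scores copied from the benchmark table (EN, HI, PA).
natural_prevalence = [86.05, 88.76, 88.94]
controlled_setting = [80.78, 84.41, 80.11]

# Unweighted mean across the three languages, rounded to two decimals
# to match the precision used in the table.
natural_avg = round(mean(natural_prevalence), 2)
controlled_avg = round(mean(controlled_setting), 2)

print(natural_avg, controlled_avg)
```

The results match the reported averages of 87.92 (natural prevalence) and 81.77 (controlled setting).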