CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics

About

Inpatient clinical reasoning is a sequential decision under partial observability: the clinician sees the admission so far and must choose the next action, whose downstream consequences are not yet visible. Existing clinical-LLM evaluations and RL reward signals collapse this into closed-form retrieval, clinical journey leakage, or unanchored LLM-as-judge scoring. We introduce CLR-voyance, a framework that reformulates inpatient reasoning as a Partially Observable Markov Decision Process (POMDP) and supervises it with rewards that are simultaneously outcome-grounded and clinician-validated. We instantiate the formulation as CLR-POMDP, which partitions successful patient journeys into a policy-visible past and an oracle-only future. Using the past information, an oracle LLM generates a case-specific query-answer pair and the first adaptive rubric for clinical reasoning, which is verifiable against the future of the patient journey. These rubrics are used for both post-training and evaluation of models for inpatient clinical reasoning. We post-train Qwen3-8B and MedGemma-4B with GRPO followed by model merging, yielding state-of-the-art inpatient clinical reasoning while retaining generalist capabilities. CLR-voyance-8B achieves 84.91% on CLR-POMDP, ahead of frontier medical reasoning models like GPT-5 (77.83%) and MedGemma-27B (66.66%), and achieves comparable or better performance on existing medical benchmarks. To ensure a clinically meaningful setting, we conduct a large-scale clinician alignment study, in which physicians curate per-case rubrics, grade candidate responses, and provide blinded pairwise preferences over model reasoning. This study provides insights on clinical LLM-as-a-judge and clinical preference-model selection, which can inform the community at large. CLR-voyance has been deployed for 6+ months at a partner public hospital, drafting thousands of reasoning-heavy inpatient notes.
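
As a rough illustration of the formulation described above, the sketch below splits a patient journey into a policy-visible past and an oracle-only future, scores a response against a weighted rubric, and computes GRPO-style group-relative advantages. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `RubricItem`, `split_journey`, `rubric_reward`, and `group_relative_advantages`, the weighted-fraction scoring rule, and the list-of-events journey representation are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class RubricItem:
    criterion: str  # hypothetical: text of one checkable criterion
    weight: float   # hypothetical: relative importance of the criterion


def split_journey(events, cutoff):
    """Partition a journey into (policy-visible past, oracle-only future)."""
    return events[:cutoff], events[cutoff:]


def rubric_reward(rubric, satisfied):
    """Weighted fraction of rubric items judged satisfied, in [0, 1].

    Assumption: `satisfied` holds one boolean per rubric item, produced by
    some external judge (clinician or LLM) that is not modeled here.
    """
    total = sum(item.weight for item in rubric)
    earned = sum(item.weight for item, ok in zip(rubric, satisfied) if ok)
    return earned / total if total else 0.0


def group_relative_advantages(rewards):
    """Standardize rewards within one group of sampled responses (GRPO-style)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    # Small epsilon avoids division by zero when all rewards are equal.
    return [(r - mu) / (sigma + 1e-8) for r in rewards]
```

In this sketch the policy would only ever receive the first element returned by `split_journey`, while the rubric and its judgments would be derived with access to the second; how the paper actually weights or aggregates rubric items is not stated in the abstract.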

Aishik Nagar, Arun-Kumar Kaliya-Perumal, Yu-Hsuan Han, Andrew Sheng-Han Huang, Kristen Kee, Yushi Cao, Yiming Chen, Hongchao Jiang • 2026

Related benchmarks

Task | Dataset | Result | Rank
Multiple-choice Question Answering | MedMCQA | Accuracy 67 | 42
Open-set diagnostic naming | DDXPlus | Accuracy 48.5 | 15
Medical calculation | MedCalc-Bench | Accuracy 46 | 15
Instruction Following | Mimic-Instr MIMIC-IV | Accuracy 58.9 | 15
LLM-as-a-judge alignment | Clinician Spine cohort (val R1) | -- | 5
LLM-as-a-judge alignment | Clinician Validation Obesity cohort (R1) | -- | 5
Blinded A/B preference | Clinician Validation Spine cohort (R1) | Win Rate 94.2 | 3
Blinded A/B preference | Clinician Obesity cohort R1 (val) | Win Percentage 82.2 | 3