Walking the Tightrope: Disentangling Beneficial and Detrimental Drifts in Non-Stationary Custom-Tuning

About

This paper uncovers a critical yet overlooked phenomenon in multi-modal large language models (MLLMs): detrimental concept drift within chain-of-thought (CoT) reasoning during non-stationary reinforcement fine-tuning (RFT), where reasoning token distributions evolve unpredictably and thereby introduce significant biases in final predictions. To address this, we pioneer a theoretical bridge between concept drift theory and RFT processes by formalizing CoT's autoregressive token streams as non-stationary distributions undergoing arbitrary temporal shifts. Leveraging this framework, we propose a novel counterfactual-aware RFT that systematically decouples beneficial distribution adaptation from harmful concept drift, using LLM experts empowered by concept graphs to generate counterfactual reasoning trajectories. Our solution, Counterfactual Preference Optimization (CPO), enables stable RFT in non-stationary environments, particularly within the medical domain, through custom-tuning with counterfactual-aware preference alignment. Extensive experiments demonstrate superior robustness, generalization, and coordination within RFT. We also contribute CXR-CounterFact (CCF), a large-scale dataset comprising 320,416 meticulously curated counterfactual reasoning trajectories derived from MIMIC-CXR. Our code and data are publicly available.
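The abstract does not state the CPO objective, so the following is only a minimal sketch of what counterfactual-aware preference alignment could look like, assuming a generic DPO-style pairwise loss rather than the paper's actual formulation. The function name `preference_loss`, its parameters, and the default `beta` are hypothetical; the inputs are taken to be summed token log-probabilities of a factual reasoning trajectory (preferred) and a counterfactual one (dispreferred) under the tuned policy and a frozen reference model.

```python
import math


def preference_loss(logp_factual, logp_counterfactual,
                    ref_logp_factual, ref_logp_counterfactual,
                    beta=0.1):
    """Generic DPO-style pairwise loss (an assumption, not the paper's CPO loss).

    Each argument is the summed log-probability of a whole reasoning
    trajectory. The loss is -log(sigmoid(margin)), where the margin measures
    how much more the policy favours the factual trajectory over the
    counterfactual one, relative to the reference model.
    """
    margin = beta * ((logp_factual - ref_logp_factual)
                     - (logp_counterfactual - ref_logp_counterfactual))
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities, for illustration only.
loss_aligned = preference_loss(-12.0, -15.0, -13.0, -13.0)
loss_drifted = preference_loss(-15.0, -12.0, -13.0, -13.0)
print(loss_aligned < loss_drifted)
```

Under this assumed loss, a policy that raises the likelihood of the factual trajectory relative to the counterfactual one is penalized less than a policy that drifts toward the counterfactual, which is the intuition behind separating beneficial adaptation from harmful drift.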

Xiaoyu Yang, Jie Lu, En Yu • 2025

Related benchmarks

Task	Dataset	Result
Diagnostic report generation	MIMIC-CXR 14 (test)	BLEU-4 15.5	27
Multi-label diagnostic classification	MS-CXR-T	Consolidation Score 77.7	21
Radiology Report Generation	MIMIC-CXR (sn)	BLEU-1 42.6	17
Autonomous Driving Robustness Evaluation	CODA-LM 94 (test)	General Perception Score 48.2	10
Multi-label Disease Classification	Open-i	F1 Score 84.4	9
Driving decision-making	BDD-X	BLEU-4 0.355	8
Multi-label Disease Classification	PadChest	Classification Performance Score 82	7
Multi-label Disease Classification	ChestXray 14	Classification Score 81.7	7
Multi-label Disease Classification	CheXDet 10	Classification Score 80.1	7
Autonomous Driving	DriveLM	Description Score 28.8	6

Showing 10 of 14 rows
