
DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization

About

Large language models are increasingly deployed in multi-turn interactive settings where users or environments iteratively provide lightweight feedback. Optimizing such behavior presents a sharp dilemma in practice: online reinforcement learning handles multi-turn dynamics effectively but is prohibitively expensive, because full correction trajectories must be generated at every update, whereas offline supervised fine-tuning (SFT) is efficient but suffers from distribution shift and behavioral collapse. To resolve this, we propose DRIFT (Decoupled Rollouts and Importance-Weighted Fine-Tuning), a framework that operationalizes the theoretical insight that the KL-regularized RL objective is equivalent to importance-weighted supervised learning. DRIFT decouples rollout from optimization in three steps: it samples offline interaction trajectories from a fixed reference policy, derives return-based importance weights, and optimizes the policy via weighted SFT on the resulting dataset. Empirically, we demonstrate that DRIFT matches or exceeds the performance of multi-turn reinforcement learning baselines while retaining the training efficiency and simplicity of standard supervised fine-tuning. Code is available at https://github.com/2020-qqtcg/DRIFT.
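The weighting step described in the abstract can be illustrated with a minimal sketch. It is not taken from the DRIFT repository. It assumes the standard closed form for KL-regularized RL, in which the optimal policy is proportional to the reference policy times exp(return / beta), so that each offline trajectory receives a self-normalized weight proportional to exp(return / beta). The function names, the temperature `beta`, and the toy numbers are all hypothetical; the paper's actual weight definition may differ.

```python
import math

def importance_weights(returns, beta=1.0):
    """Self-normalized weights w_i proportional to exp(R_i / beta).

    Assumption: exponentiated-return weighting, as in the closed-form
    solution of KL-regularized RL. Not confirmed as DRIFT's exact form.
    """
    m = max(returns)  # subtract the max for numerical stability
    exps = [math.exp((r - m) / beta) for r in returns]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sft_loss(log_probs, weights):
    """Weighted negative log-likelihood: -sum_i w_i * log pi(y_i | x_i)."""
    return -sum(w * lp for w, lp in zip(weights, log_probs))

# Hypothetical offline data: one return and one sequence log-probability
# per trajectory sampled from the fixed reference policy.
returns = [1.0, 0.0, 0.5]
log_probs = [-2.0, -1.0, -1.5]

weights = importance_weights(returns, beta=0.5)
loss = weighted_sft_loss(log_probs, weights)
```

In this sketch a smaller `beta` concentrates the weight on high-return trajectories, while a very large `beta` makes the weights nearly uniform and recovers plain SFT on the reference samples.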

Jian Mu, Tianyi Lin, Chengwei Qin, Zhongxiang Dai, Yao Shu • 2026

Related benchmarks

Task | Dataset | Result | Rank
General Reasoning | MMLU-R | -- | 40
General Reasoning | MMLU-P | -- | 24
General Reasoning | GPQA | multi@5 Accuracy: 72.7 | 16
Math Reasoning | MATH 500 | Multi@5 Accuracy: 58.2 | 16
Math Reasoning | ThmQA | Multi@5 Accuracy: 34.3 | 16
Mathematical Reasoning | MATH | Multi-step pass@5 Accuracy: 55.9 | 16
Mathematical Reasoning | Math Benchmarks (MATH, MATH500, ThmQA) | MATH multi@5 Accuracy: 67.6 | 4
Multi-turn reasoning | All-benchmark Average | Average Multi-turn Accuracy (multi@5): 0.683 | 4
General Reasoning | General Benchmarks (MMLU-R, MMLU-P, GPQA) | MMLU-R (multi@5 Acc): 91.2 | 4
