
Breaking Contextual Inertia: Reinforcement Learning with Single-Turn Anchors for Stable Multi-Turn Interaction

About

While LLMs demonstrate strong reasoning capabilities when provided with full information in a single turn, they exhibit substantial vulnerability in multi-turn interactions. Specifically, when information is revealed incrementally or requires updates, models frequently fail to integrate new constraints, leading to a collapse in performance compared to their single-turn baselines. We term the root cause Contextual Inertia: a phenomenon in which models rigidly adhere to previous reasoning traces. Even when users explicitly provide corrections or new data in later turns, the model ignores them, preferring to stay consistent with its previous (incorrect) reasoning path. To address this, we introduce Reinforcement Learning with Single-Turn Anchors (RLSTA), a generalizable training approach designed to stabilize multi-turn interaction across diverse scenarios and domains. RLSTA leverages the model's superior single-turn capabilities as stable internal anchors to provide reward signals. By aligning multi-turn responses with these anchors, RLSTA enables models to break contextual inertia and self-calibrate their reasoning based on the latest information. Experiments show that RLSTA significantly outperforms standard fine-tuning and abstention-based methods. Notably, our method exhibits strong cross-domain generalization (e.g., math to code) and proves effective even without external verifiers, highlighting its potential for general-domain applications. Code is available at https://github.com/Tencent/RLSTA.
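The abstract does not spell out how the anchor-based reward is computed, so the following is only a rough illustration of the general idea, not the authors' implementation. It assumes that the incrementally revealed turns can be merged into one fully specified prompt, that the model's single-turn answer to that prompt serves as the anchor, and that the reward is a simple exact-match agreement. All names here (`consolidate_turns`, `anchor_reward`, `score_rollout`, `toy_model`) are hypothetical.

```python
from typing import Callable, Sequence


def consolidate_turns(turns: Sequence[str]) -> str:
    """Merge incrementally revealed user turns into one single-turn prompt (assumed scheme)."""
    return "\n".join(t.strip() for t in turns if t.strip())


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def anchor_reward(multi_turn_answer: str, anchor_answer: str) -> float:
    """Assumed reward: 1.0 if the multi-turn answer agrees with the anchor, else 0.0."""
    return 1.0 if _normalize(multi_turn_answer) == _normalize(anchor_answer) else 0.0


def score_rollout(
    turns: Sequence[str],
    multi_turn_answer: str,
    answer_single_turn: Callable[[str], str],
) -> float:
    """Score one multi-turn rollout against the model's own single-turn anchor."""
    anchor = answer_single_turn(consolidate_turns(turns))
    return anchor_reward(multi_turn_answer, anchor)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a single-turn model call: sums every integer in the prompt."""
    return str(sum(int(tok) for tok in prompt.split() if tok.lstrip("-").isdigit()))


turns = ["Add 2 and 3", "Also add 5"]
updated = score_rollout(turns, "10", toy_model)  # answer that integrates the later turn
stale = score_rollout(turns, "5", toy_model)     # answer stuck on the first turn only
```

In this toy setup the answer that accounts for the second turn agrees with the anchor and is rewarded, while the answer that clings to the first turn is not. A reward of this kind needs no external verifier, which is consistent with the abstract's claim, though the actual reward used by RLSTA may differ.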

Xingwu Chen, Zhanqiu Zhang, Yiwen Guo, Difan Zou • 2026

Related benchmarks

Task | Dataset | Result | Rank
Code Generation | HumanEval | Accuracy 86 | 115
Multi-task Evaluation | Aggregate (GSM8K, BFCL, Spider, HumanEval) | Average Accuracy 79.4 | 20
Multi-turn response addition | MT-Add | Math Score (MT-Add) 90.3 | 20
Multi-turn response refinement | MT-Refine | Math Score 89.8 | 20
Function Calling / Tool Use | BFCL parallel parallel-multiple Actions | Accuracy 81.8 | 20
Text-to-SQL | Spider no-easy | Accuracy 59.8 | 20
Long-context Summary | Summary Task (test) | Multi-turn Coverage Score 63.2 | 8
Multi-turn reasoning | MT-Add | Math LiC Score 1.001 | 4
