
ICPO: Illocution-Calibrated Policy Optimization for Multi-Turn Conversation

About

Large Language Models (LLMs) in multi-turn conversations often suffer from a "lost-in-conversation" phenomenon, where they struggle to recover from early incorrect assumptions, particularly when users provide ambiguous initial instructions. We find that standard post-training techniques like Reinforcement Learning with Verifiable Rewards (RLVR) exacerbate this issue by rewarding confident, direct answers, thereby inducing overconfidence and discouraging the model from seeking clarification. To address this, we propose Illocution-Calibrated Policy Optimization (ICPO), a novel training framework that sensitizes the model to instruction ambiguity. ICPO augments the training corpus with underspecified prompts and conditions the reward signal on the user's illocutionary intent, rewarding the model for expressing uncertainty or asking for clarification when faced with ambiguity. Experiments demonstrate that ICPO fosters appropriate humility, yielding a substantial average improvement of 75% in multi-turn conversation, while preserving robust performance on single-turn benchmarks. Our work presents a practical path toward more robust and collaborative conversational AI that can better navigate the nuances of human interaction.
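The key mechanism described above is conditioning the reward on whether the prompt was underspecified: under ambiguity, the model is rewarded for seeking clarification rather than answering confidently. The abstract does not give the exact reward formula, so the sketch below is a minimal illustrative toy; the function name, signature, and reward values are all assumptions, not the paper's implementation.

```python
def icpo_reward(prompt_is_ambiguous: bool,
                asked_clarification: bool,
                answer_is_correct: bool) -> float:
    """Toy illocution-calibrated reward.

    Assumed shape (not from the paper): the reward branches on the
    user's illocutionary intent, i.e. whether the prompt actually
    contains enough information to be answered directly.
    """
    if prompt_is_ambiguous:
        # Underspecified prompt: reward expressing uncertainty or
        # asking for clarification; penalize a confident direct answer,
        # which is what plain RLVR would have reinforced.
        return 1.0 if asked_clarification else -1.0
    # Fully specified prompt: fall back to a standard verifiable
    # reward on answer correctness, preserving single-turn behavior.
    return 1.0 if answer_is_correct else 0.0
```

Under this toy scheme, a confident wrong guess on an ambiguous prompt is strictly worse than asking a clarifying question, which is the incentive structure the abstract attributes to ICPO.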

Zhebo Wang, Xiaohu Mu, Zijie Zhou, Mohan Li, Wenpeng Xing, Dezhang Kong, Meng Han • 2026

Related benchmarks

Task | Dataset | Result | Rank
Mathematical Reasoning | Competition-level Math Benchmarks (AIME24, AIME25, AMC23, MATH500, Olympiad, Minerva) | AIME 24 Score: 12.9 | 21
Multi-turn conversation | Multi-turn (Mt.) | Mt. Score: 55.4 | 6
