
From Leaky Thoughts to Private Reasoning: Controlling What LRMs Say to Themselves

About

Large reasoning models (LRMs) produce reasoning traces (RTs) that often contain sensitive information. These leaky thoughts are difficult to control and frequently violate explicit privacy directives. Because RTs can be exposed through prompt injection attacks, this becomes a direct privacy risk to the user. We approach this as a controllability problem: since privacy directives are themselves instructions, improving instruction-following (IF) within the RT provides a direct path to reducing privacy leaks. To this end, we introduce an SFT dataset that teaches models to follow general instructions throughout their reasoning process, and propose Staged Decoding, a simple decoding strategy that decouples RT and answer generation using separate LoRA adapters to maximize IF of each component. We evaluate our approach on six models from two families (1.7B-14B parameters), across two IF benchmarks and two privacy benchmarks. Our method yields substantial improvements, with gains of up to 20.9 points in IF and 51.9 percentage points on privacy benchmarks, though these can come at the cost of task utility due to the trade-off between reasoning performance and IF. Our results show that improving IF in LRMs can significantly enhance privacy, suggesting a promising direction for future privacy-aware LRMs. Our code is available at https://github.com/UKPLab/arxiv2026-controllable-reasoning-models.
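The control flow of Staged Decoding, as the abstract describes it, can be sketched in a few lines: one adapter produces the reasoning trace, then a second adapter produces the answer conditioned on that trace. This is a minimal sketch and not the authors' implementation (see the linked repository for that). The `generate` callable, the adapter names `rt_lora` and `answer_lora`, and the `<think>`/`</think>` delimiters are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical backend signature: (adapter_name, context, stop_string) -> continuation.
# A real backend would activate the named LoRA adapter before generating.
GenerateFn = Callable[[str, str, Optional[str]], str]


@dataclass
class StagedOutput:
    reasoning: str  # reasoning trace (RT) produced in stage 1
    answer: str     # final answer produced in stage 2


def staged_decode(
    prompt: str,
    generate: GenerateFn,
    rt_adapter: str = "rt_lora",          # assumed adapter name
    answer_adapter: str = "answer_lora",  # assumed adapter name
    think_open: str = "<think>",          # assumed RT delimiters
    think_close: str = "</think>",
) -> StagedOutput:
    # Stage 1: the RT adapter generates the reasoning trace,
    # stopping at the closing delimiter.
    rt_context = prompt + think_open
    reasoning = generate(rt_adapter, rt_context, think_close)

    # Stage 2: the answer adapter continues from the prompt plus the
    # completed reasoning trace and generates until end of sequence.
    answer_context = rt_context + reasoning + think_close
    answer = generate(answer_adapter, answer_context, None)

    return StagedOutput(reasoning=reasoning.strip(), answer=answer.strip())
```

Keeping the backend behind a single callable means the two stages can be exercised with a stub before wiring in a real model and adapter-switching logic.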

Haritz Puerto, Haonan Li, Xudong Han, Timothy Baldwin, Iryna Gurevych • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Privacy and Utility Evaluation | PasswordEval | Privacy Score | 93.07 | 36 |
| Privacy and Utility Evaluation | PEEP | Privacy Score | 87.85 | 36 |
| Instruction Following | Math-IF | IF-RT Score | 61.97 | 30 |
| Privacy Evaluation | PasswordEval | Privacy Rate (RT) | 92.34 | 30 |
| Privacy Evaluation | PEEP | Privacy RT | 85.63 | 30 |
| Instruction Following | IFEval | IF-RT Score | 76.26 | 30 |