From Leaky Thoughts to Private Reasoning: Controlling What LRMs Say to Themselves

About

Large reasoning models (LRMs) produce reasoning traces (RTs) that often contain sensitive information. These leaky thoughts are difficult to control and frequently violate explicit privacy directives. Because RTs can be exposed through prompt injection attacks, this becomes a direct privacy risk to the user. We approach this as a controllability problem: since privacy directives are themselves instructions, improving instruction-following (IF) within the RT provides a direct path to reducing privacy leaks. To this end, we introduce an SFT dataset that teaches models to follow general instructions throughout their reasoning process, and propose Staged Decoding, a simple decoding strategy that decouples RT and answer generation using separate LoRA adapters to maximize IF of each component. We evaluate our approach on six models from two families (1.7B-14B parameters), across two IF benchmarks and two privacy benchmarks. Our method yields substantial improvements, with gains of up to 20.9 points in IF and 51.9 percentage points on privacy benchmarks, though these can come at the cost of task utility due to the trade-off between reasoning performance and IF. Our results show that improving IF in LRMs can significantly enhance privacy, suggesting a promising direction for future privacy-aware LRMs. Our code is available at https://github.com/UKPLab/arxiv2026-controllable-reasoning-models.
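The abstract describes Staged Decoding only at a high level, so the following is a minimal sketch of the idea of generating the RT and the answer in two separate stages. It is not the authors' implementation (see the repository linked above for that). The `staged_decode` function, the `Generator` type, the toy generators, and the `</think>` delimiter are all hypothetical names introduced for illustration. Each generator stands in for a base model with one LoRA adapter active.

```python
from typing import Callable, Tuple

# Hypothetical stand-in for "base model + one active LoRA adapter":
# takes the text so far and returns a continuation.
Generator = Callable[[str], str]

# Assumed end-of-reasoning delimiter; real models may use a different marker.
END_OF_THINKING = "</think>"


def staged_decode(
    prompt: str,
    rt_generator: Generator,
    answer_generator: Generator,
) -> Tuple[str, str]:
    """Produce the reasoning trace and the answer in two separate stages."""
    # Stage 1: the RT generator produces the reasoning trace.
    raw = rt_generator(prompt)
    # Keep only the text before the delimiter; anything after it is discarded.
    reasoning_trace = raw.split(END_OF_THINKING, 1)[0]

    # Stage 2: the answer generator continues from prompt + finished trace.
    context = prompt + reasoning_trace + END_OF_THINKING
    answer = answer_generator(context)
    return reasoning_trace, answer


# Toy generators so the sketch runs without loading a model.
def toy_rt_generator(text: str) -> str:
    return "<think>The user wants a greeting; no personal data is needed.</think>ignored"


def toy_answer_generator(text: str) -> str:
    return "Hello!" if text.endswith(END_OF_THINKING) else "??"


trace, answer = staged_decode("Say hi.\n", toy_rt_generator, toy_answer_generator)
```

In a real setup the two callables would presumably wrap the same base model with a different adapter switched on for each stage, which is what lets each stage be tuned for instruction following independently.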

Haritz Puerto, Haonan Li, Xudong Han, Timothy Baldwin, Iryna Gurevych • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Privacy and Utility Evaluation	PasswordEval	Privacy Score	93.07	36
Privacy and Utility Evaluation	PEEP	Privacy Score	87.85	36
Instruction Following	Math-IF	IF-RT Score	61.97	30
Privacy Evaluation	PasswordEval	Privacy Rate (RT)	92.34	30
Privacy Evaluation	PEEP	Privacy RT	85.63	30
Instruction Following	IFEval	IF-RT Score	76.26	30
