Effectively Controlling Reasoning Models through Thinking Intervention

About

Reasoning-enhanced large language models (LLMs) explicitly generate intermediate reasoning steps prior to generating final answers, helping the model excel in complex problem-solving. In this paper, we demonstrate that this emerging generation framework offers a unique opportunity for more fine-grained control over model behavior. We propose Thinking Intervention, a novel paradigm designed to explicitly guide the internal reasoning processes of LLMs by strategically inserting or revising specific thinking tokens. We find that the Thinking Intervention paradigm enhances the capabilities of reasoning models across a wide range of tasks, including instruction following on IFEval and Overthinking, instruction hierarchy on SEP, and safety alignment on XSTest and SorryBench. Our results demonstrate that Thinking Intervention significantly outperforms baseline prompting approaches, achieving up to 6.7% accuracy gains in instruction-following scenarios, 15.4% improvements in reasoning about instruction hierarchies, and a 40.0% increase in refusal rates for unsafe prompts using open-source DeepSeek R1 models. Overall, our work opens a promising new research avenue for controlling reasoning LLMs.
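The core idea, inserting text into the model's reasoning before it continues generating, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a completion-style interface in which the reasoning block opens with a `<think>` tag. The `User:`/`Assistant:` role markers and the function name `build_intervened_prompt` are hypothetical placeholders, and a real chat template would differ.

```python
THINK_OPEN = "<think>"  # assumption: the model opens its reasoning with this tag


def build_intervened_prompt(user_prompt: str, intervention: str) -> str:
    """Return a completion-style prompt whose reasoning block is pre-filled.

    The model is expected to keep generating after the intervention text,
    so the closing reasoning tag is intentionally left off.
    """
    # Hypothetical plain-text role markers; a real chat template would differ.
    return (
        f"User: {user_prompt.strip()}\n"
        f"Assistant: {THINK_OPEN}\n"
        f"{intervention.strip()}\n"
    )


prompt = build_intervened_prompt(
    "Summarize the article in exactly three bullet points.",
    "I must follow the formatting constraint: exactly three bullet points.",
)
print(prompt)
```

Under these assumptions, the returned string would be sent to the model as a raw prefix, so that the model's own reasoning continues from the inserted sentence instead of starting from an empty reasoning block.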

Tong Wu, Chong Xiang, Jiachen T. Wang, G. Edward Suh, Prateek Mittal • 2025

Related benchmarks

Task	Dataset	Result
Safety Evaluation	XSTest Unsafe	False Refusal Rate (FR): 36	84
Safety Evaluation	XSTest Safe	FC: 6	78
Mathematical Reasoning	MATH500	Pass@1: 93.5	77
Safety Evaluation	AdvBench	Reasoning Harmfulness Rate: 0.00	50
Jailbreak Attack	PAIR	Harmful Score: 2	46
Safety Evaluation	XSTest	F1 Score: 74	44
Safety Evaluation	XSTest (combined)	F1 Score: 88	34
Safety Evaluation	Sorrybench	Reasoning Success Rate (FFR): 15.9	32
Expert knowledge QA	GPQA	Pass@1: 60	29
Harmfulness Evaluation	AdvBench	Harmfulness Score: 1.08	28

Showing 10 of 27 rows
