ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment

About

Activation steering, or representation engineering, offers a lightweight approach to aligning large language models (LLMs) by manipulating their internal activations at inference time. However, current methods suffer from two key limitations: (i) the lack of a unified theoretical framework for guiding the design of steering directions, and (ii) an over-reliance on one-step steering that fails to capture complex patterns of activation distributions. In this work, we propose a unified ordinary differential equation (ODE)-based theoretical framework for activation steering in LLM alignment. We show that conventional activation addition can be interpreted as a first-order approximation to the solution of an ODE. Based on this ODE perspective, identifying a steering direction becomes equivalent to designing a barrier function from control theory. Derived from this framework, we introduce ODESteer, an ODE-based steering method guided by barrier functions, which shows empirical advancement in LLM alignment. ODESteer identifies steering directions by defining the barrier function as the log-density ratio between positive and negative activations, and employs it to construct an ODE for multi-step and adaptive steering. Compared to state-of-the-art activation steering methods, ODESteer achieves consistent empirical improvements on diverse LLM alignment benchmarks, including a notable $5.7\%$ improvement on TruthfulQA, $2.5\%$ on UltraFeedback, and $2.4\%$ on RealToxicityPrompts. Our work establishes a principled new view of activation steering in LLM alignment by unifying its theoretical foundations via ODEs and validating it empirically through the proposed ODESteer method.
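
The relationship between one-step activation addition and multi-step ODE steering described above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes, purely for illustration, that positive and negative activations follow isotropic Gaussians, so the log-density ratio and its gradient have closed forms. The function names (`barrier`, `barrier_grad`, `euler_steer`) and all parameter values are hypothetical. A single forward Euler step of da/dt = ∇h(a) reduces to adding a scaled direction to the activation, while several smaller steps re-evaluate the direction as the activation moves.

```python
import numpy as np

# Toy assumption (not from the paper): positive and negative activations are
# isotropic Gaussians N(mu_pos, sigma_pos^2 I) and N(mu_neg, sigma_neg^2 I).

def barrier(a, mu_pos, mu_neg, sigma_pos, sigma_neg):
    """h(a) = log p_pos(a) - log p_neg(a), up to an additive constant."""
    return (-np.sum((a - mu_pos) ** 2) / (2 * sigma_pos ** 2)
            + np.sum((a - mu_neg) ** 2) / (2 * sigma_neg ** 2))

def barrier_grad(a, mu_pos, mu_neg, sigma_pos, sigma_neg):
    """Gradient of h with respect to the activation a."""
    return -(a - mu_pos) / sigma_pos ** 2 + (a - mu_neg) / sigma_neg ** 2

def euler_steer(a0, grad_fn, total_time=1.0, num_steps=1):
    """Integrate da/dt = grad_fn(a) from t=0 to t=total_time with forward Euler.

    With num_steps=1 this is a single update a0 + total_time * grad_fn(a0),
    i.e. the form of conventional activation addition.
    """
    a = np.array(a0, dtype=float)
    dt = total_time / num_steps
    for _ in range(num_steps):
        a = a + dt * grad_fn(a)  # direction is recomputed at every step
    return a

# Hypothetical 2-D example values.
mu_pos, mu_neg = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
sigma_pos, sigma_neg = 1.0, 2.0
params = (mu_pos, mu_neg, sigma_pos, sigma_neg)
grad_fn = lambda a: barrier_grad(a, *params)

a0 = np.array([-1.0, 0.5])
one_step = euler_steer(a0, grad_fn, total_time=1.0, num_steps=1)
multi_step = euler_steer(a0, grad_fn, total_time=1.0, num_steps=10)

print("h before steering :", barrier(a0, *params))
print("h after one step  :", barrier(one_step, *params))
print("h after ten steps :", barrier(multi_step, *params))
```

In this toy setting both variants raise the barrier value relative to the starting activation. The two end points differ because the multi-step variant follows the changing gradient rather than committing to the initial direction. How the density ratio is actually estimated in ODESteer, and which solver it uses, is not specified in the abstract.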

Hongjue Zhao, Haosen Sun, Jiangtao Kong, Xiaochang Li, Qineng Wang, Liwei Jiang, Qi Zhu, Tarek Abdelzaher, Yejin Choi, Manling Li, Huajie Shao • 2026

Related benchmarks

Task	Dataset	Result
Multi-task Language Understanding	MMLU	MMLU Accuracy 79.72	442
Multi-task Language Understanding	MMLU	Accuracy 78.08	353
Multitask Language Understanding	MMLU	Accuracy 78.08	263
Multiple-choice Question Answering	MMLU	Accuracy 60.9	210
Language Understanding	MMLU 5-shot (test)	--	149
Multiple-choice Question Answering	ARC Challenge	Acc 74.5	133
Truthfulness Evaluation	TruthfulQA	T·I Score 71.56	59
Toxicity Mitigation	Toxicity Mitigation Dataset 1000 trials (test)	CLS Toxicity (%) 0.36	58
Language model detoxification	RealToxicityPrompts (test)	Distinct-1 90.6	54
Truthful and Informative Generation	TruthfulQA (test)	True*Info (%) 71.56	44

Showing 10 of 17 rows
