
Activation Steering with a Feedback Controller

About

Controlling the behavior of large language models (LLMs) is fundamental to their safety alignment and reliable deployment. However, existing steering methods are primarily driven by empirical insights and lack theoretical performance guarantees. In this work, we develop a control-theoretic foundation for activation steering by showing that popular steering methods correspond to proportional (P) controllers, with the steering vector serving as the feedback signal. Building on this finding, we propose Proportional-Integral-Derivative (PID) Steering, a principled framework that leverages the full PID controller for activation steering in LLMs. The proportional (P) term aligns activations with target semantic directions, the integral (I) term accumulates errors to enforce persistent corrections across layers, and the derivative (D) term mitigates overshoot by counteracting rapid activation changes. This closed-loop design yields interpretable error dynamics and connects activation steering to classical stability guarantees in control theory. Moreover, PID Steering is lightweight, modular, and readily integrates with state-of-the-art steering methods. Extensive experiments across multiple LLM families and benchmarks demonstrate that PID Steering consistently outperforms existing approaches, achieving more robust and reliable behavioral control. The code is publicly available at https://github.com/dungnvnus/pid-steering
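
To make the roles of the three terms concrete, the following is a minimal sketch of a textbook discrete PID update applied across layers. It is not taken from the authors' repository: the function names `pid_control_signals` and `steer`, the gains `kp`, `ki`, `kd`, and the choice to treat the per-layer error vectors as given inputs are all assumptions made for illustration. In a truly closed-loop setting the error at each layer would presumably be measured from the already steered activations, which this sketch does not model.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def pid_control_signals(
    errors: Sequence[Vector],
    kp: float = 1.0,
    ki: float = 0.0,
    kd: float = 0.0,
) -> List[Vector]:
    """Return one control vector per layer from per-layer error vectors.

    Textbook discrete PID (an illustrative assumption, not the paper's exact rule):
        u_k = kp * e_k + ki * sum_{j <= k} e_j + kd * (e_k - e_{k-1})
    The derivative term is taken as zero at the first layer.
    """
    if not errors:
        return []
    dim = len(errors[0])
    integral = [0.0] * dim              # running sum of errors over layers (I term)
    previous: Optional[Vector] = None   # error at the previous layer (D term)
    controls: List[Vector] = []
    for error in errors:
        if len(error) != dim:
            raise ValueError("all error vectors must have the same dimension")
        integral = [s + e for s, e in zip(integral, error)]
        if previous is None:
            derivative = [0.0] * dim
        else:
            derivative = [e - p for e, p in zip(error, previous)]
        controls.append(
            [kp * e + ki * s + kd * d for e, s, d in zip(error, integral, derivative)]
        )
        previous = list(error)
    return controls


def steer(activation: Vector, control: Vector) -> Vector:
    """Add a control vector to an activation vector."""
    return [a + u for a, u in zip(activation, control)]


# Hypothetical two-layer example with 2-dimensional error vectors.
errors = [[1.0, 0.0], [0.5, 0.0]]
controls = pid_control_signals(errors, kp=1.0, ki=0.5, kd=0.25)
print(controls)  # [[1.5, 0.0], [1.125, 0.0]]
```

With `ki = kd = 0` the control at each layer reduces to `kp` times the error vector, which mirrors the abstract's observation that plain steering-vector addition behaves like a P controller.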

Dung V. Nguyen, Hieu M. Vu, Nhi Y. Pham, Lei Zhang, Tan M. Nguyen • 2025

Related benchmarks

Task | Dataset | Result | Rank
Multi-task Language Understanding | MMLU | MMLU Accuracy: 78.46 | 442
Multi-task Language Understanding | MMLU | Accuracy: 80.16 | 353
Multitask Language Understanding | MMLU | Accuracy: 80.16 | 263
Language Understanding | MMLU 5-shot | -- | 153
Language Understanding | MMLU 5-shot (test) | -- | 149
General Language Understanding | tinyBenchmark | Accuracy (ARC): 72.93 | 81
Truthfulness Evaluation | TruthfulQA | T·I Score: 58.07 | 59
Toxicity Mitigation | Toxicity Mitigation Dataset 1000 trials (test) | CLS Toxicity (%): 0.34 | 58
Truthful and Informative Generation | TruthfulQA (test) | True*Info (%): 58.07 | 44
Toxicity Mitigation | Toxicity prompts | CLS Toxicity (%): 0.7 | 32
Showing 10 of 14 rows
