Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
About
Inference-time LLM alignment methods, particularly activation steering, offer an alternative to fine-tuning by directly modifying activations during generation. Existing methods, however, often rely on non-anticipative interventions that ignore how perturbations propagate through transformer layers and lack online error feedback, resulting in suboptimal, open-loop control. To address this, we show empirically that, despite the nonlinear structure of transformer blocks, layer-wise dynamics across multiple LLM architectures and scales are well approximated by locally linear models. Exploiting this property, we model LLM inference as a linear time-varying dynamical system and adapt the classical linear quadratic regulator to compute feedback controllers from layer-wise Jacobians, steering activations toward desired semantic setpoints in closed loop with minimal computational overhead and no offline training. We also derive theoretical bounds on setpoint tracking error, enabling formal guarantees on steering performance. Using a novel adaptive semantic feature setpoint signal, our method yields robust, fine-grained behavior control across models, scales, and tasks, including state-of-the-art modulation of toxicity, truthfulness, refusal, and arbitrary concepts, surpassing baseline steering methods. Our code is available at: https://github.com/trustworthyrobotics/lqr-activation-steering
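The closed-loop idea in the abstract can be illustrated with a minimal sketch of a finite-horizon, discrete-time linear quadratic regulator on a linear time-varying system. This is not the authors' implementation. It assumes that small random matrices stand in for the layer-wise Jacobians, that the steering input is added directly to the activations (B = I), that the state is expressed as the deviation from the setpoint, and that the cost weights `Q`, `R`, and `Q_final` are arbitrary illustrative choices. The names `lqr_gains` and `rollout` are hypothetical.

```python
import numpy as np


def lqr_gains(A_list, B, Q, R, Q_final):
    """Backward Riccati recursion for e_{k+1} = A_k e_k + B u_k.

    Returns one feedback gain K_k per layer, so that u_k = -K_k e_k.
    """
    P = Q_final
    gains = []
    for A in reversed(A_list):
        # K_k = (R + B^T P_{k+1} B)^{-1} B^T P_{k+1} A_k
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # P_k = Q + A_k^T P_{k+1} (A_k - B K_k)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]  # reorder from first layer to last


def rollout(A_list, B, e0, gains=None):
    """Propagate the deviation through all layers, with or without feedback."""
    e = e0
    for k, A in enumerate(A_list):
        u = -gains[k] @ e if gains is not None else np.zeros(B.shape[1])
        e = A @ e + B @ u
    return e


# Toy setup (all values are assumptions made for illustration only).
rng = np.random.default_rng(0)
dim, n_layers = 4, 6
# Stand-ins for layer-wise Jacobians: identity (residual path) plus a perturbation.
A_list = [np.eye(dim) + 0.1 * rng.standard_normal((dim, dim)) for _ in range(n_layers)]
B = np.eye(dim)          # steering vector added directly to the activations
Q = np.eye(dim)          # penalty on deviation from the setpoint
R = 0.1 * np.eye(dim)    # penalty on intervention size
Q_final = np.eye(dim)    # penalty on deviation at the last layer

gains = lqr_gains(A_list, B, Q, R, Q_final)
e0 = np.ones(dim)        # initial deviation from the setpoint

open_loop_error = np.linalg.norm(rollout(A_list, B, e0))
closed_loop_error = np.linalg.norm(rollout(A_list, B, e0, gains))
print("open-loop final deviation:  ", open_loop_error)
print("closed-loop final deviation:", closed_loop_error)
```

In this toy setting the feedback gains are computed once by a backward pass over the layers and then applied in a forward pass, which mirrors the anticipative, closed-loop structure the abstract contrasts with open-loop steering. Raising `R` relative to `Q` trades tracking accuracy for smaller interventions.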
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Multi-task Language Understanding | MMLU | MMLU Accuracy | 82.82 | 442 |
| Multi-task Language Understanding | MMLU | Accuracy | 75.4 | 353 |
| Multitask Language Understanding | MMLU | Accuracy | 75.4 | 263 |
| Language Understanding | MMLU 5-shot (test) | -- | -- | 149 |
| Truthfulness Evaluation | TruthfulQA | T·I Score | 84.7 | 59 |
| Toxicity Mitigation | Toxicity Mitigation Dataset 1000 trials (test) | CLS Toxicity (%) | 0.04 | 58 |
| Truthful and Informative Generation | TruthfulQA (test) | True*Info (%) | 84.7 | 44 |
| Toxicity Mitigation | Toxicity prompts | CLS Toxicity (%) | 0.12 | 32 |
| Jailbreaking | AdvBench 20% evaluation | ASR | 97.12 | 25 |