Steering Language Models With Activation Engineering
About
Prompt engineering and finetuning aim to maximize language model performance on a given metric (like toxicity reduction). However, these methods do not fully elicit a model's capabilities. To reduce this gap, we introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs. Specifically, we introduce the Activation Addition (ActAdd) technique, which contrasts the intermediate activations on prompt pairs (such as "Love" versus "Hate") to compute a steering vector (Subramani et al. 2022). By tactically adding, e.g., the "Love" − "Hate" steering vector during the forward pass, we achieve SOTA on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and OPT. ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks. ActAdd is lightweight: it requires no machine optimization and works with a single pair of data points, which enables rapid iteration over steering. ActAdd demonstrates the power of activation engineering.
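The core recipe above (record activations on a contrastive prompt pair, take their difference, and add the scaled difference back into the residual stream during a later forward pass) can be sketched with a toy model. This is a hedged illustration, not the paper's implementation: the "model" here is a random embedding table standing in for a transformer layer, and the token ids, dimension, and injection coefficient are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
embed = rng.normal(size=(100, d_model))  # toy "embedding table" (hypothetical)

def hidden_at_layer(token_ids, steering=None):
    """Toy forward pass returning a residual-stream activation at one layer.

    In a real ActAdd setup this would be a hook on a chosen transformer
    block; here we just sum toy embeddings to get a d_model vector.
    """
    h = embed[token_ids].sum(axis=0)
    if steering is not None:
        h = h + steering  # ActAdd: add the steering vector at the injection layer
    return h

# 1. Contrast a single prompt pair to get the steering vector.
h_pos = hidden_at_layer([1, 2])   # stands in for activations on "Love"
h_neg = hidden_at_layer([3, 4])   # stands in for activations on "Hate"
coeff = 4.0                       # injection coefficient, a tunable scalar
steering_vector = coeff * (h_pos - h_neg)

# 2. Re-run the forward pass on a new prompt with the vector added in.
steered = hidden_at_layer([5, 6], steering=steering_vector)
```

No gradients or optimization are involved: the vector comes from one forward pass per prompt in the pair, which is what makes iterating over steering choices (layer, coefficient, prompt pair) cheap.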
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| General Reasoning | MMLU | MMLU Accuracy | 51.6 | 126 |
| Safety Alignment | HarmBench | ASR | 0.00e+0 | 88 |
| Math | GSM8K | Accuracy | 0.8832 | 87 |
| Mathematical Problem Solving | AIME 25 | Accuracy | 50 | 54 |
| Code | HumanEval | HumanEval Accuracy | 77.85 | 50 |
| General Capability Evaluation | tinyBenchmarks | AI2_arc Accuracy | 79 | 48 |
| Language Understanding | MMLU | MMLU Score | 50.9 | 45 |
| Controllability | PolyGuard | PolyGuard | 100 | 40 |
| Controllability | LLM Judge | Controllability Score | 90.38 | 40 |
| Controllability | HarmBench | HarmBench Score | 86.54 | 40 |