Steering Language Models With Activation Engineering

About

Prompt engineering and finetuning aim to maximize language model performance on a given metric (like toxicity reduction). However, these methods do not fully elicit a model's capabilities. To reduce this gap, we introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs. Specifically, we introduce the Activation Addition (ActAdd) technique, which contrasts the intermediate activations on prompt pairs (such as "Love" versus "Hate") to compute a steering vector (Subramani et al. 2022). By tactically adding in e.g. the "Love" - "Hate" steering vector during the forward pass, we achieve SOTA on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and OPT. ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks. ActAdd is lightweight: it does not require any machine optimization and works with a single pair of data points, which enables rapid iteration over steering. ActAdd demonstrates the power of activation engineering.
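The core recipe can be sketched in a few lines: run the model on a contrast pair of prompts, subtract their intermediate activations at a chosen layer to get a steering vector, then add a scaled copy of that vector at the same layer during the steered forward pass. The sketch below uses a toy linear layer as a stand-in for a transformer block; the embeddings, layer, and coefficient value are illustrative assumptions, not the authors' exact setup.

```python
import torch

torch.manual_seed(0)
d_model = 8
layer = torch.nn.Linear(d_model, d_model)  # stand-in for one transformer block

def activations(x):
    """Intermediate activations at the chosen layer (toy stand-in)."""
    return layer(x)

# 1) Run the contrast pair and take the difference of their activations.
emb_love = torch.randn(d_model)  # embedding of "Love" (illustrative)
emb_hate = torch.randn(d_model)  # embedding of "Hate" (illustrative)
steering_vec = activations(emb_love) - activations(emb_hate)

# 2) During the steered forward pass, add c * steering_vec at that layer.
c = 4.0  # injection coefficient; an assumed value, chosen by iteration in practice
def steered_forward(x):
    h = activations(x) + c * steering_vec  # the activation addition itself
    return h  # downstream layers would consume h instead of activations(x)
```

Because no gradients or optimizers are involved, swapping in a different prompt pair or coefficient and re-running costs only forward passes, which is what makes rapid iteration over steering vectors cheap.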

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, Monte MacDiarmid • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| General Reasoning | MMLU | MMLU Accuracy | 51.6 | 126 |
| Safety Alignment | HarmBench | ASR | 0.00e+0 | 88 |
| Math | GSM8K | Accuracy | 0.8832 | 87 |
| Mathematical Problem Solving | AIME 25 | Accuracy | 50 | 54 |
| Code | HumanEval | HumanEval Accuracy | 77.85 | 50 |
| General Capability Evaluation | tinyBenchmarks | AI2_arc Accuracy | 79 | 48 |
| Language Understanding | MMLU | MMLU Score | 50.9 | 45 |
| Controllability | PolyGuard | PolyGuard | 100 | 40 |
| Controllability | LLM Judge | Controllability Score | 90.38 | 40 |
| Controllability | HarmBench | HarmBench Score | 86.54 | 40 |

Showing 10 of 29 rows
