Steering Language Models With Activation Engineering

About

Prompt engineering and finetuning aim to maximize language model performance on a given metric (like toxicity reduction). However, these methods do not fully elicit a model's capabilities. To reduce this gap, we introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs. Specifically, we introduce the Activation Addition (ActAdd) technique, which contrasts the intermediate activations on prompt pairs (such as "Love" versus "Hate") to compute a steering vector (Subramani et al. 2022). By tactically adding in e.g. the "Love" - "Hate" steering vector during the forward pass, we achieve SOTA on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and OPT. ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks. ActAdd is lightweight: it does not require any machine optimization and works with a single pair of data points, which enables rapid iteration over steering. ActAdd demonstrates the power of activation engineering.
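The core recipe can be sketched in a few lines: run the model on a contrast pair of prompts, subtract their intermediate activations at a chosen layer to get a steering vector, then add a scaled copy of that vector at the same layer during the steered forward pass. The sketch below uses a toy linear layer as a stand-in for a transformer block; the embeddings, layer, and coefficient value are illustrative assumptions, not the authors' exact setup.

```python
import torch

torch.manual_seed(0)
d_model = 8
layer = torch.nn.Linear(d_model, d_model)  # stand-in for one transformer block

def activations(x):
    """Intermediate activations at the chosen layer (toy stand-in)."""
    return layer(x)

# 1) Run the contrast pair and take the difference of their activations.
emb_love = torch.randn(d_model)  # embedding of "Love" (illustrative)
emb_hate = torch.randn(d_model)  # embedding of "Hate" (illustrative)
steering_vec = activations(emb_love) - activations(emb_hate)

# 2) During the steered forward pass, add c * steering_vec at that layer.
c = 4.0  # injection coefficient; an assumed value, chosen by iteration in practice
def steered_forward(x):
    h = activations(x) + c * steering_vec  # the activation addition itself
    return h  # downstream layers would consume h instead of activations(x)
```

Because no gradients or optimizers are involved, swapping in a different prompt pair or coefficient and re-running costs only forward passes, which is what makes rapid iteration over steering vectors cheap.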

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, Monte MacDiarmid • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| General Reasoning | MMLU | MMLU Accuracy | 51.6 | 126 |
| Safety Alignment | HarmBench | ASR | 0.00e+0 | 88 |
| Math | GSM8K | Accuracy | 0.8832 | 87 |
| Mathematical Problem Solving | AIME 25 | Accuracy | 50 | 54 |
| Code | HumanEval | HumanEval Accuracy | 77.85 | 50 |
| General Capability Evaluation | tinyBenchmarks | AI2_arc Accuracy | 79 | 48 |
| Language Understanding | MMLU | MMLU Score | 50.9 | 45 |
| Controllability | PolyGuard | PolyGuard | 100 | 40 |
| Controllability | LLM Judge | Controllability Score | 90.38 | 40 |
| Controllability | HarmBench | HarmBench Score | 86.54 | 40 |

Showing 10 of 29 rows
