Steer Like the LLM: Activation Steering that Mimics Prompting

About

Large language models can be steered at inference time through prompting or activation interventions, but activation steering methods often underperform compared to prompt-based approaches. We propose a framework that formulates prompt steering as a form of activation steering and investigates whether distilling successful prompt steering behavior into simpler, interpretable models can close this gap. Our analysis reveals that popular activation steering methods are not faithful to the mechanics of prompt steering, which applies strong interventions on some tokens while barely affecting others. Based on these insights, we introduce Prompt Steering Replacement (PSR) models that estimate token-specific steering coefficients from the activations themselves and are trained to imitate prompt-based interventions. Experiments on three steering benchmarks across multiple language models show that PSR models outperform existing activation steering methods, especially when controlling for high-coherence completions, and also compare favorably to prompting on AxBench and persona steering.
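
The contrast drawn above, between adding one fixed multiple of a steering direction to every token and estimating a separate coefficient per token from that token's activation, can be illustrated with a small sketch. This is not the authors' implementation. The abstract does not specify how PSR models compute coefficients, so the ReLU-of-linear-readout predictor, the squared-error imitation loss, and all function names and toy values below are assumptions chosen only for illustration.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def constant_steer(acts: List[Vector], direction: Vector, alpha: float) -> List[Vector]:
    """Baseline: add the same multiple of `direction` to every token activation."""
    return [[h_i + alpha * d_i for h_i, d_i in zip(h, direction)] for h in acts]


def token_coefficient(h: Vector, w: Vector, b: float) -> float:
    """Hypothetical coefficient predictor (an assumption, not from the paper):
    a ReLU applied to a linear readout of the token's own activation."""
    return max(0.0, dot(w, h) + b)


def token_specific_steer(
    acts: List[Vector], direction: Vector, w: Vector, b: float
) -> List[Vector]:
    """Steer each token by its own coefficient, estimated from its activation."""
    steered = []
    for h in acts:
        alpha_t = token_coefficient(h, w, b)
        steered.append([h_i + alpha_t * d_i for h_i, d_i in zip(h, direction)])
    return steered


def imitation_loss(steered: List[Vector], prompt_acts: List[Vector]) -> float:
    """Assumed training signal: mean squared distance between the steered
    activations and the activations observed under prompt steering."""
    total = 0.0
    for s, p in zip(steered, prompt_acts):
        total += sum((s_i - p_i) ** 2 for s_i, p_i in zip(s, p))
    return total / len(steered)


# Toy data: two tokens with 2-dimensional activations.
acts = [[1.0, 0.0], [-1.0, 0.0]]
direction = [0.0, 1.0]
# Pretend prompting moved the first token strongly and left the second untouched.
prompt_acts = [[1.0, 2.0], [-1.0, 0.0]]

uniform = constant_steer(acts, direction, alpha=1.0)
per_token = token_specific_steer(acts, direction, w=[2.0, 0.0], b=0.0)

print(imitation_loss(uniform, prompt_acts))    # 1.0
print(imitation_loss(per_token, prompt_acts))  # 0.0
```

In this toy setup the constant coefficient over-steers one token and under-steers the other, while the per-token predictor reproduces the prompt-induced shift exactly. The weights are set by hand here; in practice they would have to be learned, and a real model would operate on hidden states of a transformer layer rather than hand-written vectors.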

Geert Heyman, Frederik Vandeputte • 2026

Related benchmarks

Task                   Dataset                        Metric            Result   Rank
Instruction Following  IFEval                         IFEval Accuracy   93.1     836
Trait Alignment        Persona Vectors                TA@Cp             96.4     30
Activation Steering    AxBench Gemma-2-2B layer 20    Steering Score    0.871    18
Activation Steering    AxBench Gemma-2-9B layer 20    Steering Score    1.12     17
