
Steering Awareness: Detecting Activation Steering from Within

About

Activation steering -- adding a vector to a model's residual stream to modify its behavior -- is widely used in safety evaluations as if the model cannot detect the intervention. We test this assumption, introducing steering awareness: a model's ability to infer, during its own forward pass, that a steering vector was injected and what concept it encodes. After fine-tuning, seven instruction-tuned models develop strong steering awareness on held-out concepts; the best reaches 95.5% detection, 71.2% concept identification, and zero false positives on clean inputs. This generalizes to unseen steering vector construction methods when their directions have high cosine similarity to the training distribution but not otherwise, indicating a geometric detector rather than a generic anomaly detector. Surprisingly, detection does not confer resistance; on both factual and safety benchmarks, detection-trained models are consistently more susceptible to steering than their base counterparts. Mechanistically, steering awareness arises not from a localized circuit, but from a distributed transformation that progressively rotates diverse injected vectors into a shared detection direction. Activation steering should therefore not be considered an invisible intervention in safety evaluations.
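As a rough illustration of the intervention described above, the following is a minimal NumPy sketch, not the authors' implementation. It adds a scaled vector to a toy residual stream and then checks how closely the induced shift aligns with the injected direction using cosine similarity. The function names `steer` and `cosine_similarity`, the scale `alpha`, the width `d_model`, and the random data are all assumptions made for illustration.

```python
import numpy as np


def steer(residual, vector, alpha=1.0):
    """Add a scaled steering vector to every token position (hypothetical helper)."""
    return residual + alpha * vector


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
d_model = 64                                # assumed toy hidden size
residual = rng.normal(size=(5, d_model))    # toy activations: (tokens, d_model)
concept = rng.normal(size=d_model)          # stand-in for a concept vector

steered = steer(residual, concept, alpha=4.0)

# The mean shift across tokens recovers the injected direction.
shift = (steered - residual).mean(axis=0)
alignment = cosine_similarity(shift, concept)
print(alignment)
```

In this toy setting the alignment is close to 1.0 because the shift is exactly the scaled injected vector. In a real model the vector would be added at a chosen layer during the forward pass, and whether the model can detect it is the empirical question the paper studies.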

Joshua Fonseca Rivera, David Demitri Africa • 2025

Related benchmarks

Task | Dataset | Result | Rank
Steering Detection | 121 concepts out-of-distribution suites (held-out) | Detection Rate: 95.5 | 14
Steering Success Rate | PopQA 150 questions | Base SR: 33 | 5
