Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention

About

Activation steering has emerged as a promising alternative for controlling language-model behavior at inference time by modifying intermediate representations while keeping model parameters frozen. However, large-scale evaluations such as AxBench show that existing steering methods are often outperformed by simple in-context prompting and generalize poorly to unseen concepts. We hypothesize that these limitations arise from unvalidated simplifying assumptions shared across prior methods, which typically restrict steering interventions to fixed, single-step, position-invariant transforms. We propose FLAS (Flow-based Activation Steering), which learns a general, concept-conditioned velocity field $v_t(h,t,c)$ that transports unsteered activations to steered ones without relying on these assumptions. On AxBench, FLAS is the first learned method to consistently outperform prompting, reaching held-out harmonic means of $1.015$ on Gemma-2-2B-IT and $1.113$ on Gemma-2-9B-IT without per-concept tuning. Analysis of the learned flow shows curved, multi-step, token-varying trajectories, which suggests that previous hypotheses on activation space geometry might be incomplete.
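As a rough illustration of what "transporting" activations along a velocity field can mean, the sketch below integrates $dh/dt = v(h,t,c)$ from $t=0$ to $t=1$ with forward Euler steps. It is not the authors' implementation: `toy_velocity`, `steer`, `num_steps`, and the array shapes are hypothetical names and choices made for this example, and the hand-written field stands in for a learned network.

```python
import numpy as np

def toy_velocity(h, t, c):
    """Hypothetical stand-in for a learned velocity field v(h, t, c).

    h: (num_tokens, d) activations, t: scalar time in [0, 1],
    c: (d,) concept embedding. The field pulls each token's activation
    toward c, and the pull weakens as t approaches 1.
    """
    return (1.0 - t) * (c - h)

def steer(h0, c, velocity=toy_velocity, num_steps=8):
    """Integrate dh/dt = v(h, t, c) from t=0 to t=1 with forward Euler."""
    h = np.array(h0, dtype=float)  # copy so the input is left untouched
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        # The update depends on the current h, so each token row can
        # follow its own trajectory.
        h = h + dt * velocity(h, t, c)
    return h

rng = np.random.default_rng(0)
h0 = rng.normal(size=(4, 3))   # 4 token positions, hidden size 3 (toy sizes)
c = np.ones(3)                 # toy concept embedding
h1 = steer(h0, c)
```

Under these assumptions, a fixed steering vector corresponds to the special case of a constant velocity and `num_steps=1`, which reduces the update to `h0 + v`. A state- and time-dependent field with several steps can instead produce curved paths that differ per token.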

Zehao Jin, Ruixuan Deng, Junran Wang, Xinjie Shen, Chao Zhang • 2026

Related benchmarks

Task              Dataset             Result       Rank
Concept Steering  AxBench (Held-in)   HMean 1.185  25
Concept Steering  AxBench (Held-out)  HMean 1.113  6
