
Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions

About

Recently, steering vectors (SVs) have emerged as an effective and lightweight approach to steer behaviors of large language models (LLMs), among which fine-tuned SVs are more effective than optimization-free ones. However, current approaches to fine-tuned SVs suffer from two limitations. First, they require careful selection of steering factors on a per-SV basis to balance steering effectiveness and generation quality at inference time. Second, they operate as full-sequence SVs (FSSVs), which can sacrifice generation quality regardless of factor selection due to excessive intervention on the model generation process. To address the first limitation, we propose joint training of steering factors and directions, such that post-hoc factor selection is no longer required. Using neural network scaling theory, we find that moderately large initialization sizes and learning rates for steering factors are essential for stability and efficiency of joint training. To tackle the second limitation, we draw inspiration from representation fine-tuning and introduce Prompt-only SV (PrOSV), an SV that intervenes only on a few prompt tokens. Our empirical results show that PrOSV outperforms traditional FSSVs on AxBench when using our joint training scheme. We also find that PrOSV achieves a better tradeoff between general model utility and adversarial robustness than FSSV.
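The difference between a full-sequence SV and a prompt-only SV can be illustrated with a minimal sketch. It assumes the common additive form of steering, in which `factor * direction` is added to a token's hidden state. The function `steer`, the toy dimensions, and the choice of the first two prompt tokens are hypothetical and not taken from the paper. Under joint training, `factor` would be a trainable scalar learned alongside `direction`; training itself is not shown here.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def steer(
    hidden: Sequence[Vector],
    direction: Vector,
    factor: float,
    positions: Optional[Sequence[int]] = None,
) -> List[Vector]:
    """Return a copy of `hidden` with `factor * direction` added at `positions`.

    `hidden` holds one vector per token. `positions=None` means every token.
    """
    targets = set(range(len(hidden))) if positions is None else set(positions)
    return [
        [h + factor * d for h, d in zip(vec, direction)] if i in targets else list(vec)
        for i, vec in enumerate(hidden)
    ]


# Toy hidden states: 3 prompt tokens followed by 2 generated tokens, width 2.
hidden = [[0.0, 0.0] for _ in range(5)]
direction = [1.0, -1.0]  # hypothetical steering direction
factor = 2.0             # hypothetical steering factor
prompt_len = 3

# Full-sequence intervention: every position, prompt and generation alike.
full_sequence = steer(hidden, direction, factor)

# Prompt-only intervention: only the first two prompt tokens (an assumed choice).
prompt_only = steer(hidden, direction, factor, positions=range(min(2, prompt_len)))
```

In the prompt-only case the generated positions are left untouched, which is the sense in which the intervention on the generation process is reduced.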

Yuntai Bao, Qinfeng Li, Xinyan Yu, Xuhong Zhang, Ge Su, Wenqi Zhang, Liu Yan, Haiqin Weng, Jianwei Yin • 2026

Related benchmarks

Task                    Dataset              Result                         Rank
Concept-based Steering  AXBENCH (test)       Overall Steering Score: 1.102  28
Concept Steering        AXBENCH D_L20^G9B    Steering Score: 0.905          12
Concept Steering        AXBENCH D_L10^G2B    Steering Score: 0.803          9
Concept Steering        AXBENCH D_L32^Q32B   Steering Score: 1.102          7
