Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions

About

Recently, steering vectors (SVs) have emerged as an effective and lightweight approach to steer behaviors of large language models (LLMs), among which fine-tuned SVs are more effective than optimization-free ones. However, current approaches to fine-tuned SVs suffer from two limitations. First, they require careful selection of steering factors on a per-SV basis to balance steering effectiveness and generation quality at inference time. Second, they operate as full-sequence SVs (FSSVs), which can sacrifice generation quality regardless of factor selection due to excessive intervention on the model generation process. To address the first limitation, we propose joint training of steering factors and directions, such that post-hoc factor selection is no longer required. Using neural network scaling theory, we find that moderately large initialization sizes and learning rates for steering factors are essential for stability and efficiency of joint training. To tackle the second limitation, we draw inspiration from representation fine-tuning and introduce Prompt-only SV (PrOSV), an SV that intervenes only on a few prompt tokens. Our empirical results show that PrOSV outperforms traditional FSSVs on AxBench when using our joint training scheme. We also find that PrOSV achieves a better tradeoff between general model utility and adversarial robustness than FSSV.
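
The difference between a full-sequence SV and a prompt-only SV can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a steering vector is applied by adding `factor * direction` to the hidden state of each targeted token. The function `apply_steering`, the toy shapes, and the choice of which prompt tokens to steer are hypothetical; in the paper both the factor and the direction are trained, whereas here they are fixed numbers.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def apply_steering(
    hidden: Sequence[Vector],
    direction: Vector,
    factor: float,
    positions: Optional[Sequence[int]] = None,
) -> List[Vector]:
    """Return a copy of `hidden` with factor * direction added at `positions`.

    positions=None steers every token (full-sequence style); otherwise only
    the listed token indices are modified (prompt-only style when the
    indices all fall inside the prompt).
    """
    targets = set(range(len(hidden))) if positions is None else set(positions)
    steered = []
    for i, h in enumerate(hidden):
        if i in targets:
            steered.append([x + factor * d for x, d in zip(h, direction)])
        else:
            steered.append(list(h))
    return steered


# Toy example: 3 prompt tokens followed by 2 generated tokens, hidden size 2.
hidden = [[0.0, 0.0] for _ in range(5)]
direction = [1.0, -1.0]  # stands in for a learned steering direction
factor = 2.0             # stands in for a learned steering factor

# Full-sequence intervention: every position is shifted.
full_sequence = apply_steering(hidden, direction, factor)

# Prompt-only intervention. Assumption: steer the last two prompt tokens
# (indices 1 and 2); the abstract does not say which prompt tokens are used.
prompt_only = apply_steering(hidden, direction, factor, positions=[1, 2])
```

In this sketch the generated-token positions (indices 3 and 4) are left untouched in the prompt-only case, which is the property the abstract credits for preserving generation quality.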

Yuntai Bao, Qinfeng Li, Xinyan Yu, Ge Su, Wenqi Zhang, Liu Yan, Haiqin Weng, Jianwei Yin, Xuhong Zhang • 2026

Related benchmarks

Task	Dataset	Result
Concept-based Steering	AXBENCH (test)	Overall Steering Score: 1.102	28
Concept Steering	AXBENCH D_L20^G9B	Steering Score: 0.905	12
Concept Steering	AXBENCH D_L10^G2B	Steering Score: 0.803	9
Concept Steering	AXBENCH D_L32^Q32B	Steering Score: 1.102	7
