Adaptive Multi-Subspace Representation Steering for Attribute Alignment in Large Language Models

About

Activation steering offers a promising approach to controlling the behavior of Large Language Models by directly manipulating their internal activations. However, most existing methods struggle to jointly steer multiple attributes, often resulting in interference and undesirable trade-offs. To address this challenge, we propose Multi-Subspace Representation Steering (MSRS), a novel framework for effective multi-attribute steering via subspace representation fine-tuning. MSRS reduces inter-attribute interference by allocating orthogonal subspaces to each attribute, isolating their influence within the model's representation space. MSRS also incorporates a hybrid subspace composition strategy: it combines attribute-specific subspaces for unique steering directions with a shared subspace for common steering directions. A dynamic weighting function learns to efficiently integrate these components for precise control. During inference, MSRS introduces a token-level steering mechanism that dynamically identifies and intervenes on the most semantically relevant tokens, enabling fine-grained behavioral modulation. Experimental results show that MSRS significantly reduces attribute conflicts, surpasses existing methods across a range of attributes, and generalizes effectively to diverse downstream tasks.
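The mechanics described above (orthogonal per-attribute subspaces, a shared subspace, weighted composition, and token-level intervention) can be illustrated with a small toy sketch. This is not the authors' implementation: the function names (`orthonormalize`, `project`, `steer`, `most_relevant_token`), the attribute labels `attr_a` and `attr_b`, the fixed `weights` dictionary standing in for the learned dynamic weighting function, and the projection-norm relevance score are all assumptions made for illustration, and the 4-dimensional vectors are made-up data.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormalize(vectors):
    """Gram-Schmidt: turn linearly independent vectors into an orthonormal basis."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > 1e-10:  # skip vectors that add no new direction
            basis.append([wi / norm for wi in w])
    return basis

def project(h, basis):
    """Project h onto the span of an orthonormal basis."""
    out = [0.0] * len(h)
    for b in basis:
        c = dot(h, b)
        out = [o + c * bi for o, bi in zip(out, b)]
    return out

def steer(h, subspaces, directions, weights):
    """Add each steering direction, restricted to its own subspace and scaled by its weight."""
    out = list(h)
    for name, basis in subspaces.items():
        delta = project(directions[name], basis)
        out = [o + weights[name] * d for o, d in zip(out, delta)]
    return out

def most_relevant_token(hidden_states, basis):
    """Hypothetical relevance score: norm of each token's projection onto the subspace."""
    scores = []
    for h in hidden_states:
        p = project(h, basis)
        scores.append(math.sqrt(dot(p, p)))
    return max(range(len(scores)), key=scores.__getitem__)

# Toy 4-dimensional representation space (made-up numbers).
raw = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, -1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, -1.0],
]
basis = orthonormalize(raw)

# Disjoint slices of one orthonormal basis are mutually orthogonal subspaces.
subspaces = {"shared": basis[0:1], "attr_a": basis[1:2], "attr_b": basis[2:4]}
directions = {
    "shared": [1.0, 1.0, 1.0, 1.0],
    "attr_a": [1.0, 0.0, 0.0, 0.0],
    "attr_b": [0.0, 0.0, 1.0, 0.0],
}
# Fixed weights stand in for the learned weighting; attr_b is switched off here.
weights = {"shared": 0.5, "attr_a": 1.0, "attr_b": 0.0}

# One hidden state per token; intervene only on the token most relevant to attr_a.
hidden_states = [
    [0.1, 0.1, 0.0, 0.0],
    [2.0, -2.0, 0.1, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
idx = most_relevant_token(hidden_states, subspaces["attr_a"])
steered = steer(hidden_states[idx], subspaces, directions, weights)
```

Because the subspaces are carved from a single orthonormal basis, steering along `attr_a` and `shared` leaves the component of the hidden state in `attr_b` unchanged. That isolation is the property the abstract credits with reducing inter-attribute interference.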

Xinyan Jiang, Lin Zhang, Jiayi Zhang, Qingsong Yang, Guimin Hu, Di Wang, Lijie Hu • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Commonsense Reasoning | HellaSwag | HellaSwag Accuracy: 87.4 | 711 |
| Multi-task Language Understanding | MMLU | MMLU Accuracy: 70.2 | 442 |
| Instruction Following | Alpaca | -- | 173 |
| Bias Evaluation | BBQ | Accuracy: 64.5 | 171 |
| General Language Understanding | GLUE | Accuracy: 83.2 | 75 |
| Reading Comprehension | RACE | Accuracy: 68.3 | 59 |
| Attribute Steering | HelpSteer | Helpfulness: 3.89 | 22 |
| Refusal Behavior | Refusal | Sorry Rate: 69.3 | 22 |
| Question Answering | TruthfulQA | MC1 Score: 34.91 | 22 |
| Natural Language Understanding | GLUE | SST-2 Accuracy: 97.99 | 12 |
Showing 10 of 12 rows
