Multi-Attribute Steering of Language Models via Targeted Intervention

About

Inference-time intervention (ITI) has emerged as a promising method for steering large language model (LLM) behavior in a particular direction (e.g., improving helpfulness) by intervening on token representations without costly updates to the LLM's parameters. However, existing ITI approaches fail to scale to multi-attribute settings with conflicts, such as enhancing helpfulness while also reducing toxicity. To address this, we introduce Multi-Attribute Targeted Steering (MAT-Steer), a novel steering framework designed for selective token-level intervention across multiple attributes. MAT-Steer learns steering vectors using an alignment objective that shifts the model's internal representations of undesirable outputs closer to those of desirable ones while enforcing sparsity and orthogonality among vectors for different attributes, thereby reducing inter-attribute conflicts. We evaluate MAT-Steer in two distinct settings: (i) on question answering (QA) tasks where we balance attributes like truthfulness, bias, and toxicity; (ii) on generative tasks where we simultaneously improve attributes like helpfulness, correctness, and coherence. MAT-Steer outperforms existing ITI and parameter-efficient fine-tuning approaches across both task types (e.g., 3% average accuracy gain across QA tasks and 55.82% win rate against the best ITI baseline).
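The abstract describes three ingredients: token-level selective intervention, sparsity, and orthogonality among per-attribute steering vectors. The NumPy sketch below illustrates how they could fit together. It is not the authors' implementation. The sigmoid gate, the function names (`steer`, `orthogonality_penalty`, `sparsity_penalty`, `alignment_gap`), and the mean-difference alignment term are all assumptions made for illustration; the abstract does not specify the exact form of the alignment objective.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def steer(hidden, vectors, gate_w, gate_b):
    """Add gated steering vectors to token representations.

    hidden:  (tokens, d) token representations at one layer
    vectors: (k, d) one steering vector per attribute
    gate_w:  (k, d) and gate_b: (k,) parameters of a hypothetical per-attribute gate
    """
    # Each token gets its own weight per attribute, so intervention is selective.
    gates = sigmoid(hidden @ gate_w.T + gate_b)   # (tokens, k)
    return hidden + gates @ vectors, gates

def orthogonality_penalty(vectors):
    # Sum of squared off-diagonal Gram entries; zero when the vectors are orthogonal.
    gram = vectors @ vectors.T
    off_diag = gram - np.diag(np.diag(gram))
    return float(np.sum(off_diag ** 2))

def sparsity_penalty(gates):
    # Mean absolute gate activation; smaller means fewer tokens are steered.
    return float(np.mean(np.abs(gates)))

def alignment_gap(steered_undesirable, desirable):
    # Stand-in alignment term (assumption): squared distance between the mean
    # steered "undesirable" representation and the mean "desirable" one.
    diff = steered_undesirable.mean(axis=0) - desirable.mean(axis=0)
    return float(np.sum(diff ** 2))
```

Under these assumptions, a training loss would combine `alignment_gap` with weighted `sparsity_penalty` and `orthogonality_penalty` terms, and only `vectors`, `gate_w`, and `gate_b` would be learned while the LLM's parameters stay frozen.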

Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal • 2025

Related benchmarks

Task	Dataset	Result
Instruction Following	Alpaca	--	173
Bias Evaluation	BBQ	Accuracy 64.1	171
Question Answering	TruthfulQA	Accuracy 61.94	164
Question Answering	OpenBookQA (OBQA) (test)	OBQA Accuracy 77.46	130
Toxicity Detection	Toxigen	Score 57.82	95
Truthful QA	Truthful QA	Accuracy 64.36	83
Logical reasoning	FOLIO (test)	Accuracy 80.6	58
Assistant Response Alignment (Helpfulness and Harmlessness)	HH-RLHF (test)	--	31
Toxicity Classification	Toxigen	Accuracy 60.41	22
Attribute Steering	HelpSteer	Helpfulness 3.84	22

Showing 10 of 23 rows
