Multi-Attribute Steering of Language Models via Targeted Intervention

About

Inference-time intervention (ITI) has emerged as a promising method for steering large language model (LLM) behavior in a particular direction (e.g., improving helpfulness) by intervening on token representations without costly updates to the LLM's parameters. However, existing ITI approaches fail to scale to multi-attribute settings with conflicts, such as enhancing helpfulness while also reducing toxicity. To address this, we introduce Multi-Attribute Targeted Steering (MAT-Steer), a novel steering framework designed for selective token-level intervention across multiple attributes. MAT-Steer learns steering vectors using an alignment objective that shifts the model's internal representations of undesirable outputs closer to those of desirable ones while enforcing sparsity and orthogonality among vectors for different attributes, thereby reducing inter-attribute conflicts. We evaluate MAT-Steer in two distinct settings: (i) on question answering (QA) tasks where we balance attributes like truthfulness, bias, and toxicity; (ii) on generative tasks where we simultaneously improve attributes like helpfulness, correctness, and coherence. MAT-Steer outperforms existing ITI and parameter-efficient fine-tuning approaches across both task types (e.g., 3% average accuracy gain across QA tasks and 55.82% win rate against the best ITI baseline).
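
The mechanism described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the sigmoid gate, the names `steer`, `gate_w`, and `gate_b`, and the mean-difference alignment term are assumptions chosen only to show how per-token selective intervention, an alignment term, and sparsity and orthogonality penalties might fit together.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def steer(hidden, vectors, gate_w, gate_b):
    """Add a gated mix of attribute steering vectors to each token.

    hidden:  (tokens, dim) token representations from one layer
    vectors: (attrs, dim)  one steering vector per attribute
    gate_w:  (attrs, dim)  hypothetical gate weights (assumption)
    gate_b:  (attrs,)      hypothetical gate biases (assumption)
    """
    # One gate value in (0, 1) per token and attribute, so the
    # intervention strength can differ from token to token.
    gates = sigmoid(hidden @ gate_w.T + gate_b)
    return hidden + gates @ vectors


def alignment_gap(steered_undesirable, desirable):
    """Squared distance between the two sets' mean representations.

    A simplified stand-in for an alignment objective (assumption).
    """
    diff = steered_undesirable.mean(axis=0) - desirable.mean(axis=0)
    return float(diff @ diff)


def orthogonality_penalty(vectors):
    """Sum of squared cosine similarities between distinct vectors."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    gram = unit @ unit.T
    off_diag = gram - np.diag(np.diag(gram))
    return float(np.sum(off_diag ** 2))


def sparsity_penalty(gates):
    """Mean absolute gate activation; lower means fewer tokens are steered."""
    return float(np.mean(np.abs(gates)))
```

In a sketch like this, a training loss would combine `alignment_gap` with weighted `sparsity_penalty` and `orthogonality_penalty` terms; the weights and the optimizer are left unspecified because the text above does not state them.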

Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal • 2025

Related benchmarks

Task | Dataset | Result | Rank
Question Answering | OpenBookQA (OBQA) (test) | OBQA Accuracy 77.46 | 130
Bias Evaluation | BBQ | Accuracy 62.59 | 99
Truthful QA | Truthful QA | Accuracy 64.36 | 83
Question Answering | TruthfulQA | Accuracy 61.94 | 82
Logical reasoning | FOLIO (test) | Accuracy 80.6 | 58
Assistant Response Alignment (Helpfulness and Harmlessness) | HH-RLHF (test) | -- | 31
Toxicity Detection | Toxigen | Score 57.82 | 25
Toxicity Classification | Toxigen | Accuracy 60.41 | 22
Reasoning | MuSR (test) | Accuracy 73.1 | 14
Reasoning | BBEH (test) | Accuracy 32.6 | 14
Showing 10 of 19 rows
