
Multi-Attribute Steering of Language Models via Targeted Intervention

About

Inference-time intervention (ITI) has emerged as a promising method for steering large language model (LLM) behavior in a particular direction (e.g., improving helpfulness) by intervening on token representations without costly updates to the LLM's parameters. However, existing ITI approaches fail to scale to multi-attribute settings with conflicts, such as enhancing helpfulness while also reducing toxicity. To address this, we introduce Multi-Attribute Targeted Steering (MAT-Steer), a novel steering framework designed for selective token-level intervention across multiple attributes. MAT-Steer learns steering vectors using an alignment objective that shifts the model's internal representations of undesirable outputs closer to those of desirable ones while enforcing sparsity and orthogonality among vectors for different attributes, thereby reducing inter-attribute conflicts. We evaluate MAT-Steer in two distinct settings: (i) on question answering (QA) tasks where we balance attributes like truthfulness, bias, and toxicity; (ii) on generative tasks where we simultaneously improve attributes like helpfulness, correctness, and coherence. MAT-Steer outperforms existing ITI and parameter-efficient fine-tuning approaches across both task types (e.g., 3% average accuracy gain across QA tasks and 55.82% win rate against the best ITI baseline).
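
To make the idea of gated, multi-attribute steering more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The sigmoid gate, the pairwise squared-dot-product penalty, the L1 gate penalty, and every name (`steer`, `gate_weights`, `gate_biases`, `orthogonality_penalty`, `sparsity_penalty`) are assumptions chosen for illustration. The abstract says only that intervention is selective at the token level and that sparsity and orthogonality are enforced among the attribute vectors. The alignment objective itself is not sketched.

```python
import math


def dot(a, b):
    """Dot product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def steer(h, vectors, gate_weights, gate_biases):
    """Shift one token representation h by a gated sum of attribute vectors.

    Assumed form (hypothetical): h' = h + sum_k g_k(h) * v_k,
    with g_k(h) = sigmoid(w_k . h + b_k) acting as a per-token gate.
    """
    gates = [sigmoid(dot(w, h) + b) for w, b in zip(gate_weights, gate_biases)]
    steered = list(h)
    for g, v in zip(gates, vectors):
        steered = [s + g * x for s, x in zip(steered, v)]
    return steered, gates


def orthogonality_penalty(vectors):
    """Sum of squared dot products over distinct vector pairs (assumed form).

    The value is zero exactly when all attribute vectors are mutually orthogonal.
    """
    total = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            total += dot(vectors[i], vectors[j]) ** 2
    return total


def sparsity_penalty(gates):
    """L1 penalty on gate activations (assumed form); pushes most gates toward 0."""
    return sum(abs(g) for g in gates)
```

In this sketch, a token whose gates are near zero is left almost unchanged, which is one way to read "selective" intervention. Keeping the attribute vectors close to orthogonal means a shift for one attribute has little component along another, which is the intuition behind reducing inter-attribute conflicts.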

Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Question Answering | TruthfulQA | Accuracy 61.94 | 152 |
| Question Answering | OpenBookQA (OBQA) (test) | OBQA Accuracy 77.46 | 130 |
| Bias Evaluation | BBQ | Accuracy 62.59 | 113 |
| Truthful QA | Truthful QA | Accuracy 64.36 | 83 |
| Logical reasoning | FOLIO (test) | Accuracy 80.6 | 58 |
| Toxicity Detection | Toxigen | Score 57.82 | 53 |
| Assistant Response Alignment (Helpfulness and Harmlessness) | HH-RLHF (test) | -- | 31 |
| Toxicity Classification | Toxigen | Accuracy 60.41 | 22 |
| Reasoning | MuSR (test) | Accuracy 73.1 | 14 |
| Reasoning | BBEH (test) | Accuracy 32.6 | 14 |
Showing 10 of 19 rows
