
Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors

About

Large language models (LLMs) have achieved remarkable performance across many tasks, yet aligning them with desired behaviors remains challenging. Activation intervention has emerged as an effective and economical method to modify the behavior of LLMs. Despite considerable interest in this area, current intervention methods exclusively employ a fixed steering vector to modify model activations, lacking adaptability to diverse input semantics. To address this limitation, we propose Semantics-Adaptive Dynamic Intervention (SADI), a novel method that constructs a dynamic steering vector to intervene on model activations at inference time. More specifically, SADI utilizes activation differences in contrastive pairs to precisely identify critical elements of an LLM (i.e., attention heads, hidden states, and neurons) for targeted intervention. During inference, SADI dynamically steers model behavior by scaling element-wise activations based on the directions of input semantics. Experimental results show that SADI outperforms established baselines by substantial margins, improving task performance without training. SADI's cost-effectiveness and generalizability across various LLM backbones and tasks highlight its potential as a versatile alignment technique.
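
The two stages described in the abstract (identifying critical elements from contrastive pairs, then scaling the input's own activations at inference) can be illustrated with a small sketch. This is a loose reading of the abstract, not the authors' implementation: the function names, the top-k selection rule, the scaling factor `delta`, and the update form `a + delta * (a * m)` are all assumptions made for illustration, and the toy vectors stand in for real model activations.

```python
from typing import List, Sequence


def mean_difference(pos: Sequence[Sequence[float]],
                    neg: Sequence[Sequence[float]]) -> List[float]:
    """Average element-wise activation difference over contrastive pairs."""
    n = len(pos)
    dim = len(pos[0])
    return [sum(p[i] - q[i] for p, q in zip(pos, neg)) / n for i in range(dim)]


def top_k_mask(diff: Sequence[float], k: int) -> List[int]:
    """Binary mask marking the k elements with the largest absolute difference.

    Assumption: 'critical elements' are chosen by magnitude of the mean difference.
    """
    ranked = sorted(range(len(diff)), key=lambda i: abs(diff[i]), reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(diff))]


def steer(activation: Sequence[float], mask: Sequence[int], delta: float) -> List[float]:
    """Scale the masked elements of the input's own activation.

    Assumed update: a + delta * (a * m). The added term depends on the
    current input, so the effective steering vector changes per input.
    """
    return [a + delta * a * m for a, m in zip(activation, mask)]


# Toy activations for two contrastive pairs (hypothetical values).
pos = [[1.0, 0.2, 3.0, 0.5], [1.2, 0.1, 2.8, 0.4]]
neg = [[0.9, 0.2, 1.0, 0.5], [1.1, 0.3, 1.2, 0.6]]

diff = mean_difference(pos, neg)
mask = top_k_mask(diff, k=1)

# At inference, only the selected element of the new input is rescaled.
steered = steer([2.0, -1.0, 0.5, 4.0], mask, delta=2.0)
print(mask, steered)
```

In contrast to a fixed steering vector, which adds the same offset to every input, the term added here is a function of the input activation itself, which is one way to read "dynamic" in the abstract.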

Weixuan Wang, Jingyuan Yang, Wei Peng • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mathematical Reasoning | GSM8K (test) | Accuracy 91.95 | 797 |
| Mathematical Reasoning | MATH500 (test) | Accuracy 55.94 | 381 |
| Question Answering | BoolQ | Accuracy 70.4 | 240 |
| Question Answering | WinoGrande (WG) | Accuracy 51.93 | 98 |
| Multiple-Choice | TruthfulQA | MC1 Accuracy 39.28 | 83 |
| Story completion | StoryCloze | Accuracy 75.72 | 65 |
| Open-domain Question Answering | TriviaQA | EM 43.5 | 62 |
| Question Answering | COPA | Accuracy 84 | 59 |
| Multiple-choice Question Answering | TruthfulQA MC1 | MC1 Accuracy 38.53 | 33 |
| Toxicity Detection | Toxigen | Score 17.14 | 25 |
