DISCO: Disentangled Communication Steering for Large Language Models
About
A variety of recent methods guide large language model outputs via the inference-time addition of steering vectors to residual-stream or attention-head representations. In contrast, we propose to inject steering vectors directly into the query and value representation spaces within attention heads. We provide evidence that a greater portion of these spaces than of attention head outputs exhibits high linear discriminability of concepts, a key property motivating the use of steering vectors. We analytically characterize the effect of our method, which we term DISentangled COmmunication (DISCO) Steering, on attention head outputs. Our analysis reveals that DISCO disentangles a strong but underutilized baseline, steering attention inputs, which implicitly modifies queries and values in a rigid manner. In contrast, DISCO's direct modulation of these components enables more granular control. We find that DISCO achieves superior performance over a number of steering vector baselines across multiple datasets on LLaMA 3.1 8B and Gemma 2 9B, with steering efficacy scoring up to 19.1% higher than the runner-up. Our results support the conclusion that the query and value spaces are powerful building blocks for steering vector methods.
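To make the idea of steering in query and value space concrete, here is a minimal NumPy sketch of a single causal attention head. It is not the authors' implementation. The function name `head_output`, the steering vectors `s_q` and `s_v`, and the scales `alpha_q` and `alpha_v` are hypothetical names chosen for illustration. The head also omits positional encodings, normalization, biases, and the output projection, all of which are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def head_output(x, W_q, W_k, W_v, s_q=None, s_v=None, alpha_q=0.0, alpha_v=0.0):
    """Toy single-head causal attention with optional query/value steering.

    x: (T, d_model) token representations entering the head.
    W_q, W_k, W_v: (d_model, d_head) projection matrices.
    s_q, s_v: (d_head,) steering vectors (hypothetical; how they are
        estimated is outside the scope of this sketch).
    alpha_q, alpha_v: independent steering strengths for queries and values.
    """
    q = x @ W_q
    k = x @ W_k
    v = x @ W_v
    # Steering is applied directly in query and value space,
    # with a separate scale for each.
    if s_q is not None:
        q = q + alpha_q * s_q
    if s_v is not None:
        v = v + alpha_v * s_v
    T, d_head = q.shape
    scores = q @ k.T / np.sqrt(d_head)
    # Causal mask: token i may only attend to tokens j <= i.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d_model, d_head = 5, 8, 4
    x = rng.normal(size=(T, d_model))
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    s = rng.normal(size=d_model)  # hypothetical steering vector in input space

    # Steering the attention input with one scale ...
    out_input = head_output(x + 0.5 * s, W_q, W_k, W_v)
    # ... versus steering queries and values with the same scale.
    out_qv = head_output(x, W_q, W_k, W_v,
                         s_q=s @ W_q, s_v=s @ W_v, alpha_q=0.5, alpha_v=0.5)
    print(np.allclose(out_input, out_qv))
```

In this toy setting, adding a constant vector to the head's input shifts every key by the same amount. That adds the same offset to every score in a row and leaves the softmax unchanged. Input steering therefore reduces to steering queries and values with one shared scale, whereas the sketch lets `alpha_q` and `alpha_v` be set independently. Whether the same reduction holds in a full model with positional encodings and normalization is not something this sketch establishes.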
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Truthfulness Steering | TruthfulQA | T×I Score | 78.66 | 28 |
| Instruction Following | IFBench | Accuracy | 11.5 | 18 |
| Cognitive style steering | Bloom's Taxonomy Phi generations (test) | Remember Hit Rate | 3.9 | 14 |
| Model Steering | Steering Evaluation Suite Power, Wealth, Corr, TQA Gemma-2-9B-IT (test) | Power | 2.61 | 10 |
| Question Answering | TruthfulQA | True*Info Score (TQA) | 81.6 | 10 |
| Steering | Power | LLM Judge Score | 2.91 | 10 |
| Steering | Wealth | LLM Judge Score | 2.25 | 10 |
| Steering | Corrigibility | LLM Judge Score | 3.22 | 10 |
| Mathematical Reasoning | GSM8K | Accuracy | 22.5 | 10 |
| Question Answering | ARC Challenge | Accuracy | 34.2 | 10 |