
Weight Updates as Activation Shifts: A Principled Framework for Steering

About

Activation steering promises to be an extremely parameter-efficient form of adaptation, but its effectiveness depends on critical design choices -- such as intervention location and parameterization -- that currently rely on empirical heuristics rather than a principled foundation. We establish a first-order equivalence between activation-space interventions and weight-space updates, deriving the conditions under which activation steering can replicate fine-tuning behavior. This equivalence yields a principled framework for steering design and identifies the post-block output as a theoretically backed and highly expressive intervention site. We further explain why certain intervention locations outperform others and show that weight updates and activation updates play distinct, complementary functional roles. This analysis motivates a new approach -- joint adaptation -- that trains in both spaces simultaneously. Our post-block steering achieves accuracy within 0.2%-0.9% of full-parameter tuning, on average across tasks and models, while training only 0.04% of model parameters. It consistently outperforms prior activation steering methods such as ReFT and PEFT approaches including LoRA, while using significantly fewer parameters. Finally, we show that joint adaptation often surpasses the performance ceilings of weight and activation updates in isolation, introducing a new paradigm for efficient model adaptation.
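To make "post-block steering" concrete, the sketch below shows one way such an intervention could be realized in PyTorch: a single learned vector is added to each transformer block's output while the base weights stay frozen, which is why the trainable parameter count is a tiny fraction of the full model. The wrapper class, attribute names, and tuple handling are illustrative assumptions, not the authors' implementation. The intuition behind the first-order equivalence is that a weight update ΔW applied to a layer with input h shifts that layer's output by ΔW·h, so to first order the same behavioral change can be produced by adding a learned vector at that site.

import torch
import torch.nn as nn

class PostBlockSteering(nn.Module):
    """Wrap a transformer block and add a learned steering vector to its output.

    Illustrative sketch only: the base block is frozen and a single vector of
    size `hidden_size` is trained, so the added parameter count per block is
    just `hidden_size`.
    """
    def __init__(self, block: nn.Module, hidden_size: int):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad_(False)  # freeze the original block weights
        # Learned post-block steering vector, initialized to zero (no-op at start).
        self.steer = nn.Parameter(torch.zeros(hidden_size))

    def forward(self, hidden_states, *args, **kwargs):
        out = self.block(hidden_states, *args, **kwargs)
        if isinstance(out, tuple):
            # Some block implementations return (hidden_states, ...) tuples.
            return (out[0] + self.steer,) + out[1:]
        # Broadcasts over batch and sequence dimensions.
        return out + self.steer

# Hypothetical wiring example (attribute names vary across model families):
# for i, blk in enumerate(model.layers):
#     model.layers[i] = PostBlockSteering(blk, hidden_size=model.config.hidden_size)

Joint adaptation, as described in the abstract, would additionally unfreeze some weight-space parameters (e.g., a low-rank update) and train them together with the steering vectors.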

Dyah Adila, John Cooper, Alexander Yun, Avi Trost, Frederic Sala • 2026

Related benchmarks

Task                        Dataset                        Metric         Result  Rank
Commonsense Reasoning       WinoGrande                     Accuracy       86.6    1085
Question Answering          ARC Challenge                  Accuracy       88.4    906
Mathematical Reasoning      GSM8K                          Accuracy       43.4    499
Logical Reasoning           ListOps                        Accuracy       74      32
Boolean Question Answering  BoolQ                          Accuracy       92.3    29
Question Answering          BoolQ                          Accuracy       91.7    16
Commonsense Reasoning       WinoGrande                     Accuracy       87.2    16
Algebraic Reasoning         AQUA-RAT                       Accuracy       65      16
Language Modeling           NLP Benchmark Suite Aggregate  Average Delta  -4.7    16
Instruction Tuning          AlpacaEval 2.0 (test)          Win Rate (LC)  11.34   7
