DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment
About
Preference alignment is usually achieved by weight-updating training on preference data, which adds substantial alignment-stage compute and provides limited mechanistic visibility. We propose Dynamic SAE Steering for Preference Alignment (DSPA), an inference-time method that makes sparse autoencoder (SAE) steering prompt-conditional. From preference triples, DSPA computes a conditional-difference map linking prompt features to generation-control features; during decoding, it modifies only token-active latents, without base-model weight updates. Across Gemma-2-2B/9B and Qwen3-8B, DSPA improves MT-Bench and is competitive on AlpacaEval while preserving multiple-choice accuracy. Under restricted preference data, DSPA remains robust and can rival the two-stage RAHF-SCIT pipeline while requiring up to $4.47\times$ fewer alignment-stage FLOPs. Finally, we audit the SAE features DSPA modifies, finding that preference directions are dominated by discourse and stylistic signals, and provide theory clarifying the conditional-difference map estimate and when top-$k$ ablation is principled.
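The steering mechanism the abstract describes — modifying only the SAE latents that are active on the current token, with no base-model weight updates — can be sketched as follows. This is a minimal NumPy illustration under assumed shapes: `W_enc`/`W_dec` stand in for a trained SAE's encoder/decoder, and `delta` is a hypothetical per-latent steering coefficient vector (in DSPA it would come from the conditional-difference map over preference triples; here it is a random placeholder).

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # toy residual-stream and SAE widths

# Hypothetical stand-ins for a trained SAE's encoder and decoder weights.
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)

# Placeholder per-latent steering coefficients; in DSPA these would be
# derived from the conditional-difference map, not sampled at random.
delta = 0.1 * rng.normal(size=d_sae)

def steer(h, alpha=1.0):
    """Steer a residual-stream vector h, touching only token-active latents."""
    z = np.maximum(W_enc.T @ h, 0.0)        # SAE latent activations (ReLU)
    active = z > 0.0                        # latents active on this token
    z_steered = z + alpha * delta * active  # leave inactive latents untouched
    # Map the latent-space change back to the residual stream via the decoder.
    return h + W_dec.T @ (z_steered - z)

h = rng.normal(size=d_model)
h_steered = steer(h)
```

With `alpha=0.0` the function is the identity, which makes the inference-time nature of the intervention explicit: the base model's weights and un-steered activations are never changed, only the decoded contribution of active latents is shifted.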
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 77.3 | 350 |
| Multiple-choice Question Answering | ARC Easy | Accuracy | 84.3 | 188 |
| Multiple-choice Question Answering | MMLU | Accuracy | 71.2 | 185 |
| Truthful Question Answering | TruthfulQA MC2 | MC2 Accuracy | 45.4 | 46 |
| Open-ended Generation | AlpacaEval | Win Rate vs Davinci-003 | 20 | 40 |
| Conversational Ability | MT-Bench | MT-Bench Score | 5.02 | 28 |
| Multiple-choice Commonsense Reasoning | WinoGrande | Accuracy | 73.8 | 18 |