
The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features

About

Prevailing alignment methods induce opaque parameter changes, obscuring what models truly learn. To address this, we introduce Feature Steering with Reinforcement Learning (FSRL), a framework that trains a lightweight adapter to steer model behavior by modulating interpretable sparse features. First, we theoretically demonstrate that this mechanism is expressive enough to approximate the behavioral shifts of post-training processes. We then apply FSRL to preference optimization and perform a causal analysis of the learned policy. Our analysis reveals a crucial insight: the model learns to reward stylistic presentation as a proxy for quality, disproportionately relying on features related to style and formatting over those tied to alignment concepts like honesty. By effectively optimizing the preference objective, FSRL serves as a transparent proxy for observing the alignment process. Overall, FSRL offers an interpretable control interface and a practical way to diagnose how preference optimization pressures manifest at the feature level.
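The steering mechanism described above can be illustrated with a minimal sketch: a frozen sparse autoencoder (SAE) maps residual-stream activations into interpretable features, and a lightweight adapter modulates those features before decoding the intervention back into the residual stream. All dimensions, weights, and steering coefficients below are illustrative placeholders, not the paper's actual architecture or trained values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: d_model = residual stream width, d_sae = SAE dictionary size.
d_model, d_sae = 16, 64

# Stand-ins for a pretrained, frozen sparse autoencoder.
W_enc = rng.standard_normal((d_model, d_sae)) * 0.1
W_dec = rng.standard_normal((d_sae, d_model)) * 0.1

# Lightweight steering adapter: one coefficient per interpretable feature.
# In FSRL these would be learned against the preference objective; here they
# are fixed constants chosen for illustration.
steer = np.zeros(d_sae)
steer[3] = 1.5   # e.g. amplify a hypothetical "formatting" feature
steer[7] = -0.8  # e.g. suppress another feature

def fsrl_forward(h):
    """Steer a residual-stream activation h by modulating SAE features."""
    f = np.maximum(h @ W_enc, 0.0)   # sparse feature activations (ReLU)
    delta = (f * steer) @ W_dec      # decode the feature-level intervention
    return h + delta                 # steered activation back in model space

h = rng.standard_normal(d_model)
h_steered = fsrl_forward(h)
print(h_steered.shape)  # (16,)
```

Because the intervention lives entirely in the SAE's feature basis, each nonzero entry of `steer` corresponds to a nameable feature, which is what makes the learned policy amenable to the causal analysis the abstract describes.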

Jeremias Ferrao, Matthijs van der Lende, Ilija Lichkovski, Clement Neo • 2025

Related benchmarks

Task                                  Dataset          Metric                    Result   Rank
Commonsense Reasoning                 HellaSwag        Accuracy                  77.6     350
Multiple-choice Question Answering    ARC Easy         Accuracy                  84.1     188
Multiple-choice Question Answering    MMLU             Accuracy                  71.2     185
Truthful Question Answering           TruthfulQA MC2   MC2 Accuracy              44.1     46
Open-ended                            AlpacaEval       Win Rate vs Davinci-003   13.56    40
Conversational Ability                MT-Bench         MT-Bench Score            4.33     28
Multiple-choice Commonsense Reasoning WinoGrande       Accuracy                  74       18
