
The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features

About

Prevailing alignment methods induce opaque parameter changes, obscuring what models truly learn. To address this, we introduce Feature Steering with Reinforcement Learning (FSRL), a framework that trains a lightweight adapter to steer model behavior by modulating interpretable sparse features. First, we theoretically demonstrate that this mechanism is expressive enough to approximate the behavioral shifts of post-training processes. We then apply FSRL to preference optimization and perform a causal analysis of the learned policy. Our analysis reveals a crucial insight: the model learns to reward stylistic presentation as a proxy for quality, disproportionately relying on features related to style and formatting over those tied to alignment concepts like honesty. By effectively optimizing the preference objective, FSRL serves as a transparent proxy for observing the alignment process. Overall, FSRL offers an interpretable control interface and a practical way to diagnose how preference optimization pressures manifest at the feature level.
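The steering mechanism described above can be illustrated with a minimal sketch: a frozen sparse autoencoder (SAE) maps residual-stream activations into interpretable features, and a lightweight adapter modulates those features before decoding the intervention back into the residual stream. All dimensions, weights, and steering coefficients below are illustrative placeholders, not the paper's actual architecture or trained values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: d_model = residual stream width, d_sae = SAE dictionary size.
d_model, d_sae = 16, 64

# Stand-ins for a pretrained, frozen sparse autoencoder.
W_enc = rng.standard_normal((d_model, d_sae)) * 0.1
W_dec = rng.standard_normal((d_sae, d_model)) * 0.1

# Lightweight steering adapter: one coefficient per interpretable feature.
# In FSRL these would be learned against the preference objective; here they
# are fixed constants chosen for illustration.
steer = np.zeros(d_sae)
steer[3] = 1.5   # e.g. amplify a hypothetical "formatting" feature
steer[7] = -0.8  # e.g. suppress another feature

def fsrl_forward(h):
    """Steer a residual-stream activation h by modulating SAE features."""
    f = np.maximum(h @ W_enc, 0.0)   # sparse feature activations (ReLU)
    delta = (f * steer) @ W_dec      # decode the feature-level intervention
    return h + delta                 # steered activation back in model space

h = rng.standard_normal(d_model)
h_steered = fsrl_forward(h)
print(h_steered.shape)  # (16,)
```

Because the intervention lives entirely in the SAE's feature basis, each nonzero entry of `steer` corresponds to a nameable feature, which is what makes the learned policy amenable to the causal analysis the abstract describes.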

Jeremias Ferrao, Matthijs van der Lende, Ilija Lichkovski, Clement Neo • 2025

Related benchmarks

Task                                  Dataset          Metric                    Result   Rank
Commonsense Reasoning                 HellaSwag        Accuracy                  77.6     350
Multiple-choice Question Answering    ARC Easy         Accuracy                  84.1     188
Multiple-choice Question Answering    MMLU             Accuracy                  71.2     185
Truthful Question Answering           TruthfulQA MC2   MC2 Accuracy              44.1     46
Open-ended                            AlpacaEval       Win Rate vs Davinci-003   13.56    40
Conversational Ability                MT-Bench         MT-Bench Score            4.33     28
Multiple-choice Commonsense Reasoning WinoGrande       Accuracy                  74       18
