
A Few Bad Neurons: Isolating and Surgically Correcting Sycophancy

About

Behavioral alignment in large language models (LLMs) is often achieved through broad fine-tuning, which can result in undesired side effects such as distributional shift and low interpretability. We propose an alignment method that identifies and updates only the neurons most responsible for a given behavior, a targeted approach that allows fine-tuning with significantly less data. Using sparse autoencoders (SAEs) and linear probes, we isolate the 3% of MLP neurons most predictive of a target behavior, decode them into residual space, and fine-tune only those neurons using gradient masking. We demonstrate this approach on the task of reducing sycophantic behavior, where our method matches or exceeds state-of-the-art performance on four benchmarks (Syco-Bench, NLP, POLI, PHIL) using Gemma-2-2B and 9B models. Our results show that sparse, neuron-level updates offer a scalable and precise alternative to full-model fine-tuning, remaining effective even when little data is available.
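The gradient-masking step described above can be illustrated with a minimal PyTorch sketch. This is not the paper's implementation: the layer is a toy stand-in for one MLP layer, and the `selected` indices are a hypothetical placeholder for the neurons an SAE/probe analysis would flag. The sketch only shows the mechanism, namely that registering a gradient hook which zeroes all rows outside the mask confines the optimizer step to the chosen neurons.

```python
# Sketch of neuron-level gradient masking (assumption: PyTorch).
# `selected` is a hypothetical stand-in for the "top 3%" neuron
# indices identified via SAEs and linear probes in the paper.
import torch
import torch.nn as nn

torch.manual_seed(0)

mlp = nn.Linear(8, 8)                    # toy stand-in for one MLP layer
selected = torch.tensor([1, 5])          # hypothetical behavior-linked neurons

# Build 0/1 masks over output neurons; hooks multiply incoming
# gradients by the mask, so non-selected rows receive zero gradient.
w_mask = torch.zeros_like(mlp.weight)
w_mask[selected] = 1.0
mlp.weight.register_hook(lambda g: g * w_mask)

b_mask = torch.zeros_like(mlp.bias)
b_mask[selected] = 1.0
mlp.bias.register_hook(lambda g: g * b_mask)

before = mlp.weight.detach().clone()

opt = torch.optim.SGD(mlp.parameters(), lr=0.1)
x = torch.randn(4, 8)
loss = mlp(x).pow(2).mean()              # toy loss for demonstration
loss.backward()
opt.step()

# Only the selected rows of the weight matrix should have moved.
changed = (mlp.weight.detach() != before).any(dim=1)
print(changed)
```

Because the hook runs during backpropagation, any optimizer (SGD, Adam, etc.) sees zero gradients for the frozen neurons, so the rest of the network is untouched without needing to snapshot or restore parameters.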

Claire O'Brien, Jessica Seto, Dristi Roy, Aditya Dwivedi, Sunishchal Dev, Kevin Zhu, Sean O'Brien, Ashwinee Panda, Ryan Lagasse • 2026

Related benchmarks

Task                    Dataset                 Metric                        Result   Rank
Sycophancy Evaluation   Poli                    Sycophantic Preference (%)    92.18    10
Sycophancy Evaluation   Syco-Bench              Pickside Score                0.8      10
Sycophancy Evaluation   Open-Ended Sycophancy   Syc Score                     44.44    10
Sycophancy Evaluation   NLP                     Sycophancy Preference         50       10
Sycophancy Evaluation   PHIL                    Sycophancy Preference         69.56    10
