
A Few Bad Neurons: Isolating and Surgically Correcting Sycophancy

About

Behavioral alignment in large language models (LLMs) is often achieved through broad fine-tuning, which can result in undesired side effects such as distributional shift and low interpretability. We propose an alignment method that identifies and updates only the neurons most responsible for a given behavior, a targeted approach that allows fine-tuning with significantly less data. Using sparse autoencoders (SAEs) and linear probes, we isolate the 3% of MLP neurons most predictive of a target behavior, decode them into residual space, and fine-tune only those neurons using gradient masking. We demonstrate this approach on the task of reducing sycophantic behavior, where our method matches or exceeds state-of-the-art performance on four benchmarks (Syco-Bench, NLP, POLI, PHIL) using Gemma-2-2B and 9B models. Our results show that sparse, neuron-level updates offer a scalable and precise alternative to full-model fine-tuning, remaining effective even when little data is available.
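The gradient-masking step described above can be illustrated with a minimal PyTorch sketch. This is not the paper's implementation: the layer is a toy stand-in for one MLP layer, and the `selected` indices are a hypothetical placeholder for the neurons an SAE/probe analysis would flag. The sketch only shows the mechanism, namely that registering a gradient hook which zeroes all rows outside the mask confines the optimizer step to the chosen neurons.

```python
# Sketch of neuron-level gradient masking (assumption: PyTorch).
# `selected` is a hypothetical stand-in for the "top 3%" neuron
# indices identified via SAEs and linear probes in the paper.
import torch
import torch.nn as nn

torch.manual_seed(0)

mlp = nn.Linear(8, 8)                    # toy stand-in for one MLP layer
selected = torch.tensor([1, 5])          # hypothetical behavior-linked neurons

# Build 0/1 masks over output neurons; hooks multiply incoming
# gradients by the mask, so non-selected rows receive zero gradient.
w_mask = torch.zeros_like(mlp.weight)
w_mask[selected] = 1.0
mlp.weight.register_hook(lambda g: g * w_mask)

b_mask = torch.zeros_like(mlp.bias)
b_mask[selected] = 1.0
mlp.bias.register_hook(lambda g: g * b_mask)

before = mlp.weight.detach().clone()

opt = torch.optim.SGD(mlp.parameters(), lr=0.1)
x = torch.randn(4, 8)
loss = mlp(x).pow(2).mean()              # toy loss for demonstration
loss.backward()
opt.step()

# Only the selected rows of the weight matrix should have moved.
changed = (mlp.weight.detach() != before).any(dim=1)
print(changed)
```

Because the hook runs during backpropagation, any optimizer (SGD, Adam, etc.) sees zero gradients for the frozen neurons, so the rest of the network is untouched without needing to snapshot or restore parameters.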

Claire O'Brien, Jessica Seto, Dristi Roy, Aditya Dwivedi, Sunishchal Dev, Kevin Zhu, Sean O'Brien, Ashwinee Panda, Ryan Lagasse • 2026

Related benchmarks

Task                    Dataset                 Metric                        Result   Rank
Sycophancy Evaluation   Poli                    Sycophantic Preference (%)    92.18    10
Sycophancy Evaluation   Syco-Bench              Pickside Score                0.8      10
Sycophancy Evaluation   Open-Ended Sycophancy   Syc Score                     44.44    10
Sycophancy Evaluation   NLP                     Sycophancy Preference         50       10
Sycophancy Evaluation   PHIL                    Sycophancy Preference         69.56    10
