Attribution Patching Outperforms Automated Circuit Discovery

About

Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models. Existing automated circuit discovery work applies activation patching to identify subnetworks responsible for solving specific tasks (circuits). In this work, we show that a simple method based on attribution patching outperforms all existing methods while requiring just two forward passes and a backward pass. We apply a linear approximation to activation patching to estimate the importance of each edge in the computational subgraph. Using this approximation, we prune the least important edges of the network. We survey the performance and limitations of this method, finding that, averaged over all tasks, our method achieves a greater circuit-recovery AUC than other methods.
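The linear approximation described in the abstract can be illustrated on a toy model. The sketch below is not the authors' implementation: the two-layer model, the squared metric, the hand-derived gradient, and every function name are assumptions made for illustration, and it scores hidden components rather than edges of a real transformer graph. It estimates the effect of patching each component as (corrupted activation - clean activation) times the gradient of the metric at the clean activation, using two forward passes and one gradient computation, and compares that estimate with exact activation patching, which needs one extra forward pass per component.

```python
import math


def hidden(x, W):
    """Toy hidden layer (hypothetical model): h = tanh(W x)."""
    return [math.tanh(sum(w * xj for w, xj in zip(row, x))) for row in W]


def metric(h, v):
    """Toy nonlinear metric on the hidden activations: (v . h)^2."""
    s = sum(vi * hi for vi, hi in zip(v, h))
    return s * s


def grad_metric(h, v):
    """Hand-derived gradient of the metric: d/dh_i (v . h)^2 = 2 (v . h) v_i."""
    s = sum(vi * hi for vi, hi in zip(v, h))
    return [2.0 * s * vi for vi in v]


def attribution_scores(x_clean, x_corr, W, v):
    """Linear estimate of each component's patching effect."""
    h_clean = hidden(x_clean, W)   # forward pass 1 (clean input)
    h_corr = hidden(x_corr, W)     # forward pass 2 (corrupted input)
    g = grad_metric(h_clean, v)    # stands in for the single backward pass
    return [(hc - hl) * gi for hc, hl, gi in zip(h_corr, h_clean, g)]


def exact_patching_effects(x_clean, x_corr, W, v):
    """Exact activation patching: one extra metric evaluation per component."""
    h_clean = hidden(x_clean, W)
    h_corr = hidden(x_corr, W)
    base = metric(h_clean, v)
    effects = []
    for i in range(len(h_clean)):
        patched = list(h_clean)
        patched[i] = h_corr[i]     # swap in one corrupted activation
        effects.append(metric(patched, v) - base)
    return effects


def prune_least_important(scores, k):
    """Indices of the k components with the smallest absolute score."""
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]))
    return order[:k]


if __name__ == "__main__":
    # Arbitrary illustrative weights and inputs (assumptions, not paper data).
    W = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
    v = [1.0, 0.5, -2.0]
    x_clean, x_corr = [0.5, -0.3], [0.6, -0.3]
    approx = attribution_scores(x_clean, x_corr, W, v)
    exact = exact_patching_effects(x_clean, x_corr, W, v)
    for i, (a, e) in enumerate(zip(approx, exact)):
        print(f"component {i}: attribution={a:+.4f} exact={e:+.4f}")
    print("pruned first:", prune_least_important(approx, 1))
```

In this toy setting the estimate tracks the exact effect closely because the corrupted input is a small perturbation of the clean one. The gap grows with the size of the activation difference, since the approximation drops the higher-order terms.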

Aaquib Syed, Can Rager, Arthur Conmy • 2023

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Circuit localization	IOI	CPR	1.29	30
Circuit localization	Mixing dataset IOI	CMD	0.026	28
Circuit localization	Mixing dataset All tasks	CMD	0.042	28
Circuit localization	Mixing dataset	CMD	0.041	28
Circuit localization	Mixing dataset All tasks 1.0 (test)	CPR	0.956	28
Circuit localization	Indirect Object Identification (IOI) 1.0 (test)	CPR	0.984	28
Circuit localization	Sequence Completion 1.0 (test)	CPR	0.958	28
Circuit localization	MCQA	CPR	1.49	21
Circuit localization	Mixing dataset Entity Binding	CMD	0.026	18
Circuit localization	Entity-binding 1.0 (test)	CPR	0.981	18

Showing 10 of 40 rows
