
Attribution Patching Outperforms Automated Circuit Discovery

About

Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models. Existing automated circuit discovery work applies activation patching to identify the subnetworks (circuits) responsible for solving specific tasks. In this work, we show that a simple method based on attribution patching outperforms all existing methods while requiring just two forward passes and a backward pass. We apply a linear approximation to activation patching to estimate the importance of each edge in the computational subgraph. Using this approximation, we prune the least important edges of the network. We survey the performance and limitations of this method, finding that, averaged over all tasks, our method achieves a greater AUC for circuit recovery than other methods.
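
The linear approximation described in the abstract can be illustrated on a toy model. The sketch below is not the authors' implementation: the tiny two-layer network, the squared-error metric, and every function name (`hidden`, `metric`, `metric_grad`, `attribution_scores`, `patching_effects`, `prune`) are assumptions made for this example. It scores each hidden-to-output edge as (corrupted activation minus clean activation) times the gradient of the metric at the clean activation, and compares that estimate with the exact change obtained by patching one edge at a time. Ranking edges by absolute score is also an assumption of this sketch.

```python
import math


def hidden(x, w):
    """Toy hidden layer: one tanh unit per weight (hypothetical model)."""
    return [math.tanh(wi * x) for wi in w]


def metric(h, v, target):
    """Squared error of the linear readout y = sum(v_i * h_i)."""
    y = sum(vi * hi for vi, hi in zip(v, h))
    return (y - target) ** 2


def metric_grad(h, v, target):
    """Analytic gradient of the metric with respect to each hidden activation."""
    y = sum(vi * hi for vi, hi in zip(v, h))
    return [2.0 * (y - target) * vi for vi in v]


def attribution_scores(x_clean, x_corr, w, v, target):
    """First-order estimate of each edge's patching effect."""
    h_clean = hidden(x_clean, w)             # forward pass 1 (clean input)
    h_corr = hidden(x_corr, w)               # forward pass 2 (corrupted input)
    grad = metric_grad(h_clean, v, target)   # backward pass on the clean run
    return [(hc - hk) * g for hc, hk, g in zip(h_corr, h_clean, grad)]


def patching_effects(x_clean, x_corr, w, v, target):
    """Exact activation patching: one extra metric evaluation per edge."""
    h_clean = hidden(x_clean, w)
    h_corr = hidden(x_corr, w)
    base = metric(h_clean, v, target)
    effects = []
    for i in range(len(w)):
        patched = list(h_clean)
        patched[i] = h_corr[i]               # swap in one corrupted activation
        effects.append(metric(patched, v, target) - base)
    return effects


def prune(scores, keep):
    """Return the indices of the `keep` edges with the largest |score|."""
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return sorted(order[:keep])


if __name__ == "__main__":
    w = [1.0, 0.0, -2.0]   # input-to-hidden weights (illustrative values)
    v = [0.5, 1.0, 1.5]    # hidden-to-output weights (illustrative values)
    approx = attribution_scores(1.0, -1.0, w, v, target=0.0)
    exact = patching_effects(1.0, -1.0, w, v, target=0.0)
    print("approximate:", approx)
    print("exact:      ", exact)
    print("kept edges: ", prune(approx, keep=2))
```

In this toy setting the estimate needs a fixed number of passes regardless of how many edges there are, whereas exact patching needs one metric evaluation per edge. The estimate is only first-order, so it can drift from the exact effect when the clean and corrupted activations are far apart.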

Aaquib Syed, Can Rager, Arthur Conmy • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Circuit localization | IOI | CPR | 1.29 | 30 |
| Circuit localization | Mixing dataset IOI | CMD | 0.026 | 28 |
| Circuit localization | Mixing dataset All tasks | CMD | 0.042 | 28 |
| Circuit localization | Mixing dataset | CMD | 0.041 | 28 |
| Circuit localization | Mixing dataset All tasks 1.0 (test) | CPR | 0.956 | 28 |
| Circuit localization | Indirect Object Identification (IOI) 1.0 (test) | CPR | 0.984 | 28 |
| Circuit localization | Sequence Completion 1.0 (test) | CPR | 0.958 | 28 |
| Circuit localization | MCQA | CPR | 1.49 | 21 |
| Circuit localization | Mixing dataset Entity Binding | CMD | 0.026 | 18 |
| Circuit localization | Entity-binding 1.0 (test) | CPR | 0.981 | 18 |
Showing 10 of 35 rows
