
Attribution Patching Outperforms Automated Circuit Discovery

About

Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models. Existing automated circuit discovery work applies activation patching to identify subnetworks responsible for solving specific tasks (circuits). In this work, we show that a simple method based on attribution patching outperforms all existing methods while requiring just two forward passes and a backward pass. We apply a linear approximation to activation patching to estimate the importance of each edge in the computational subgraph. Using this approximation, we prune the least important edges of the network. We survey the performance and limitations of this method, finding that averaged over all tasks our method has greater AUC from circuit recovery than other methods.
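
To make the approximation concrete, here is a minimal sketch in pure Python. It is not the authors' implementation. It assumes a toy one-layer model with a hand-written gradient, and it scores hidden units rather than edges of a real computational graph. All names (`hidden`, `metric`, `attribution_scores`, `keep_top_k`) and all numeric values are hypothetical. The score for each unit is the difference between its corrupted and clean activation, multiplied by the gradient of the metric at the clean activation. This takes two forward passes and one backward pass, and it is compared against exact activation patching, which needs one extra evaluation per unit.

```python
import math

def hidden(W, x):
    """Pre-activations of a toy one-layer model: h = W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def metric(h, v):
    """Scalar task metric computed from the hidden activations."""
    return sum(vi * math.tanh(hi) for vi, hi in zip(v, h))

def metric_grad(h, v):
    """Analytic gradient of the metric w.r.t. each activation (stands in for a backward pass)."""
    return [vi * (1.0 - math.tanh(hi) ** 2) for vi, hi in zip(v, h)]

def attribution_scores(W, v, x_clean, x_corrupt):
    """Linear estimate of the patching effect: (h_corrupt - h_clean) * dMetric/dh at h_clean."""
    h_clean = hidden(W, x_clean)      # forward pass 1 (clean input)
    h_corrupt = hidden(W, x_corrupt)  # forward pass 2 (corrupted input)
    grad = metric_grad(h_clean, v)    # single backward pass on the clean run
    return [(hc - h) * g for hc, h, g in zip(h_corrupt, h_clean, grad)]

def activation_patching_scores(W, v, x_clean, x_corrupt):
    """Exact effect of patching one unit at a time: one metric evaluation per unit."""
    h_clean = hidden(W, x_clean)
    h_corrupt = hidden(W, x_corrupt)
    base = metric(h_clean, v)
    scores = []
    for i in range(len(h_clean)):
        patched = list(h_clean)
        patched[i] = h_corrupt[i]     # swap in the corrupted activation for unit i only
        scores.append(metric(patched, v) - base)
    return scores

def keep_top_k(scores, k):
    """Indices of the k units with the largest absolute score; the rest would be pruned."""
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return sorted(ranked[:k])

# Hypothetical toy values, chosen only for illustration.
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.4]]
v = [1.0, -2.0, 0.1]
x_clean = [1.0, 0.5]
x_corrupt = [1.1, 0.4]

approx = attribution_scores(W, v, x_clean, x_corrupt)
exact = activation_patching_scores(W, v, x_clean, x_corrupt)
print("attribution patching:", approx)
print("activation patching: ", exact)
print("units kept (k=2):    ", keep_top_k(approx, 2))
```

The linear estimate is only a first-order approximation, so it tracks exact patching closely here because the clean and corrupted activations are close. With larger differences or strongly saturating nonlinearities the two can diverge.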

Aaquib Syed, Can Rager, Arthur Conmy • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Indirect Object Identification | Mechanistic Interpretability Benchmark (MIB) Indirect Object Identification (IOI) (standard) | CMD | 0.00e+0 | 12 |
| Subject-Verb Agreement | Subject-Verb Agreement (SVA) (standard) | CMD | 0.01 | 12 |
| Circuit Discovery | InterpBench (test) | p-value (WMW) | 6.48e-4 | 10 |
| Circuit Discovery | InterpBench | Vargha-Delaney A12 | 0.111 | 10 |
| Multiple-choice Question Answering | Mechanistic Interpretability Benchmark (MIB) MCQA (standard) | CMD | 0.04 | 9 |
| Circuit localization | Mechanistic Interpretability Benchmark (MIB) | IOI | 1 | 9 |
| Circuit Discovery | IOI 200 examples v1 | KL Divergence | 3.47 | 3 |
| Circuit Discovery | IOI 400 examples v1 | KL Divergence | 3.66 | 3 |
| Circuit Discovery | IOI 100K examples v1 | KL Divergence | 3.78 | 2 |
