
Towards Automated Circuit Discovery for Mechanistic Interpretability

About

Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process they followed. First, researchers choose a metric and dataset that elicit the desired model behavior. Then, they apply activation patching to find which abstract neural network units are involved in the behavior. By varying the dataset, metric, and units under investigation, researchers can understand the functionality of each component. We automate one step of this process: identifying the circuit that implements the specified behavior in the model's computational graph. We propose several algorithms and reproduce previous interpretability results to validate them. For example, the ACDC algorithm rediscovered 5/5 of the component types in a circuit in GPT-2 Small that computes the Greater-Than operation. ACDC selected 68 of the 32,000 edges in GPT-2 Small, all of which were manually found by previous work. Our code is available at https://github.com/ArthurConmy/Automatic-Circuit-Discovery.

Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso • 2023
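
The circuit-identification step described in the abstract can be illustrated on a toy graph. The following is a minimal sketch, not the authors' implementation: the graph, the names `NODES`, `EDGES`, `run`, and `discover_circuit`, and the threshold value are all hypothetical, and the metric is assumed to be the absolute change in a scalar output rather than a divergence over a language model's predictions. It removes an edge by feeding the child the parent's activation from a corrupted input (activation patching) and drops the edge permanently when that barely changes the output.

```python
# Toy computational graph (hypothetical): each non-input node is the
# weighted sum of its parents. NODES is listed in topological order.
NODES = ["x0", "x1", "h0", "h1", "out"]
INPUTS = ("x0", "x1")
EDGES = {
    ("x0", "h0"): 1.0,
    ("x1", "h0"): 0.0,    # zero weight: carries no information
    ("x0", "h1"): 0.5,
    ("x1", "h1"): 0.5,
    ("h0", "out"): 2.0,
    ("h1", "out"): 0.01,  # tiny weight: h1 barely affects the output
}


def run(inputs, kept=None, corrupted_cache=None):
    """Forward pass over the toy graph.

    Edges missing from `kept` are patched: the child reads the parent's
    activation from `corrupted_cache` instead of the current run.
    With kept=None every edge is active.
    """
    values = dict(inputs)
    for node in NODES:
        if node in INPUTS:
            continue
        total = 0.0
        for (parent, child), weight in EDGES.items():
            if child != node:
                continue
            if kept is None or (parent, child) in kept:
                source = values[parent]
            else:
                source = corrupted_cache[parent]
            total += weight * source
        values[node] = total
    return values


def discover_circuit(clean, corrupted, threshold):
    """Greedily prune edges whose removal changes the output by < threshold."""
    clean_out = run(clean)["out"]
    corrupted_cache = run(corrupted)  # activations used for patching

    def damage(edge_set):
        patched_out = run(clean, edge_set, corrupted_cache)["out"]
        return abs(patched_out - clean_out)

    kept = set(EDGES)
    # Visit edges from the output back towards the inputs.
    order = sorted(EDGES, key=lambda e: NODES.index(e[1]), reverse=True)
    for edge in order:
        candidate = kept - {edge}
        if damage(candidate) - damage(kept) < threshold:
            kept = candidate  # edge is unimportant: remove it for good
    return kept


clean = {"x0": 1.0, "x1": 2.0}
corrupted = {"x0": -1.0, "x1": 0.0}
circuit = discover_circuit(clean, corrupted, threshold=0.1)
print(sorted(circuit))  # [('h0', 'out'), ('x0', 'h0')]
```

In this toy case only the path x0 → h0 → out survives. The edges into h1 have nonzero weights, but they are pruned because the edge from h1 to the output was already removed, so patching them no longer changes the output. Lowering the threshold keeps more edges.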

Related benchmarks

Task | Dataset | Metric | Result | Rank
Indirect Object Identification | IOI | Last-Token KL Divergence | 0.64 | 40
Doc-String Prediction | Doc-String | Last-Token KL Divergence | 0.36 | 40
Circuit localization | IOI | CPR | 2.3 | 30
Circuit localization | Sequence Completion 1.0 (test) | CPR | 0.906 | 28
Circuit localization | Mixing dataset | CMD | 0.093 | 28
Circuit localization | Mixing dataset All tasks 1.0 (test) | CPR | 0.692 | 28
Circuit localization | Indirect Object Identification (IOI) 1.0 (test) | CPR | 0.87 | 28
Circuit localization | Mixing dataset All tasks | CMD | 0.307 | 28
Circuit localization | Mixing dataset IOI | CMD | 0.129 | 28
Circuit localization | MCQA | CPR | 0.85 | 21
Showing 10 of 24 rows
