Finding Interpretable Prompt-Specific Circuits in Language Models
About
Understanding the internal circuits that language models use to solve tasks remains a central challenge in mechanistic interpretability. A crucial part of finding circuits is understanding why each attention head attends where it does. To this end, we introduce ACC++, an improved circuit-tracing method based on the principle of attention-causal communication (ACC) [1], which identifies signals, i.e., the contents of low-dimensional subspaces that cause attention on a token pair. ACC++ extracts circuits from a single forward pass, without replacement models or patching. Circuits identified by ACC++ consist of components that are causal for the model's attention decisions, together with the low-dimensional signals used to communicate between them. Here, we first detail the conceptual advances that ACC++ makes over previous work. We then show that, across multiple models, a substantial portion of ACC++ signals are interpretable: many signals admit a short natural-language description. We next present a number of new insights into model behavior obtained via ACC++. First, we use ACC++'s interpretable circuits to characterize the sensitivity of indirect object identification (IOI) circuits to prompt structure. We find that prompt-specific circuits form well-defined clusters and that, across clusters, heads receive systematically different signals corresponding to distinct mechanisms for identifying the IO name. Next, in multilingual IOI, ACC++ circuits show that while model components are reused across languages, signals are often language-specific. In a four-language IOI case study, cross-language circuit distances are consistent with linguistic relatedness. Together, these results show that ACC++ can shed light on a broad spectrum of model behaviors.
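To make the notion of a "signal" concrete, the following is a minimal sketch of how an attention score on a single token pair can be split into contributions from one-dimensional subspaces. It assumes the decomposition is a singular value decomposition of the head's combined query-key matrix. That choice, the function name `score_contributions`, and the random toy weights are all illustrative assumptions, not a description of the ACC++ procedure. Scaling, biases, and layer normalization are ignored.

```python
import numpy as np

def score_contributions(x_dst, x_src, W_Q, W_K):
    """Split the pre-softmax score x_dst^T W_Q W_K^T x_src into rank-1 terms.

    Hypothetical helper: term k is s_k * (x_dst . u_k) * (v_k . x_src),
    where u_k, s_k, v_k come from the SVD of W_Q W_K^T.
    """
    omega = W_Q @ W_K.T                      # (d_model, d_model), rank <= d_head
    U, S, Vt = np.linalg.svd(omega)
    r = W_Q.shape[1]                         # only the first d_head terms are nonzero
    return S[:r] * (x_dst @ U[:, :r]) * (Vt[:r] @ x_src)

# Toy dimensions and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
d_model, d_head = 16, 4
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
x_dst = rng.normal(size=d_model)             # residual at the attending token
x_src = rng.normal(size=d_model)             # residual at the attended-to token

contrib = score_contributions(x_dst, x_src, W_Q, W_K)
full_score = (x_dst @ W_Q) @ (W_K.T @ x_src)

# The per-subspace terms add up to the full score.
assert np.isclose(contrib.sum(), full_score)

# Rank the subspaces by how much each contributes to this token pair's score.
order = np.argsort(-np.abs(contrib))
print(order, contrib[order])
```

In this picture, a signal would correspond to the few terms that account for most of the score on a given token pair; which terms dominate depends on the pair, which is one way to read "prompt-specific" in the abstract.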
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Circuit localization | Mixing dataset All tasks 1.0 (test) | CPR | 1.026 | 28 |
| Circuit localization | Mixing dataset All tasks | CMD | 0.012 | 28 |
| Circuit localization | Mixing dataset IOI | CMD | 0.022 | 28 |
| Circuit localization | Indirect Object Identification (IOI) 1.0 (test) | CPR | 1.015 | 28 |
| Circuit localization | Mixing dataset | CMD | 0.052 | 28 |
| Circuit localization | Sequence Completion 1.0 (test) | CPR | 0.958 | 28 |
| Circuit localization | Entity-binding 1.0 (test) | CPR | 1.112 | 18 |
| Circuit localization | Mixing dataset Entity Binding | CMD | 0.017 | 18 |
| Circuit localization | Arithmetic 1.0 (test) | CPR | 1.017 | 9 |
| Circuit localization | Mixing dataset Arithmetic | CMD | 0.2 | 9 |