Correcting Gradient-Based Circuit Localization via Interaction-Aware Backpropagation

About

Circuit localization methods aim to identify the subset of model components responsible for specific behaviors in large language models, enabling detailed mechanistic analysis. Most existing methods assume components act independently and estimate importance by perturbing each component in isolation. However, components in neural networks interact, and ignoring these interactions leads to systematic misestimation of component importance. We find that one particularly problematic interaction is attention self-repair, in which softmax redistribution causes gradients for influential attention scores to vanish as other positions with similar values compensate. We introduce Gradient Interaction Modifications (GIM), a technique that explicitly accounts for feature interactions during backpropagation. GIM achieves state-of-the-art performance on the circuit localization track of the Mechanistic Interpretability Benchmark and outperforms existing gradient-based methods on feature attribution across diverse tasks. By accounting for interaction effects and explaining why prior methods underestimate component importance, GIM enables more faithful mechanistic analysis of large language models. GIM is available as a Python package at https://github.com/corticph/gim.
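The attention self-repair effect described above can be illustrated with a toy example. The sketch below is not GIM and does not use the gim package; it is a minimal, assumed setup (one query, three key positions, scalar values, made-up numbers) showing that when two positions with the same value both receive high attention scores, the gradient with respect to either score is close to zero, even though removing both positions together changes the output substantially.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_output(scores, values):
    """Scalar attention output: softmax-weighted sum of scalar values."""
    probs = softmax(scores)
    return sum(p * v for p, v in zip(probs, values))

def score_gradients(scores, values):
    """Analytic gradient d(output)/d(score_j) = p_j * (v_j - output)."""
    probs = softmax(scores)
    out = sum(p * v for p, v in zip(probs, values))
    return [p * (v - out) for p, v in zip(probs, values)]

# Hypothetical toy numbers: positions 0 and 1 carry the same value and both
# get a high score; position 2 carries a different value and a low score.
scores = [4.0, 4.0, 0.0]
values = [1.0, 1.0, -1.0]

MASK = -1e9  # a very negative score effectively removes a position

baseline = attention_output(scores, values)
grads = score_gradients(scores, values)

# Remove position 0 only: position 1 absorbs its attention mass.
ablate_one = attention_output([MASK, 4.0, 0.0], values)
# Remove positions 0 and 1 together: nothing is left to compensate.
ablate_both = attention_output([MASK, MASK, 0.0], values)

print("gradient w.r.t. score 0:", grads[0])
print("effect of removing position 0:", baseline - ablate_one)
print("effect of removing positions 0 and 1:", baseline - ablate_both)
```

In this toy setting the gradient and the single-position ablation both suggest that position 0 barely matters, while the joint ablation shows that positions 0 and 1 together determine the output. This is the kind of underestimation that, per the abstract, independent per-component perturbation produces.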

Joakim Edin, Casper L. Christensen, Róbert Csordás, Tuukka Ruotsalo, Zhengxuan Wu, Maria Maistro, Jing Huang, Lars Maaløe • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Feature Attribution | BoolQ | Comprehensiveness | 72 | 33 |
| Feature Attribution | Movie | Comprehensiveness | 83 | 33 |
| Feature Attribution | Twitter | Comprehensiveness | 76 | 33 |
| Feature Attribution | FEVER | Comprehensiveness | 0.75 | 33 |
| Feature Attribution | HateXplain | Comprehensiveness | 80 | 33 |
| Feature Attribution | SciFact | Comprehensiveness | 74 | 33 |
| Circuit localization | IOI | CPR | 3.54 | 30 |
| Circuit localization | MCQA | CPR | 2.52 | 21 |
| Circuit localization | ARC Easy | CPR | 2.36 | 12 |
| Circuit localization | General | CPR | 2.13 | 6 |
Showing 10 of 12 rows
