DePass: Unified Feature Attributing by Simple Decomposed Forward Pass
About
Attributing the behavior of Transformer models to their internal computations is a central challenge in mechanistic interpretability. We introduce DePass, a unified framework for feature attribution based on a single decomposed forward pass. DePass decomposes hidden states into customized additive components, then propagates them through the model with attention scores and MLP activations held fixed. It achieves faithful, fine-grained attribution without requiring auxiliary training. We validate DePass across token-level, model component-level, and subspace-level attribution tasks, demonstrating its effectiveness and fidelity. Our experiments highlight its potential to attribute information flow between arbitrary components of a Transformer model. We hope DePass serves as a foundational tool for broader applications in interpretability.
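The key property behind the decomposed forward pass is that, once attention scores are computed from the full hidden state and frozen, the attention operation is linear in its value input, so an additive decomposition of the hidden states propagates exactly. The following is a minimal NumPy sketch of this idea for a single attention head; all weight names, shapes, and the random decomposition are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d, k = 4, 8, 3  # sequence length, hidden size, number of components

# Illustrative weights for a single attention head.
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) for _ in range(4))

# Decompose the hidden states into k additive components: H = sum_i C[i].
# Here the split is random; DePass uses customized components.
H = rng.normal(size=(seq, d))
C = rng.dirichlet(np.ones(k), size=(seq, d)).transpose(2, 0, 1) * H  # (k, seq, d)
assert np.allclose(C.sum(axis=0), H)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Attention scores are computed once from the FULL hidden state, then frozen.
A = softmax((H @ W_q) @ (H @ W_k).T / np.sqrt(d))

# With A fixed, attention is linear in the value input, so each component
# can be propagated independently.
out_full = A @ (H @ W_v) @ W_o
out_parts = np.stack([A @ (c @ W_v) @ W_o for c in C])

# The component outputs sum exactly to the full output: the decomposition
# survives the layer without approximation.
assert np.allclose(out_parts.sum(axis=0), out_full)
```

The same argument applies to the MLP once its elementwise activations are frozen: the nonlinearity becomes a fixed diagonal scaling, leaving only linear maps through which the components propagate additively.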
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Token Attribution Faithfulness | Known 1000 | Distance | 6.24 | 40 |
| Token Attribution Faithfulness | SQuAD v2.0 | Disagreement | 13.74 | 30 |
| Sentiment Analysis | IMDB | Dis. Score | 85.49 | 10 |
| Factual Knowledge | Known 1000 | Disagreement Rate | 9.17 | 10 |
| Reading Comprehension | SQuAD v2.0 | Disambiguation Score | 19.29 | 10 |
| Token Attribution Faithfulness | IMDB | Distance | 68.19 | 10 |