DePass: Unified Feature Attributing by Simple Decomposed Forward Pass

About

Attributing the behavior of Transformer models to internal computations is a central challenge in mechanistic interpretability. We introduce DePass, a unified framework for feature attribution based on a single decomposed forward pass. DePass decomposes hidden states into customized additive components, then propagates them through the model with the attention scores and MLP activations held fixed. It achieves faithful, fine-grained attribution without requiring auxiliary training. We validate DePass on token-level, model-component-level, and subspace-level attribution tasks, demonstrating its effectiveness and fidelity. Our experiments highlight its potential to attribute information flow between arbitrary components of a Transformer model. We hope DePass serves as a foundational tool for broader applications in interpretability.
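
The idea of propagating additive components with attention scores and activations held fixed can be illustrated with a toy example. The sketch below is not the authors' implementation. It assumes a single attention-like mixing step with a given score matrix and a bias-free ReLU MLP whose on/off gate is frozen from the full forward pass, and it omits residual connections, layer normalization, and biases. The paper's actual treatment of MLP neurons may differ. All names (`full_forward`, `decomposed_forward`, `A`, `W1`, `W2`) are hypothetical.

```python
def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def mix_tokens(A, X):
    """Attention-like mixing with FIXED scores: out[t] = sum_s A[t][s] * X[s]."""
    dim = len(X[0])
    return [[sum(A[t][s] * X[s][j] for s in range(len(X))) for j in range(dim)]
            for t in range(len(A))]


def gated_mlp(W1, W2, gate, x):
    """Bias-free MLP with a frozen 0/1 gate in place of ReLU (linear in x)."""
    hidden = [g * h for g, h in zip(gate, matvec(W1, x))]
    return matvec(W2, hidden)


def full_forward(X, A, W1, W2):
    """Ordinary forward pass; also records the ReLU gate at every position."""
    mixed = mix_tokens(A, X)
    gates = [[1.0 if h > 0 else 0.0 for h in matvec(W1, m)] for m in mixed]
    out = [gated_mlp(W1, W2, g, m) for g, m in zip(gates, mixed)]
    return out, gates


def decomposed_forward(X, A, W1, W2, gates):
    """Propagate one component per input token, reusing the frozen scores and gates.

    Returns parts[k][t]: the share of the output at position t attributed to token k.
    """
    n, dim = len(X), len(X[0])
    parts = []
    for k in range(n):
        # Component k holds token k's hidden state at position k and zeros elsewhere.
        comp = [X[t] if t == k else [0.0] * dim for t in range(n)]
        mixed = mix_tokens(A, comp)
        parts.append([gated_mlp(W1, W2, gates[t], mixed[t]) for t in range(n)])
    return parts


# Toy inputs (arbitrary values chosen for illustration).
X = [[1.0, 2.0], [3.0, -1.0]]                  # two tokens, hidden size 2
A = [[1.0, 0.0], [0.5, 0.5]]                   # fixed causal attention scores
W1 = [[1.0, -1.0], [0.5, 1.0], [-1.0, 0.0]]    # MLP input projection (3 x 2)
W2 = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]       # MLP output projection (2 x 3)

out, gates = full_forward(X, A, W1, W2)
parts = decomposed_forward(X, A, W1, W2, gates)

# Summing the per-token components at each position recovers the full output.
recombined = [[sum(parts[k][t][j] for k in range(len(X))) for j in range(len(out[t]))]
              for t in range(len(out))]
```

Because the scores and gates are frozen, every step is linear in the hidden state, so `recombined` matches `out` and `parts[k][t]` can be read as token `k`'s contribution to the output at position `t`.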

Xiangyu Hong, Che Jiang, Kai Tian, Biqing Qi, Youbang Sun, Ning Ding, Bowen Zhou • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Token Attribution Faithfulness | Known 1000 | Distance 6.24 | 40 |
| Token Attribution Faithfulness | SQuAD v2.0 | Disagreement 13.74 | 30 |
| Sentiment Analysis | IMDB | Dis. Score 85.49 | 10 |
| Factual Knowledge | Known 1000 | Disagreement Rate 9.17 | 10 |
| Reading Comprehension | SQuAD v2.0 | Disambiguation Score 19.29 | 10 |
| Token Attribution Faithfulness | IMDB | Distance 68.19 | 10 |