Prune, Interpret, Evaluate: A Cross-Layer Transcoder-Native Framework for Efficient Circuit Discovery via Feature Attribution
About
Existing feature-interpretation pipelines typically operate on uniformly sampled units or exhaustive feature sets, incurring massive costs on units irrelevant to the target behavior. To address this, we introduce PIE, the first cross-layer transcoder (CLT)-native end-to-end pruning framework, which pioneers the paradigm of pruning first and interpreting later. PIE connects Pruning, automatic Interpretation, and interpretation Evaluation, establishing a comprehensive benchmarking environment to systematically measure behavioral fidelity and downstream interpretability under pruning. Within this framework, we adapt strong relevance baselines and propose Feature Attribution Patching (FAP), a patch-grounded attribution method that scores CLT features by aggregating gradient-weighted write contributions. Furthermore, we introduce FAP-Synergy, a systematic synergy-aware reranking procedure. We evaluate pruning by behavior retention, measured with KL divergence, and assess interpretation quality with FADE-style metrics on the Indirect Object Identification (IOI) and Doc-String datasets. Across feature budgets K ∈ {50, 100, 200, 400, 800}, our benchmarking reveals distinct operational regimes: base FAP and the adapted baselines perform robustly at relaxed budgets, whereas FAP-Synergy excels in tightly constrained, strict-budget regimes. Crucially, we demonstrate a practical "Effective Budget" advantage: on the IOI task, for both Llama-3.2-1B and Gemma-2-2B, FAP-Synergy at K=50 matches the behavioral fidelity of baseline circuits at K=75. Because downstream evaluation cost scales linearly with the number of features, FAP-Synergy effectively grants the pipeline 25 "free" features, achieving K=75 fidelity while reducing interpretation cost by 33% (25 of 75 features).
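The two quantities the abstract relies on, last-token KL divergence and the "Effective Budget" saving, can be illustrated with a minimal sketch. The function names `last_token_kl` and `interpretation_saving` are hypothetical, and the KL direction (full model relative to pruned model) is an assumption, since the abstract does not state it. The sketch is not the PIE implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def last_token_kl(full_logits, pruned_logits):
    """KL(P_full || P_pruned) over the vocabulary at the final position.

    Assumption: the full model is the reference distribution; the
    source only says "KL-divergence behavior retention".
    """
    log_p = log_softmax(full_logits)
    log_q = log_softmax(pruned_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def interpretation_saving(k_used, k_matched):
    """Fraction of per-feature interpretation cost saved when a circuit of
    k_used features matches the fidelity of a k_matched-feature baseline,
    assuming cost scales linearly with the number of features."""
    return (k_matched - k_used) / k_matched


if __name__ == "__main__":
    # Toy last-token logits over a 4-token vocabulary (made-up values).
    full = [2.0, 0.5, -1.0, 0.0]
    pruned = [1.8, 0.7, -0.9, 0.1]
    print(f"last-token KL: {last_token_kl(full, pruned):.4f}")
    # K=50 matching a K=75 baseline saves 25 of 75 features.
    print(f"saving: {interpretation_saving(50, 75):.1%}")
```

A lower KL value means the pruned circuit reproduces the full model's next-token distribution more closely. The saving is the ratio of skipped features to the matched baseline budget, which gives the 33% quoted for K=50 versus K=75.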
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Doc-String Prediction | Doc-String | Last-Token KL Divergence | 0.12 | 40 |
| Indirect Object Identification | IOI | Last-Token KL Divergence | 0.15 | 40 |
| Efficient Circuit Discovery | Llama-3.2-1B | SCE (Efficiency) | 9.48e+5 | 4 |
| Efficient Circuit Discovery | Gemma-2-2B | SCE (Efficiency) | 1.62e+6 | 4 |
| Circuit Interpretability Quality | Llama-3.2-1B circuits | Clarity | 0.627 | 3 |
| Circuit Interpretability Quality | Gemma-2-2B circuits | Clarity | 0.635 | 3 |
| Semantic Efficiency Analysis | Llama-3.2-1B | Semantic Efficiency Score (SCE) | 9.48e+5 | 3 |
| Semantic Efficiency Analysis | Gemma-2-2B | SCE | 1.62e+6 | 3 |