Prune, Interpret, Evaluate: A Cross-Layer Transcoder-Native Framework for Efficient Circuit Discovery via Feature Attribution

About

Existing feature-interpretation pipelines typically operate on uniformly sampled units or exhaustive feature sets, incurring massive costs on units irrelevant to the target behavior. To address this, we introduce PIE, the first end-to-end pruning framework native to cross-layer transcoders (CLTs), which pioneers the paradigm of pruning first and interpreting later. PIE connects Pruning, automatic Interpretation, and interpretation Evaluation, establishing a comprehensive benchmarking environment to systematically measure behavioral fidelity and downstream interpretability under pruning. Within this framework, we adapt strong relevance baselines and propose Feature Attribution Patching (FAP), a patch-grounded attribution method that scores CLT features by aggregating gradient-weighted write contributions. Furthermore, we introduce FAP-Synergy, a systematic synergy-aware reranking procedure. We evaluate pruning using KL-divergence behavior retention and assess interpretation quality with FADE-style metrics on the IOI and Doc-String datasets. Across feature budgets K in {50, 100, 200, 400, 800}, our benchmarking reveals distinct operational regimes: base FAP and the adapted baselines perform robustly at relaxed budgets, while FAP-Synergy excels in strict-budget regimes. Crucially, we demonstrate a practical "Effective Budget" advantage: on the IOI task for both Llama-3.2-1B and Gemma-2-2B, FAP-Synergy at K=50 matches the behavioral fidelity of baseline circuits at K=75. Because downstream evaluation costs scale linearly with the number of features, FAP-Synergy effectively grants the pipeline 25 "free" features, achieving K=75 fidelity while reducing interpretation costs by 33% (25 of 75 features).
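The "Effective Budget" arithmetic in the abstract can be checked with a short sketch. It assumes, as the abstract states, that interpretation cost is linear in the number of features. The function name `effective_budget_gain` and its parameters are hypothetical and are not taken from the paper's code.

```python
def effective_budget_gain(k_used: int, k_matched: int) -> tuple[int, float]:
    """Return (free_features, fractional_cost_saving).

    k_used:    features actually interpreted (e.g. FAP-Synergy at K=50)
    k_matched: baseline budget whose fidelity is matched (e.g. K=75)
    Assumes interpretation cost scales linearly with the feature count.
    """
    if not 0 < k_used <= k_matched:
        raise ValueError("expected 0 < k_used <= k_matched")
    free_features = k_matched - k_used
    # Saving is measured relative to the cost of the matched baseline budget.
    saving = free_features / k_matched
    return free_features, saving


free, saving = effective_budget_gain(k_used=50, k_matched=75)
print(f"{free} free features, {saving:.0%} lower interpretation cost")
```

With the abstract's numbers this gives 25 free features and a saving of 25/75, which is about 33%.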

Qinhao Chen, Linyang He, Nima Mesgarani • 2026

Related benchmarks

Task	Dataset	Result
Doc-String Prediction	Doc-String	Last-Token KL Divergence: 0.12	40
Indirect Object Identification	IOI	Last-Token KL Divergence: 0.15	40
Efficient Circuit Discovery	Llama-3.2-1B	SCE (Efficiency): 9.48e+5	4
Efficient Circuit Discovery	Gemma-2-2B	SCE (Efficiency): 1.62e+6	4
Circuit Interpretability Quality	Llama-3.2-1B circuits	Clarity: 0.627	3
Circuit Interpretability Quality	Gemma-2-2B circuits	Clarity: 0.635	3
Semantic Efficiency Analysis	Llama-3.2-1B	Semantic Efficiency Score (SCE): 9.48e+5	3
Semantic Efficiency Analysis	Gemma-2-2B	SCE: 1.62e+6	3
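
For readers unfamiliar with the Last-Token KL Divergence metric listed above, the following is a minimal sketch of one common way to compute it: compare the next-token distribution of the full model with that of the pruned model at the final position. The direction KL(full || pruned), the helper names, and the toy logits are assumptions made for illustration. The paper's exact definition may differ.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def last_token_kl(full_logits, pruned_logits):
    """KL(full || pruned) at the final sequence position.

    Each argument is a list of per-position logit lists (seq_len x vocab).
    Only the last position is compared.
    """
    log_p = log_softmax(full_logits[-1])    # reference (unpruned) model
    log_q = log_softmax(pruned_logits[-1])  # model restricted to K features
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy logits: 2 positions, vocabulary of 3 tokens (illustrative values only).
full = [[0.1, 0.2, 0.3], [2.0, 0.5, -1.0]]
pruned = [[0.0, 0.0, 0.0], [1.5, 0.7, -0.8]]

print(last_token_kl(full, full))    # identical logits give 0.0
print(last_token_kl(full, pruned))  # a small positive value
```

Lower values mean the pruned circuit reproduces the full model's final-token behavior more closely.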
