Prune, Interpret, Evaluate: A Cross-Layer Transcoder-Native Framework for Efficient Circuit Discovery via Feature Attribution

About

Existing feature-interpretation pipelines typically operate on uniformly sampled units or exhaustive feature sets, incurring massive costs on units irrelevant to target behaviors. To address this, we introduce the first CLT-native end-to-end pruning framework, PIE, which pioneers the paradigm of pruning first and interpreting later. PIE connects Pruning, automatic Interpretation, and interpretation Evaluation, establishing a comprehensive benchmarking environment to systematically measure behavioral fidelity and downstream interpretability under pruning. Within this framework, we adapt strong relevance baselines and propose Feature Attribution Patching (FAP), a patch-grounded attribution method that scores CLT features by aggregating gradient-weighted write contributions. Furthermore, we introduce FAP-Synergy, a systematic synergy-aware reranking procedure. We evaluate pruning using KL-divergence behavior retention and assess interpretation quality with FADE-style metrics across IOI and Doc-String datasets. Across budget constraints of K in {50, 100, 200, 400, 800}, our rigorous benchmarking reveals distinct operational regimes: while base FAP and adapted baselines perform robustly at relaxed budgets, FAP-Synergy excels in highly constrained, strict-budget regimes. Crucially, we demonstrate a practical "Effective Budget" advantage: on the IOI task for both Llama-3.2-1B and Gemma-2-2B, FAP-Synergy at K=50 functionally matches the behavioral fidelity of baseline circuits at K=75. Because downstream evaluation costs scale linearly per feature, Synergy effectively grants the pipeline 25 "free" features, achieving K=75 fidelity while reducing interpretation costs by 33%.
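The two quantities the abstract relies on, last-token KL divergence and the "Effective Budget" saving, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the KL direction (full model relative to pruned model), and the use of raw final-position logits as input are all assumptions made for illustration.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one vocabulary-sized list of logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def last_token_kl(full_logits, pruned_logits):
    # KL(full || pruned) at the final token position.
    # Assumption: the direction and the choice of inputs are illustrative,
    # not taken from the paper.
    log_p = log_softmax(full_logits)
    log_q = log_softmax(pruned_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def interpretation_cost_saving(k_used, k_matched):
    # Fraction of per-feature interpretation cost saved when a circuit of
    # k_used features matches the fidelity of a k_matched-feature baseline,
    # assuming cost scales linearly with the number of features.
    return (k_matched - k_used) / k_matched

# Hypothetical helper applied to the abstract's figures: K=50 matching K=75.
saving = interpretation_cost_saving(50, 75)
print(f"{saving:.0%}")  # 33%
```

With the abstract's numbers, the 25 features that need no interpretation are 25/75 of the baseline cost, which is where the 33% figure comes from.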

Qinhao Chen, Linyang He, Nima Mesgarani • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Doc-String Prediction | Doc-String | Last-Token KL Divergence | 0.12 | 40 |
| Indirect Object Identification | IOI | Last-Token KL Divergence | 0.15 | 40 |
| Efficient Circuit Discovery | LLaMA 1B 3.2 | SCE (Efficiency) | 9.48e+5 | 4 |
| Efficient Circuit Discovery | Gemma-2-2B | SCE (Efficiency) | 1.62e+6 | 4 |
| Circuit Interpretability Quality | Llama-3.2-1B circuits | Clarity | 0.627 | 3 |
| Circuit Interpretability Quality | Gemma-2-2B circuits | Clarity | 0.635 | 3 |
| Semantic Efficiency Analysis | LLaMA 1B 3.2 | Semantic Efficiency Score (SCE) | 9.48e+5 | 3 |
| Semantic Efficiency Analysis | Gemma-2-2B | SCE | 1.62e+6 | 3 |
