
Mechanistic Permutability: Match Features Across Layers

About

Understanding how features evolve across layers in deep neural networks is a fundamental challenge in mechanistic interpretability, particularly due to polysemanticity and feature superposition. While Sparse Autoencoders (SAEs) have been used to extract interpretable features from individual layers, aligning these features across layers has remained an open problem. In this paper, we introduce SAE Match, a novel, data-free method for aligning SAE features across different layers of a neural network. Our approach involves matching features by minimizing the mean squared error between the folded parameters of SAEs, a technique that incorporates activation thresholds into the encoder and decoder weights to account for differences in feature scales. Through extensive experiments on the Gemma 2 language model, we demonstrate that our method effectively captures feature evolution across layers, improving feature matching quality. We also show that features persist over several layers and that our approach can approximate hidden states across layers. Our work advances the understanding of feature dynamics in neural networks and provides a new tool for mechanistic interpretability studies.

Nikita Balagansky, Ian Maksimov, Daniil Gavrilov • 2024
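
To make the matching idea in the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes a JumpReLU-style SAE with positive per-feature thresholds, folds each threshold into the weights by dividing the encoder column and bias by it and multiplying the decoder row by it, and then matches features between two layers by solving a linear assignment problem on squared distances between folded decoder rows. The function names, tensor shapes, toy data, and the choice to match on decoder rows only are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def fold_thresholds(w_enc, b_enc, w_dec, theta):
    """Fold positive per-feature thresholds into the SAE weights.

    Shapes (assumed): w_enc (d_model, n_features), b_enc (n_features,),
    w_dec (n_features, d_model), theta (n_features,).
    After folding, every feature fires at a threshold of 1.
    """
    return w_enc / theta, b_enc / theta, w_dec * theta[:, None]


def reconstruct(x, w_enc, b_enc, w_dec, theta):
    """JumpReLU-style SAE forward pass (decoder bias omitted for brevity)."""
    pre = x @ w_enc + b_enc
    acts = pre * (pre > theta)  # keep a pre-activation only above its threshold
    return acts @ w_dec


def match_features(w_dec_a, w_dec_b):
    """Permutation minimizing the summed squared error between decoder rows.

    Returns perm such that row perm[i] of w_dec_b is matched to row i of w_dec_a.
    """
    # cost[i, j] = ||a_i - b_j||^2, expanded to avoid an (n, n, d) intermediate.
    cost = (
        (w_dec_a ** 2).sum(axis=1)[:, None]
        + (w_dec_b ** 2).sum(axis=1)[None, :]
        - 2.0 * w_dec_a @ w_dec_b.T
    )
    _, perm = linear_sum_assignment(cost)
    return perm


# Toy data (hypothetical sizes; real SAEs have thousands of features).
rng = np.random.default_rng(0)
n_features, d_model = 6, 4
w_enc = rng.normal(size=(d_model, n_features))
b_enc = rng.normal(size=n_features)
w_dec_a = rng.normal(size=(n_features, d_model))
theta = rng.uniform(0.5, 2.0, size=n_features)

# Folding should leave the reconstruction unchanged.
x = rng.normal(size=(3, d_model))
w_enc_f, b_enc_f, w_dec_f = fold_thresholds(w_enc, b_enc, w_dec_a, theta)
recon_raw = reconstruct(x, w_enc, b_enc, w_dec_a, theta)
recon_folded = reconstruct(x, w_enc_f, b_enc_f, w_dec_f, np.ones(n_features))

# "Layer B" is layer A with shuffled feature order plus small noise.
true_perm = rng.permutation(n_features)
w_dec_b = w_dec_f[true_perm] + 0.01 * rng.normal(size=(n_features, d_model))
perm = match_features(w_dec_f, w_dec_b)
```

The sketch is data-free in the sense the abstract describes: it touches only SAE parameters and never runs the underlying model. Folding matters because two features with the same direction but different thresholds would otherwise look far apart under a squared-error cost.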

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Feature Matching | GPT2 Layer 5 match with Layer 11 | LLM Eval | 1.49 | 6 |
| Feature Matching | GPT2 Layer 0 match with Layer 11 | LLM Eval Score | 1.27 | 6 |
| Feature Matching | Gemma-2-2B Layer 12 match with Layer 25 | LLM Evaluation Score | 1.26 | 6 |
| Feature Matching | Gemma-2-2B Layer 0 match with Layer 25 | LLM Eval | 1.21 | 6 |
| Circuit Compression | Gemma-2-2B Digit Addition | Accuracy | 55.63 | 5 |
| Circuit Compression | GPT2-small Digit Addition | Accuracy | 55.55 | 5 |
| Feature Matching | GPT2 Layer 5 match with Layer 6 | LLM Eval | 2.25 | 4 |
| Feature Matching | Gemma-2-2B Layer 12 match with Layer 13 | LLM Eval | 1.41 | 4 |
