
Analyze Feature Flow to Enhance Interpretation and Steering in Language Models

About

We introduce a new approach to systematically map features discovered by sparse autoencoders across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.

Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov • 2025
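
The two ideas in the abstract, data-free cosine matching of features across layers and steering by amplifying or suppressing a feature, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each sparse autoencoder exposes a decoder weight matrix whose rows are feature directions in the residual stream, and that steering means adding a scaled feature direction to a hidden state. The names `match_features`, `steer`, `dec_a`, `dec_b`, and `alpha` are hypothetical.

```python
import numpy as np


def match_features(dec_a, dec_b):
    """For each feature direction (row) in dec_a, find the most
    cosine-similar row in dec_b. No input data is needed, only weights.

    Assumption: rows of each matrix are SAE decoder directions that live
    in the same hidden space, one matrix per layer.
    """
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    sim = a @ b.T                      # pairwise cosine similarities
    best = sim.argmax(axis=1)          # index of the best match in dec_b
    scores = sim[np.arange(len(a)), best]
    return best, scores


def steer(hidden, direction, alpha):
    """Shift a hidden state along a unit-normalized feature direction.

    Assumption: positive alpha amplifies the feature, negative alpha
    suppresses it. The scale of alpha is model dependent.
    """
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit


# Toy example: three features per layer in a 3-dimensional hidden space.
layer_a = np.eye(3)
layer_b = np.array([[0.0, 2.0, 0.0],
                    [0.0, 0.0, 3.0],
                    [1.0, 0.0, 0.0]])
matches, scores = match_features(layer_a, layer_b)
steered = steer(np.zeros(3), layer_b[0], alpha=4.0)
```

A low best-match score for a feature would suggest that it has no clear counterpart in the other layer, which is one way to read "first appear" in the abstract. Any threshold for that decision is a choice this sketch leaves open.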

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Feature Matching | GPT2 Layer 5 match with Layer 11 | LLM Eval | 1.49 | 6 |
| Feature Matching | GPT2 Layer 0 match with Layer 11 | LLM Eval Score | 1.34 | 6 |
| Feature Matching | Gemma-2-2B Layer 12 match with Layer 25 | LLM Evaluation Score | 1.26 | 6 |
| Feature Matching | Gemma-2-2B Layer 0 match with Layer 25 | LLM Eval | 1.23 | 6 |
| Circuit Compression | GPT2-small Digit Addition | Accuracy | 66.55 | 5 |
| Circuit Compression | Gemma-2-2B Digit Addition | Accuracy | 53.49 | 5 |
| Feature Matching | GPT2 Layer 5 match with Layer 6 | LLM Eval | 2.46 | 4 |
| Feature Matching | Gemma-2-2B Layer 12 match with Layer 13 | LLM Eval | 1.51 | 4 |