Analyze Feature Flow to Enhance Interpretation and Steering in Language Models
About
We introduce a new approach to systematically map features discovered by a sparse autoencoder across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.
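The two ideas in the abstract, data-free cosine matching and feature steering, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each layer's features are represented by the rows of a sparse autoencoder decoder matrix, and that both layers share the same hidden dimension. The function names `match_features` and `steer`, the toy matrices, and the scale `alpha` are all hypothetical.

```python
import numpy as np

def match_features(dec_a, dec_b):
    """Match each feature of layer A to its most similar feature in layer B.

    dec_a: (n_a, d) array, dec_b: (n_b, d) array. Rows are assumed to be
    feature directions (e.g. SAE decoder rows); no input data is needed.
    Returns (best_index, best_similarity), each of length n_a.
    """
    eps = 1e-12  # guards against division by zero for all-zero rows
    a = dec_a / (np.linalg.norm(dec_a, axis=1, keepdims=True) + eps)
    b = dec_b / (np.linalg.norm(dec_b, axis=1, keepdims=True) + eps)
    sim = a @ b.T                      # (n_a, n_b) cosine similarities
    best = sim.argmax(axis=1)          # closest layer-B feature per row
    return best, sim[np.arange(len(a)), best]

def steer(hidden, direction, alpha):
    """Add alpha times the unit feature direction to a hidden state.

    Positive alpha amplifies the feature, negative alpha suppresses it.
    """
    unit = direction / (np.linalg.norm(direction) + 1e-12)
    return hidden + alpha * unit

# Toy example: layer B holds a shuffled, slightly perturbed copy of layer A.
rng = np.random.default_rng(0)
dec_a = rng.normal(size=(4, 16))
perm = np.array([2, 0, 3, 1])          # dec_b[j] is a copy of dec_a[perm[j]]
dec_b = dec_a[perm] + 0.01 * rng.normal(size=(4, 16))

best, score = match_features(dec_a, dec_b)
steered = steer(np.zeros(2), np.array([3.0, 4.0]), alpha=2.0)
```

In this toy setup a high best-match similarity would indicate a feature that persists into the next layer, while a low one would suggest a feature that has transformed or has no counterpart. Any threshold separating the two cases is a further assumption that the abstract does not specify.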
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Feature Matching | GPT2 Layer 5 match with Layer 11 | LLM Eval | 1.49 | 6 |
| Feature Matching | GPT2 Layer 0 match with Layer 11 | LLM Eval Score | 1.34 | 6 |
| Feature Matching | Gemma-2-2B Layer 12 match with Layer 25 | LLM Evaluation Score | 1.26 | 6 |
| Feature Matching | Gemma-2-2B Layer 0 match with Layer 25 | LLM Eval | 1.23 | 6 |
| Circuit Compression | GPT2-small Digit Addition | Accuracy | 66.55 | 5 |
| Circuit Compression | Gemma-2-2B Digit Addition | Accuracy | 53.49 | 5 |
| Feature Matching | GPT2 Layer 5 match with Layer 6 | LLM Eval | 2.46 | 4 |
| Feature Matching | Gemma-2-2B Layer 12 match with Layer 13 | LLM Eval | 1.51 | 4 |