Semantic Optimal Transport for Sparse Autoencoder Feature Matching and Circuit Compression
About
Sparse autoencoders (SAEs) have become a central tool for interpreting language models. However, two key SAE analyses remain difficult to scale: (1) matching semantically similar features across layers and (2) compressing large feature circuits into interpretable supernodes. Although these have been treated as separate problems, we show that both are instances of a more fundamental challenge: estimating semantic distances between SAE features that lie on different activation manifolds. We introduce a distributional framework for this problem, in which each feature is represented not by a single decoder vector, as in prior work, but by an activation-weighted distribution over the hidden states that express it. By projecting these distributions into a shared reference space and comparing them with the Wasserstein distance, our method provides a unified semantic metric for cross-layer feature comparison. We prove that our representation is invariant to activation rescaling, stable under perturbations, and recovers true matches under finite-sample margin conditions. Empirically, our method outperforms decoder-vector and LLM-based baselines and captures subtle functional distinctions between related features. Notably, it also compresses large feature circuits into interpretable supernodes automatically.
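To make the distributional idea concrete, here is a minimal sketch. It is not the paper's implementation and rests on several assumptions. It assumes the hidden states of both features have already been mapped into a shared reference space of equal dimension. It also swaps in a sliced Wasserstein-1 approximation (random 1D projections) for whatever optimal transport solver the authors use. The function names `feature_distribution`, `wasserstein_1d`, and `sliced_wasserstein` are hypothetical.

```python
import numpy as np

def feature_distribution(hidden_states, activations):
    """Represent a feature as (points, weights): the hidden states on which it
    fires, weighted by activation. Weights are normalized to sum to 1, so any
    positive rescaling of the activations yields the same distribution."""
    hidden_states = np.asarray(hidden_states, dtype=float)
    activations = np.asarray(activations, dtype=float)
    mask = activations > 0  # keep only tokens where the feature is active
    weights = activations[mask]
    return hidden_states[mask], weights / weights.sum()

def wasserstein_1d(x, wx, y, wy):
    """Exact Wasserstein-1 distance between two weighted 1D samples,
    computed as the integral of |CDF_x - CDF_y|."""
    xs, ys = np.argsort(x), np.argsort(y)
    x, wx, y, wy = x[xs], wx[xs], y[ys], wy[ys]
    grid = np.sort(np.concatenate([x, y]))
    deltas = np.diff(grid)
    cum_x = np.concatenate([[0.0], np.cumsum(wx)])
    cum_y = np.concatenate([[0.0], np.cumsum(wy)])
    cdf_x = cum_x[np.searchsorted(x, grid[:-1], side="right")]
    cdf_y = cum_y[np.searchsorted(y, grid[:-1], side="right")]
    return float(np.sum(np.abs(cdf_x - cdf_y) * deltas))

def sliced_wasserstein(p, wp, q, wq, n_projections=64, seed=0):
    """Average 1D Wasserstein distance over random unit directions.
    Assumption: p and q already live in the same reference space."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_projections, p.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return float(np.mean([wasserstein_1d(p @ d, wp, q @ d, wq) for d in dirs]))

# Toy data: 200 hidden states in an 8-dimensional reference space.
rng = np.random.default_rng(0)
H = rng.normal(size=(200, 8))
acts = rng.random(200)

feat_a = feature_distribution(H, acts)
feat_b = feature_distribution(H, 3.0 * acts)   # same feature, rescaled activations
feat_c = feature_distribution(H + 1.0, acts)   # same weights, shifted hidden states

d_rescaled = sliced_wasserstein(*feat_a, *feat_b)
d_shifted = sliced_wasserstein(*feat_a, *feat_c)
```

In this toy setup, `d_rescaled` is zero up to floating-point error, which illustrates the rescaling invariance claimed in the abstract, while `d_shifted` is strictly positive because the underlying hidden states differ.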
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Feature Matching | GPT2 Layer 5 match with Layer 11 | LLM Eval Score | 1.56 | 6 |
| Feature Matching | GPT2 Layer 0 match with Layer 11 | LLM Eval Score | 1.39 | 6 |
| Feature Matching | Gemma-2-2B Layer 12 match with Layer 25 | LLM Eval Score | 1.83 | 6 |
| Feature Matching | Gemma-2-2B Layer 0 match with Layer 25 | LLM Eval Score | 1.83 | 6 |
| Circuit Compression | Gemma-2-2B Digit Addition | Accuracy | 61.51 | 5 |
| Circuit Compression | GPT2-small Digit Addition | Accuracy | 68.12 | 5 |
| Feature Matching | GPT2 Layer 5 match with Layer 6 | LLM Eval Score | 2.53 | 4 |
| Feature Matching | Gemma-2-2B Layer 12 match with Layer 13 | LLM Eval Score | 2.32 | 4 |