Semantic Optimal Transport for Sparse Autoencoder Feature Matching and Circuit Compression

About

Sparse autoencoders (SAEs) have become a central tool for interpreting language models. However, two key SAE analyses remain difficult to scale: (1) matching semantically similar features across multiple layers and (2) compressing large feature circuits into interpretable supernodes. Although these have been treated as separate problems, we show that both are instances of a more fundamental challenge, which we frame as the estimation of semantic distances between SAE features that lie on different activation manifolds. We introduce a distributional framework for this problem, in which each feature is represented not by a single decoder vector, as in prior work, but by an activation-weighted distribution over the hidden states that express it. By projecting these distributions into a shared reference space and comparing them with the Wasserstein distance, our method provides a unified semantic metric for cross-layer feature comparison. We prove that our representation is invariant to activation rescaling, stable under perturbations, and recovers true matches under finite-sample margin conditions. Empirically, our method outperforms decoder-vector and LLM-based baselines and captures subtle functional distinctions between related features. Notably, our method also compresses large feature circuits into interpretable supernodes automatically.
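
To make the distributional idea concrete, here is a minimal Python sketch of one way an activation-weighted feature distribution could be built and compared. It is an illustration under stated assumptions, not the authors' implementation: the function names (`feature_distribution`, `w1_1d`, `sliced_w1`) are hypothetical, the hidden states are assumed to already live in a shared reference space (the abstract does not say how the projection is done), and a sliced Wasserstein-1 distance over random directions stands in for whatever optimal transport solver the paper actually uses.

```python
import numpy as np

def feature_distribution(hidden, acts):
    """Return (points, weights) for one feature: the hidden states on which it
    fires, weighted by activation normalized to sum to 1 (assumed construction)."""
    hidden = np.asarray(hidden, dtype=float)
    acts = np.asarray(acts, dtype=float)
    mask = acts > 0
    if not mask.any():
        raise ValueError("feature never activates on this sample")
    return hidden[mask], acts[mask] / acts[mask].sum()

def w1_1d(x, wx, y, wy):
    """Exact 1-D Wasserstein-1 distance between two weighted empirical
    distributions, computed as the integral of |CDF_x - CDF_y|."""
    xs, ys = np.argsort(x), np.argsort(y)
    x, wx, y, wy = x[xs], wx[xs], y[ys], wy[ys]
    grid = np.sort(np.concatenate([x, y]))
    gaps = np.diff(grid)
    cdf_x = np.concatenate([[0.0], np.cumsum(wx)])[np.searchsorted(x, grid[:-1], side="right")]
    cdf_y = np.concatenate([[0.0], np.cumsum(wy)])[np.searchsorted(y, grid[:-1], side="right")]
    return float(np.sum(np.abs(cdf_x - cdf_y) * gaps))

def sliced_w1(px, wx, py, wy, n_dirs=64, seed=0):
    """Sliced Wasserstein-1: average the 1-D distance over random unit directions.
    This is a cheap stand-in, not necessarily the solver used in the paper."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_dirs, px.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return float(np.mean([w1_1d(px @ d, wx, py @ d, wy) for d in dirs]))

# Toy usage with synthetic data (no real model activations involved).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 8))            # stand-in hidden states
acts_a = np.maximum(rng.normal(size=200), 0)  # stand-in activations, feature A
acts_b = np.maximum(rng.normal(size=200), 0)  # stand-in activations, feature B
pa, wa = feature_distribution(hidden, acts_a)
pb, wb = feature_distribution(hidden, acts_b)
print("distance A vs B:", sliced_w1(pa, wa, pb, wb))
```

Because the weights are normalized, multiplying a feature's activations by a positive constant leaves its distribution unchanged, which mirrors the rescaling invariance claimed in the abstract. In this sketch that property follows directly from the normalization step and is not a reproduction of the paper's proof.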

Tue M. Cao, Nguyen Do, My T. Thai • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Feature Matching | GPT2 Layer 5 match with Layer 11 | LLM Eval: 1.56 | 6 |
| Feature Matching | GPT2 Layer 0 match with Layer 11 | LLM Eval Score: 1.39 | 6 |
| Feature Matching | Gemma-2-2B Layer 12 match with Layer 25 | LLM Evaluation Score: 1.83 | 6 |
| Feature Matching | Gemma-2-2B Layer 0 match with Layer 25 | LLM Eval: 1.83 | 6 |
| Circuit Compression | Gemma-2-2B Digit Addition | Accuracy: 61.51 | 5 |
| Circuit Compression | GPT2-small Digit Addition | Accuracy: 68.12 | 5 |
| Feature Matching | GPT2 Layer 5 match with Layer 6 | LLM Eval: 2.53 | 4 |
| Feature Matching | Gemma-2-2B Layer 12 match with Layer 13 | LLM Eval: 2.32 | 4 |
