
PolySAE: Modeling Feature Interactions in Sparse Autoencoders via Polynomial Decoding

About

Sparse autoencoders (SAEs) interpret neural network representations by decomposing activations into sparse combinations of dictionary atoms. However, SAEs assume features combine additively through linear reconstruction, an assumption that cannot capture compositional structure: linear models cannot distinguish whether "Starbucks" arises from the composition of "star" and "coffee" features or merely their co-occurrence. This forces SAEs to allocate monolithic features for compound concepts rather than decomposing them into interpretable constituents. We introduce PolySAE, which extends the SAE decoder with higher-order terms to model feature interactions while preserving the linear encoder essential for interpretability. Through low-rank tensor factorization on a shared projection subspace, PolySAE captures pairwise and triple feature interactions with small parameter overhead (3% on GPT2). Across four language models and three SAE variants, PolySAE achieves an average improvement of ~8% in probing F1 while maintaining comparable reconstruction error, and produces 2-10× larger Wasserstein distances between class-conditional feature distributions. Critically, learned interaction weights exhibit negligible correlation with co-occurrence frequency (r = 0.06 vs. r = 0.82 for SAE feature covariance), suggesting that polynomial terms capture compositional structure largely independent of surface statistics. Finally, the learned interaction directions causally steer model outputs toward the corresponding compositional semantics.
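To make the idea of a polynomial decoder on a shared low-rank subspace concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the exact parameterization, so the symmetric CP-style form below (a shared projection `U` with per-order output maps `C2` and `C3`), the top-k encoder, and all sizes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: activation width, dictionary size, interaction rank, sparsity.
d_model, n_feats, rank, k = 16, 64, 4, 8

# Linear encoder and ordinary linear decoder, as in a standard SAE.
W_enc = rng.normal(scale=0.1, size=(d_model, n_feats))
b_enc = np.zeros(n_feats)
D = rng.normal(scale=0.1, size=(n_feats, d_model))

# Assumed low-rank interaction parameters: one shared projection U,
# plus one output map per order (C2 for pairwise, C3 for triple terms).
U = rng.normal(scale=0.1, size=(n_feats, rank))
C2 = rng.normal(scale=0.1, size=(rank, d_model))
C3 = rng.normal(scale=0.1, size=(rank, d_model))


def encode(x):
    """Linear encoder, then ReLU and top-k sparsification (an assumed SAE variant)."""
    pre = np.maximum(x @ W_enc + b_enc, 0.0)
    # Keep the k largest activations in each row and zero the rest.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    z = np.zeros_like(pre)
    np.put_along_axis(z, idx, np.take_along_axis(pre, idx, axis=1), axis=1)
    return z


def decode(z, use_interactions=True):
    """Linear reconstruction plus optional second- and third-order terms."""
    linear = z @ D
    if not use_interactions:
        return linear
    p = z @ U  # project sparse codes onto the shared subspace: (batch, rank)
    return linear + (p ** 2) @ C2 + (p ** 3) @ C3


x = rng.normal(size=(5, d_model))
z = encode(x)
x_hat = decode(z)

# Extra parameters relative to the encoder and linear decoder weights.
extra = U.size + C2.size + C3.size
base = W_enc.size + D.size
overhead = extra / base
```

In this form the second-order term equals an explicit sum over feature pairs, `sum_ij z_i z_j W2[i, j]` with `W2[i, j] = sum_r U[i, r] U[j, r] C2[r]`, without ever materializing the `n_feats × n_feats × d_model` tensor. That is why the added parameter count grows with the rank rather than with the square of the dictionary size. The `overhead` value here depends entirely on the toy sizes chosen and is not meant to reproduce the 3% figure reported for GPT2.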

Panagiotis Koromilas, Andreas D. Demou, James Oldfield, Yannis Panagakis, Mihalis Nicolaou • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Reconstruction | SAEBench held-out data | MSE 0.03 | 16 |
| Sparse Probing | SAEBench | Average F1 65.7 | 16 |
| Activation Reconstruction | Pythia 410m | MSE 0.03 | 4 |
| Activation Reconstruction | Pythia 1.4b | MSE 0.22 | 4 |
| Activation Reconstruction | Gemma-2-2B | MSE 1.58 | 4 |
| Activation Reconstruction | GPT2-small | MSE 0.53 | 4 |
| Sparse Probing | Pythia 410m | Average F1 65 | 4 |
| Sparse Probing | Pythia 1.4b | Average F1 64.6 | 4 |
| Sparse Probing | Gemma2-2b | Average F1 64.8 | 4 |
| Sparse Probing | GPT2-small | Average F1 65.7 | 4 |
