
Learning Multi-Level Features with Matryoshka Sparse Autoencoders

About

Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting neural networks by extracting the concepts represented in their activations. However, choosing the size of the SAE dictionary (i.e., the number of learned concepts) creates a tension: as dictionary size increases to capture more relevant concepts, sparsity incentivizes features to be split or absorbed into more specific features, leaving high-level features missing or warped. We introduce Matryoshka SAEs, a novel variant that addresses these issues by simultaneously training multiple nested dictionaries of increasing size, forcing the smaller dictionaries to independently reconstruct the inputs without using the larger dictionaries. This organizes features hierarchically: the smaller dictionaries learn general concepts, while the larger dictionaries learn more specific concepts, without incentive to absorb the high-level features. We train Matryoshka SAEs on Gemma-2-2B and TinyStories and find superior performance on sparse probing and targeted concept erasure tasks, more disentangled concept representations, and reduced feature absorption. While there is a minor tradeoff with reconstruction performance, we believe Matryoshka SAEs are a superior alternative for practical tasks, as they enable training arbitrarily large SAEs while retaining interpretable features at different levels of abstraction.
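The core idea of the nested training objective can be sketched in a few lines: each prefix of the dictionary must reconstruct the input on its own, and the losses are summed. The sketch below is illustrative only; the function name, the use of TopK sparsity, and the specific prefix sizes are assumptions, not the authors' exact implementation.

```python
import numpy as np

def matryoshka_sae_loss(x, W_enc, b_enc, W_dec, b_dec, prefix_sizes, k):
    """Summed reconstruction loss over nested dictionary prefixes (sketch).

    x: (d,) input activation vector.
    W_enc: (d, n) encoder weights; b_enc: (n,) encoder bias.
    W_dec: (n, d) decoder weights; b_dec: (d,) decoder bias.
    prefix_sizes: increasing nested dictionary sizes, e.g. [64, 256, 1024].
    k: number of active latents kept per prefix (TopK-style sparsity).
    """
    pre = x @ W_enc + b_enc                       # latent pre-activations, shape (n,)
    total = 0.0
    for m in prefix_sizes:                        # each nested sub-dictionary
        acts = np.maximum(pre[:m], 0.0)           # ReLU on the first m latents only
        if k < m:                                 # zero out all but the top-k latents
            thresh = np.partition(acts, -k)[-k]
            acts = np.where(acts >= thresh, acts, 0.0)
        recon = acts @ W_dec[:m] + b_dec          # reconstruct using only the prefix
        total += np.sum((x - recon) ** 2)         # each prefix must stand alone
    return total
```

Because each smaller prefix is penalized for its own reconstruction error, general features that explain most of the variance are pushed into the early latents, while later latents are free to specialize.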

Bart Bussmann, Noa Nabeshima, Adam Karvonen, Neel Nanda · 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K (test) | Accuracy | 54.88 | 770 |
| Code Generation | HumanEval (test) | Pass@1 | 36.13 | 506 |
| Code Generation | MBPP (test) | Pass@1 | 35.76 | 298 |
| Concept Extraction Consistency | IMDB | MPPC | 70.7 | 14 |
| Concept Extraction Consistency | ImageNet | MPPC | 0.225 | 14 |
| Concept Extraction Consistency | WikiArt | MPPC | 24.7 | 14 |
| Concept Extraction Consistency | CoNLL | MPPC | 0.339 | 14 |
| Summarization | Summary | Score | 44.25 | 13 |
| Legal Reasoning | Law | Score | 21.08 | 13 |
| Concept Extraction Consistency | AudioSet | MPPC | 27.4 | 7 |
