Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models

About

Sparse Autoencoders (SAEs) have emerged as a powerful framework for machine learning interpretability, enabling the unsupervised decomposition of model representations into a dictionary of abstract, human-interpretable concepts. However, we reveal a fundamental limitation: existing SAEs exhibit severe instability, as identical models trained on similar datasets can produce sharply different dictionaries, undermining their reliability as an interpretability tool. To address this issue, we draw inspiration from the Archetypal Analysis framework introduced by Cutler & Breiman (1994) and present Archetypal SAEs (A-SAE), wherein dictionary atoms are constrained to the convex hull of the data. This geometric anchoring significantly enhances the stability of inferred dictionaries, and the mildly relaxed variants (RA-SAEs) further match state-of-the-art reconstruction abilities. To rigorously assess the quality of dictionaries learned by SAEs, we introduce two new benchmarks that test (i) plausibility, whether dictionaries recover "true" classification directions, and (ii) identifiability, whether dictionaries disentangle synthetic concept mixtures. Across all evaluations, RA-SAEs consistently yield more structured representations while uncovering novel, semantically meaningful concepts in large-scale vision models.
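The convex-hull constraint mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the softmax parameterization of the mixing weights, and the random stand-in data are all assumptions made for illustration. The idea shown is only that each dictionary atom is a convex combination of data points (non-negative weights that sum to one), so it cannot leave the convex hull of the data.

```python
import numpy as np

def softmax_rows(logits):
    """Map each row of unconstrained logits to non-negative weights summing to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def archetypal_dictionary(logits, data):
    """Build atoms as convex combinations of data points (hypothetical helper).

    logits: (n_atoms, n_points) free parameters
    data:   (n_points, dim) data points, e.g. model activations
    """
    weights = softmax_rows(logits)   # each row lies on the probability simplex
    atoms = weights @ data           # each atom lies in the convex hull of `data`
    return atoms, weights

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 8))      # stand-in for 50 activation vectors of dim 8
logits = rng.normal(size=(4, 50))    # parameters for 4 dictionary atoms
atoms, weights = archetypal_dictionary(logits, data)
print(atoms.shape)
```

Because the weights are produced by a softmax, gradient updates to `logits` can never move an atom outside the hull, which is one simple way to enforce such a constraint during training.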

Thomas Fel, Ekdeep Singh Lubana, Jacob S. Prince, Matthew Kowal, Victor Boutin, Isabel Papadimitriou, Binxu Wang, Martin Wattenberg, Demba Ba, Talia Konkle • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| Interpretability and Faithfulness Evaluation | DINOv2 ViT-B/14 tokens | LLM Rank: 5 | 22 |
| Sparse Autoencoder Concept Alignment | CUB | Sparsity: 0.994 | 18 |
| SAE Interpretability and Faithfulness Evaluation | DINOv2 ViT-B 14 Layer 12 activations | LLM Rank: 3 | 12 |
| Interpretability and Faithfulness Evaluation | ImageNet 21k ViT-B/16 tokens | LLM Rank: 2 | 10 |
| Computational Efficiency | DINOv2 ViT-B/14 (layer 11) | Latency (ms/batch): 437.4 | 10 |
| Dictionary Learning Stability and Geometry Evaluation | DINOv2-B/14 activations (three seeded token datasets) | Cross-Init Similarity: 86.85 | 9 |
| SAE Interpretability and Faithfulness Evaluation | DINOv2 ViT-S 14 activations | LLM Rank: 5 | 6 |
| SAE Interpretability and Faithfulness Evaluation | SigLIP2-B 16 activations | LLM Rank: 5 | 6 |
| SAE Interpretability and Faithfulness Evaluation | ConvNeXt Base activations V2 | LLM Rank: 5 | 6 |
| SAE Interpretability and Faithfulness Evaluation | DINOv2 ViT-L/14 activations | LLM Rank: 2 | 6 |