
Scaling and evaluating sparse autoencoders

About

Sparse autoencoders provide a promising unsupervised approach for extracting interpretable features from a language model by reconstructing activations from a sparse bottleneck layer. Since language models learn many concepts, autoencoders need to be very large to recover all relevant features. However, studying the properties of autoencoder scaling is difficult due to the need to balance reconstruction and sparsity objectives and the presence of dead latents. We propose using k-sparse autoencoders [Makhzani and Frey, 2013] to directly control sparsity, simplifying tuning and improving the reconstruction-sparsity frontier. Additionally, we find modifications that result in few dead latents, even at the largest scales we tried. Using these techniques, we find clean scaling laws with respect to autoencoder size and sparsity. We also introduce several new metrics for evaluating feature quality based on the recovery of hypothesized features, the explainability of activation patterns, and the sparsity of downstream effects. These metrics all generally improve with autoencoder size. To demonstrate the scalability of our approach, we train a 16 million latent autoencoder on GPT-4 activations for 40 billion tokens. We release training code and autoencoders for open-source models, as well as a visualizer.
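The abstract's core technique is the k-sparse autoencoder: instead of penalizing an L1 term, the bottleneck keeps only the k largest pre-activations, so sparsity is controlled directly. Below is a minimal NumPy sketch of that forward pass; all names, dimensions, and initializations are illustrative assumptions, not the paper's actual configuration (which scales to 16 million latents on GPT-4 activations).

```python
import numpy as np

# Hypothetical small sizes for illustration; the paper trains far larger models.
rng = np.random.default_rng(0)
d_model, n_latents, k = 64, 256, 8

W_enc = rng.normal(0, 0.02, (d_model, n_latents))  # encoder weights
W_dec = rng.normal(0, 0.02, (n_latents, d_model))  # decoder weights
b_pre = np.zeros(d_model)                          # pre-encoder bias

def topk_sae_forward(x):
    """Encode activations, keep only the top-k latents, then decode."""
    pre = (x - b_pre) @ W_enc                      # encoder pre-activations
    # Zero out everything except the k largest entries per example.
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]
    z = np.zeros_like(pre)
    np.put_along_axis(z, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    recon = z @ W_dec + b_pre                      # linear decoder
    return recon, z

x = rng.normal(size=(4, d_model))                  # a batch of "model activations"
recon, z = topk_sae_forward(x)
print((z != 0).sum(axis=-1))                       # exactly k active latents per example
```

Because exactly k latents fire per example, the reconstruction-sparsity trade-off is set by a single hyperparameter rather than tuned through a penalty coefficient, which is the tuning simplification the abstract refers to.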

Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | CIFAR-100 | Accuracy: 41.8 | 302 |
| Concept Extraction Evaluation | 4 classification datasets average | RAcc: 98.47 | 35 |
| Sparse Autoencoder Concept Alignment | CUB | Sparsity: 0.995 | 18 |
| Activation Reconstruction | Pythia model activations | Pearson Correlation Coefficient: 0.7074 | 18 |
| Concept Component Analysis | Concept Component Analysis Evaluation Set (test) | Pearson Correlation (MPC): 0.7027 | 18 |
| Concept Extraction Consistency | CoNLL | MPPC: 0.761 | 14 |
| Concept Extraction Consistency | WikiArt | MPPC: 86.1 | 14 |
| Concept Extraction Consistency | ImageNet | MPPC: 0.757 | 14 |
| Concept Extraction Consistency | IMDB | MPPC: 99.6 | 14 |
| Concept Extraction Consistency | AudioSet | MPPC: 60.1 | 7 |

Showing 10 of 14 rows.
