Scaling and evaluating sparse autoencoders

About

Sparse autoencoders provide a promising unsupervised approach for extracting interpretable features from a language model by reconstructing activations from a sparse bottleneck layer. Since language models learn many concepts, autoencoders need to be very large to recover all relevant features. However, studying the properties of autoencoder scaling is difficult due to the need to balance reconstruction and sparsity objectives and the presence of dead latents. We propose using k-sparse autoencoders [Makhzani and Frey, 2013] to directly control sparsity, simplifying tuning and improving the reconstruction-sparsity frontier. Additionally, we find modifications that result in few dead latents, even at the largest scales we tried. Using these techniques, we find clean scaling laws with respect to autoencoder size and sparsity. We also introduce several new metrics for evaluating feature quality based on the recovery of hypothesized features, the explainability of activation patterns, and the sparsity of downstream effects. These metrics all generally improve with autoencoder size. To demonstrate the scalability of our approach, we train a 16 million latent autoencoder on GPT-4 activations for 40 billion tokens. We release training code and autoencoders for open-source models, as well as a visualizer.
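As an illustration of how a k-sparse autoencoder controls sparsity directly, the following is a minimal NumPy sketch of the forward pass only. It is not the released training code: the names `topk_activation`, `sae_forward`, `w_enc`, `w_dec`, and `b_pre` are hypothetical, the weights are random rather than trained, and the choice to subtract a shared bias before encoding and add it back after decoding is an assumption made for this example.

```python
import numpy as np


def topk_activation(pre_acts: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest entries in each row and zero out the rest."""
    # Indices of the k largest pre-activations per row (unordered within the k).
    idx = np.argpartition(pre_acts, -k, axis=-1)[..., -k:]
    out = np.zeros_like(pre_acts)
    kept = np.take_along_axis(pre_acts, idx, axis=-1)
    np.put_along_axis(out, idx, kept, axis=-1)
    return out


def sae_forward(x, w_enc, w_dec, b_pre, k):
    """Encode activations into k-sparse latents, then reconstruct them."""
    # Assumption: a single bias b_pre is removed before encoding and restored after decoding.
    latents = topk_activation((x - b_pre) @ w_enc, k)
    recon = latents @ w_dec + b_pre
    return latents, recon


# Toy sizes chosen for illustration only (hypothetical, far smaller than any real model).
rng = np.random.default_rng(0)
d_model, n_latents, k = 16, 64, 4
x = rng.normal(size=(8, d_model))
w_enc = rng.normal(size=(d_model, n_latents)) / np.sqrt(d_model)
w_dec = w_enc.T.copy()  # untrained decoder, tied to the encoder for this sketch
b_pre = np.zeros(d_model)

latents, recon = sae_forward(x, w_enc, w_dec, b_pre, k)
l0 = (latents != 0).sum(axis=-1)   # number of active latents per example
mse = np.mean((x - recon) ** 2)    # reconstruction error of the untrained model
```

Because exactly `k` latents survive per example, the L0 sparsity is fixed by construction rather than tuned through a penalty coefficient, which is the simplification the abstract describes. A practical implementation might also apply a ReLU to the kept values; that is omitted here.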

Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | CIFAR-100 | Accuracy | 41.8 | 302 |
| Misalignment Detection | Taylor | Accuracy | 94.3 | 63 |
| Safe-or-harmful binary classification | Beavertails | Accuracy | 84.4 | 63 |
| Multi-risk safety monitoring | Beavertails | Accuracy (%) | 77.3 | 63 |
| Concept Extraction Evaluation | 4 classification datasets average | RAcc | 98.47 | 35 |
| Embedding Reconstruction | CLIP vision embeddings CC3M and ImageNet | L0 Error | 60 | 24 |
| Sparse Autoencoder Evaluation | Gemma-2-2B activations | L0 Count | 320 | 20 |
| Sparse Autoencoder Concept Alignment | CUB | Sparsity | 0.995 | 18 |
| Activation Reconstruction | Pythia model activations | Pearson Correlation Coefficient | 0.7074 | 18 |
| Concept Component Analysis | Concept Component Analysis Evaluation Set (test) | Pearson Correlation (MPC) | 0.7027 | 18 |
Showing 10 of 37 rows
