Scaling and evaluating sparse autoencoders

About

Sparse autoencoders provide a promising unsupervised approach for extracting interpretable features from a language model by reconstructing activations from a sparse bottleneck layer. Since language models learn many concepts, autoencoders need to be very large to recover all relevant features. However, studying the properties of autoencoder scaling is difficult due to the need to balance reconstruction and sparsity objectives and the presence of dead latents. We propose using k-sparse autoencoders [Makhzani and Frey, 2013] to directly control sparsity, simplifying tuning and improving the reconstruction-sparsity frontier. Additionally, we find modifications that result in few dead latents, even at the largest scales we tried. Using these techniques, we find clean scaling laws with respect to autoencoder size and sparsity. We also introduce several new metrics for evaluating feature quality based on the recovery of hypothesized features, the explainability of activation patterns, and the sparsity of downstream effects. These metrics all generally improve with autoencoder size. To demonstrate the scalability of our approach, we train a 16 million latent autoencoder on GPT-4 activations for 40 billion tokens. We release training code and autoencoders for open-source models, as well as a visualizer.
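To make the k-sparse idea from the abstract concrete, here is a minimal pure-Python sketch of a forward pass in which sparsity is set directly by keeping only the k largest latent pre-activations. The linear encoder and decoder, the single shared bias `b`, and all names (`topk_activation`, `forward`, `W_enc`, `W_dec`) are illustrative assumptions rather than details taken from the abstract or the released training code.

```python
def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations and zero out the rest."""
    # Indices sorted by value, largest first; ties keep the lower index first.
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    out = [0.0] * len(pre_acts)
    for i in order[:k]:
        out[i] = pre_acts[i]
    return out


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def forward(x, W_enc, W_dec, b, k):
    """Hypothetical k-sparse autoencoder forward pass.

    Assumed shapes: W_enc is n_latents x d_model, W_dec is d_model x n_latents,
    and b is a length-d_model bias subtracted before encoding and added back
    after decoding.
    """
    centered = [xi - bi for xi, bi in zip(x, b)]
    z = topk_activation(matvec(W_enc, centered), k)          # sparse latents
    x_hat = [r + bi for r, bi in zip(matvec(W_dec, z), b)]   # reconstruction
    return z, x_hat


def mse(x, x_hat):
    """Mean squared reconstruction error."""
    return sum((a - c) ** 2 for a, c in zip(x, x_hat)) / len(x)


# Top-k keeps the two largest entries and zeroes the others.
print(topk_activation([0.1, 3.0, -1.0, 2.0], 2))  # [0.0, 3.0, 0.0, 2.0]
```

Because at most k latents are active by construction, the sparsity level is a fixed hyperparameter and the training loss can be the reconstruction error alone, which is the tuning simplification the abstract describes.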

Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu • 2024

Related benchmarks

Task	Dataset	Result
Image Classification	CIFAR-100	Accuracy 41.8	302
Misalignment Detection	Taylor	Accuracy 94.3	63
Safe-or-harmful binary classification	Beavertails	Accuracy 84.4	63
Multi-risk safety monitoring	Beavertails	Accuracy (%) 77.3	63
Concept Extraction Evaluation	4 classification datasets average	RAcc 98.47	35
Embedding Reconstruction	CLIP vision embeddings CC3M and ImageNet	L0 Error 60	24
Sparse Autoencoder Evaluation	Gemma-2-2B activations	L0 Count 320	20
Sparse Autoencoder Concept Alignment	CUB	Sparsity 0.995	18
Activation Reconstruction	Pythia model activations	Pearson Correlation Coefficient 0.7074	18
Concept Component Analysis	Concept Component Analysis Evaluation Set (test)	Pearson Correlation (MPC) 0.7027	18

Showing 10 of 37 rows
