
Linear Explanations for Individual Neurons

About

In recent years, many methods have been developed to understand the internal workings of neural networks, often by describing the function of individual neurons in the model. However, these methods typically focus only on explaining the very highest activations of a neuron. In this paper we show that this is not sufficient, and that the highest activation range is responsible for only a very small percentage of the neuron's causal effect. In addition, inputs causing lower activations are often very different and can't be reliably predicted by looking only at high activations. We propose that neurons should instead be understood as a linear combination of concepts, and develop an efficient method for producing these linear explanations. In addition, we show how to automatically evaluate description quality using simulation, i.e. predicting neuron activations on unseen inputs, in the vision setting.
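To make the idea of a linear explanation and its simulation-based evaluation concrete, here is a minimal sketch on synthetic data. It is not the authors' method: the function names `fit_linear_explanation` and `simulation_correlation` are hypothetical, the concept scores are random numbers rather than outputs of a real concept detector, and the fitting step is plain least squares with a simple top-k selection chosen only for illustration. The correlation between simulated and actual activations on held-out inputs stands in for the kind of simulation score the abstract describes.

```python
import numpy as np

def fit_linear_explanation(concept_scores, activations, top_k=3):
    """Approximate activations as a sparse linear combination of concepts.

    concept_scores: (n_inputs, n_concepts) array, presence of each concept per input.
    activations:    (n_inputs,) array of neuron activations.
    Returns a weight vector with at most top_k nonzero entries.
    """
    # Least-squares fit over all concepts.
    w_full, *_ = np.linalg.lstsq(concept_scores, activations, rcond=None)
    # Keep the top_k concepts by absolute weight, then refit on those alone.
    keep = np.argsort(-np.abs(w_full))[:top_k]
    w_sub, *_ = np.linalg.lstsq(concept_scores[:, keep], activations, rcond=None)
    weights = np.zeros(concept_scores.shape[1])
    weights[keep] = w_sub
    return weights

def simulation_correlation(concept_scores, activations, weights):
    """Correlation between simulated and actual activations."""
    simulated = concept_scores @ weights
    return float(np.corrcoef(simulated, activations)[0, 1])

# Synthetic example (assumption): the neuron responds strongly to concept 1
# and weakly to concept 4, plus a small amount of noise.
rng = np.random.default_rng(0)
n_inputs, n_concepts = 400, 10
scores = rng.random((n_inputs, n_concepts))
true_w = np.zeros(n_concepts)
true_w[[1, 4]] = [2.0, 0.5]
acts = scores @ true_w + 0.05 * rng.standard_normal(n_inputs)

# Fit on the first 300 inputs, simulate on the remaining unseen 100.
train, test = slice(0, 300), slice(300, None)
w = fit_linear_explanation(scores[train], acts[train], top_k=2)
corr = simulation_correlation(scores[test], acts[test], w)
```

In this toy setup, an explanation that kept only the strongest concept would still track the highest activations, but it would miss the weaker concept that shapes the lower activation range. That gap is what a multi-concept linear explanation is meant to close.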

Tuomas Oikarinen, Tsui-Wei Weng • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Neuron Labeling | ImageNet-1K | DMA 56.81 | 60 |
| Neuron Labeling | ResNet101 neurons | SCS Score 25.36 | 15 |
| Neuron Labeling | ResNet50 neurons | SCS Score 25.51 | 15 |
| Neuron Labeling | SAE-TopK neurons | SCS Score 35.23 | 15 |
| Neuron Labeling | SAE Vanilla neurons | SCS Score 27.44 | 15 |
| Neuron Labeling | SAE-TopK Evaluated Neurons | AUC 0.98 | 15 |
| Neuron Labeling | ResNet50 evaluated neurons | AUC 89 | 15 |
| Neuron Labeling | SAE Vanilla (evaluated neurons) | AUC 0.78 | 15 |
| Neuron Labeling Faithfulness | Evaluated Neurons ResNet50 and SAE-TopK | AUC 88 | 15 |
| Neuron Labeling | ResNet101 Neurons (evaluated) | AUC 89 | 15 |
Showing 10 of 11 rows
