Linear Explanations for Individual Neurons
About
In recent years, many methods have been developed to understand the internal workings of neural networks, often by describing the function of individual neurons in the model. However, these methods typically focus only on explaining the very highest activations of a neuron. In this paper we show that this is not sufficient: the highest activation range is responsible for only a very small percentage of the neuron's causal effect. In addition, inputs causing lower activations are often very different and cannot be reliably predicted by looking only at high activations. We propose that neurons should instead be understood as a linear combination of concepts, and develop an efficient method for producing these linear explanations. We also show how to automatically evaluate description quality using simulation, i.e. predicting neuron activations on unseen inputs, in the vision setting.
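As a rough illustration of the two ideas in the abstract, the sketch below treats an explanation as a weighted sum of concepts and scores it by simulating activations on held-out inputs. It is not the paper's implementation: the concept names, weights, activation values, and the choice of Pearson correlation as the simulation score are all assumptions made for the example.

```python
from math import sqrt

def simulate(explanation, concept_scores):
    """Predicted activation: weighted sum of concept presence scores."""
    return sum(w * concept_scores.get(c, 0.0) for c, w in explanation.items())

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Hypothetical linear explanation: concept -> weight (made-up values).
explanation = {"dog": 2.0, "grass": 0.5}

# Hypothetical held-out inputs: (concept presence scores, observed activation).
held_out = [
    ({"dog": 1.0, "grass": 0.0}, 2.1),
    ({"dog": 0.0, "grass": 1.0}, 0.4),
    ({"dog": 1.0, "grass": 1.0}, 2.6),
    ({"dog": 0.0, "grass": 0.0}, 0.1),
]

predicted = [simulate(explanation, scores) for scores, _ in held_out]
actual = [activation for _, activation in held_out]

# Assumed quality measure: how well simulated activations track real ones.
score = pearson(predicted, actual)
print(predicted, round(score, 3))
```

Note that a single-concept explanation such as `{"dog": 2.0}` would predict the same activation for the second and fourth inputs, which is the kind of lower-activation behavior the abstract argues top-activation descriptions miss.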
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Neuron Labeling | ImageNet-1K | DMA | 56.81 | 60 |
| Neuron Labeling | ResNet101 neurons | SCS Score | 25.36 | 15 |
| Neuron Labeling | ResNet50 neurons | SCS Score | 25.51 | 15 |
| Neuron Labeling | SAE-TopK neurons | SCS Score | 35.23 | 15 |
| Neuron Labeling | SAE Vanilla neurons | SCS Score | 27.44 | 15 |
| Neuron Labeling | SAE-TopK Evaluated Neurons | AUC | 0.98 | 15 |
| Neuron Labeling | ResNet50 evaluated neurons | AUC | 89 | 15 |
| Neuron Labeling | SAE Vanilla (evaluated neurons) | AUC | 0.78 | 15 |
| Neuron Labeling Faithfulness | Evaluated Neurons ResNet50 and SAE-TopK | AUC | 88 | 15 |
| Neuron Labeling | ResNet101 Neurons (evaluated) | AUC | 89 | 15 |