Linear Explanations for Individual Neurons

About

In recent years, many methods have been developed to understand the internal workings of neural networks, often by describing the function of individual neurons in the model. However, these methods typically focus only on explaining the very highest activations of a neuron. In this paper we show that this is not sufficient, and that the highest activation range is responsible for only a very small percentage of the neuron's causal effect. In addition, inputs causing lower activations are often very different and can't be reliably predicted by looking only at high activations. We propose that neurons should instead be understood as a linear combination of concepts, and develop an efficient method for producing these linear explanations. We also show how to automatically evaluate description quality using simulation, i.e. predicting neuron activations on unseen inputs, in the vision setting.

Tuomas Oikarinen, Tsui-Wei Weng • 2024
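
To make the idea in the abstract concrete, here is a minimal sketch of a "linear explanation" and its evaluation by simulation. It is not the authors' pipeline: the concept scores are synthetic stand-ins for whatever concept detector one would use, the fit is plain least squares, and the names `fit_linear_explanation`, `simulate` and `correlation` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np


def fit_linear_explanation(concept_scores, activations, top_k=3):
    """Fit activations as a weighted sum of concept scores, keeping top_k concepts.

    Assumption: ordinary least squares followed by a refit on the selected
    concepts. The paper's actual fitting procedure may differ.
    """
    weights, *_ = np.linalg.lstsq(concept_scores, activations, rcond=None)
    keep = np.argsort(-np.abs(weights))[:top_k]
    # Refit using only the selected concepts so the explanation stays short.
    sparse_w, *_ = np.linalg.lstsq(concept_scores[:, keep], activations, rcond=None)
    return keep, sparse_w


def simulate(concept_scores, keep, weights):
    """Predict neuron activations from the explanation alone."""
    return concept_scores[:, keep] @ weights


def correlation(pred, actual):
    """Pearson correlation between simulated and actual activations."""
    return float(np.corrcoef(pred, actual)[0, 1])


# Synthetic data (assumption): a neuron driven strongly by concept 2
# and more weakly by concept 7, plus a little noise.
rng = np.random.default_rng(0)
n_inputs, n_concepts = 400, 20
scores = rng.random((n_inputs, n_concepts))
true_w = np.zeros(n_concepts)
true_w[[2, 7]] = [3.0, 1.0]
acts = scores @ true_w + 0.05 * rng.standard_normal(n_inputs)

# Fit on one split, simulate on unseen inputs.
train, test = slice(0, 300), slice(300, None)
keep, w = fit_linear_explanation(scores[train], acts[train], top_k=2)
sim = simulate(scores[test], keep, w)
corr = correlation(sim, acts[test])

# Baseline: describe the neuron by its single strongest concept only.
single = correlation(scores[test, 2], acts[test])
print(f"two-concept explanation: {corr:.3f}, single concept: {single:.3f}")
```

In this toy setup the single-concept description misses the part of the activation driven by the weaker concept, which is the kind of gap the abstract argues a linear combination of concepts can close.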

Related benchmarks

Task	Dataset	Result
Neuron Labeling	ImageNet-1K	DMA 56.81	60
Neuron Labeling	ResNet101 neurons	SCS Score 25.36	15
Neuron Labeling	ResNet50 neurons	SCS Score 25.51	15
Neuron Labeling	SAE-TopK neurons	SCS Score 35.23	15
Neuron Labeling	SAE Vanilla neurons	SCS Score 27.44	15
Neuron Labeling	SAE-TopK Evaluated Neurons	AUC 0.98	15
Neuron Labeling	ResNet50 evaluated neurons	AUC 89	15
Neuron Labeling	SAE Vanilla (evaluated neurons)	AUC 0.78	15
Neuron Labeling Faithfulness	Evaluated Neurons ResNet50 and SAE-TopK	AUC 88	15
Neuron Labeling	ResNet101 Neurons (evaluated)	AUC 89	15

