
Progress measures for grokking via mechanistic interpretability

About

Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the number of parameters, training data, or training steps. One approach to understanding emergence is to find continuous progress measures that underlie the seemingly discontinuous qualitative changes. We argue that progress measures can be found via mechanistic interpretability: reverse-engineering learned behaviors into their individual components. As a case study, we investigate the recently discovered phenomenon of "grokking" exhibited by small transformers trained on modular addition tasks. We fully reverse engineer the algorithm learned by these networks, which uses discrete Fourier transforms and trigonometric identities to convert addition to rotation about a circle. We confirm the algorithm by analyzing the activations and weights and by performing ablations in Fourier space. Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup. Our results show that grokking, rather than being a sudden shift, arises from the gradual amplification of structured mechanisms encoded in the weights, followed by the later removal of memorizing components.
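The trigonometric mechanism the abstract describes can be sketched numerically. The snippet below is a minimal illustration, not the paper's trained network: it uses a single hand-picked frequency (the value of k here is a hypothetical choice; the trained transformers rely on several "key frequencies" and sum their contributions) to show how the angle-addition identities turn modular addition into rotation about a circle.

```python
import numpy as np

p = 113  # modulus of the modular-addition task studied in the paper
k = 17   # illustrative frequency (hypothetical choice for this sketch)
w = 2 * np.pi * k / p

def mod_add_logits(a, b):
    """Logits for c = (a + b) mod p via trigonometric identities.

    Angle-addition identities compose the two rotations:
      cos(w(a+b)) = cos(wa)cos(wb) - sin(wa)sin(wb)
      sin(w(a+b)) = sin(wa)cos(wb) + cos(wa)sin(wb)
    The logit for each candidate c is cos(w(a+b-c)), which is maximized
    exactly when c = (a + b) mod p.
    """
    cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
    sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
    c = np.arange(p)
    return cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)

a, b = 40, 99
pred = int(np.argmax(mod_add_logits(a, b)))  # equals (a + b) % p == 26
```

A single frequency already suffices mathematically, since cos(w(a+b-c)) peaks only at the true answer over integer c; the trained networks spread the computation across multiple frequencies, which is part of what the paper's Fourier-space ablations confirm.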

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt • 2023

Related benchmarks

Task | Dataset | Result | Rank
Refusal Induction | Refusal Induction (held-in and held-out) | Activation Patching: 48 | 9
Sycophancy Reduction | Sycophancy Reduction (held-in and held-out) | Activation Patching: 77 | 9
Verse Style-Transfer | Verse Style-Transfer (held-in and held-out) | Activation Patching: 20 | 9
