The SuperActivator Mechanism: Transformers Concentrate Reliable Concept Signals in the Tail

About

Concept vectors aim to enhance model interpretability by linking internal representations with human-understandable semantics, but their practical utility is often limited by noisy and inconsistent activations. In this work, we uncover the SuperActivator Mechanism: a transformer dynamic that amplifies concept activation gaps, concentrating the most reliable concept evidence into a small set of high-activation tokens. To develop a theoretical understanding of this mechanism, we prove that concept-aligned attention heads multiplicatively amplify pairwise activation gaps, with already-extreme activations growing fastest. We find that this amplification is not just theoretical, but also occurs empirically on large-scale models: while in- and out-of-concept activation distributions overlap considerably, the in-concept distribution develops a positive tail clearly separated from the noise. These high-tail tokens, which we call SuperActivators, appear consistently across concept-positive samples, making them reliable indicators of concept presence. Accordingly, SuperActivator-based detection improves F1 by up to 0.14 over standard concept activation aggregators and prompting baselines across image and text modalities, models, layers, and concept extraction techniques, demonstrating the generality and practicality of our insights. Further empirical analysis demonstrates that the most reliable SuperActivators are sparse, with detection typically peaking when using only 5-10% of in-concept token activations, and capture more faithful localized semantics than global concept vectors.

Cassandra Goldberg, Chaehyeon Kim, Adam Stein, Eric Wong • 2025
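To make the detection idea in the abstract concrete, here is a minimal sketch that contrasts a standard mean aggregator with a tail-based score computed only from the highest-activation tokens. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `top_frac` default (chosen loosely from the 5-10% range mentioned in the abstract), the detection threshold, and the toy activations are all hypothetical.

```python
import numpy as np


def mean_score(activations):
    """Standard aggregator: average concept activation over all tokens."""
    return float(np.mean(np.asarray(activations, dtype=float)))


def tail_score(activations, top_frac=0.1):
    """Tail aggregator: average of the top `top_frac` fraction of activations.

    `top_frac` is a hypothetical knob for this sketch, not a value taken
    from the paper's code.
    """
    acts = np.asarray(activations, dtype=float)
    k = max(1, int(np.ceil(top_frac * acts.size)))  # always keep >= 1 token
    # np.partition places the k largest values in the last k positions.
    top_k = np.partition(acts, acts.size - k)[acts.size - k:]
    return float(np.mean(top_k))


def detect(activations, threshold, top_frac=0.1):
    """Flag a sample as concept-positive if its tail score clears a threshold.

    In practice the threshold would be calibrated on held-out data;
    here it is simply an assumed constant.
    """
    return tail_score(activations, top_frac) >= threshold


# Toy per-token activations (made up for illustration): the two samples
# overlap on most tokens, but the positive one has a single extreme token.
positive = np.array([0.1, -0.2, 0.0, 0.05, -0.1, 0.15, -0.05, 0.0, 0.1, 0.9])
negative = np.array([0.1, -0.2, 0.0, 0.05, -0.1, 0.15, -0.05, 0.0, 0.1, 0.2])

print("mean:", mean_score(positive), mean_score(negative))
print("tail:", tail_score(positive), tail_score(negative))
print("detected:", detect(positive, 0.5), detect(negative, 0.5))
```

In this toy case the mean scores of the two samples stay close together because most tokens overlap, while the tail scores are far apart, which mirrors the abstract's claim that the reliable evidence sits in a small set of high-activation tokens.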

Related benchmarks

Task	Dataset	Result
Concept Attribution	iSarcasm	Avg F1 94	258
Concept Attribution	COCO	Average Attribution F1 55	178
Concept Attribution	CLEVR (test)	F1 Score 0.85	160
Concept Attribution	Pascal	F1 (Concept) 71	80
Concept Attribution	Sarcasm Dataset	Average F1 0.74	40
Concept Attribution	GoEmotions v1.0 (test)	Average F1 42	38
Attribution	OpenSurfaces	Avg Attribution F1 43	18
Attribution	Pascal	Avg Attribution F1 50	18
Concept Attribution	Sarcasm	Avg Attribution F1 42	18
Concept Attribution	GoEmotions	Average Attribution F1 23	18

Showing 10 of 27 rows
