SuperActivators: Only the Tail of the Distribution Contains Reliable Concept Signals
About
Concept vectors aim to enhance model interpretability by linking internal representations with human-understandable semantics, but their utility is often limited by noisy and inconsistent activations. In this work, we uncover a clear pattern within the noise, which we term the SuperActivator Mechanism: while in-concept and out-of-concept activations overlap considerably, the token activations in the extreme high tail of the in-concept distribution provide a reliable signal of concept presence. We demonstrate the generality of this mechanism by showing that SuperActivator tokens consistently outperform standard vector-based and prompting concept detection approaches, achieving up to a 14% higher F1 score across image and text modalities, model architectures, model layers, and concept extraction techniques. Finally, we leverage SuperActivator tokens to improve feature attributions for concepts.
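The mechanism described above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes token activations are scalar projections onto a concept vector, that the "extreme high tail" is defined by a quantile cutoff on in-concept activations, and that a sample is flagged when any token reaches that cutoff. The function names and the `tail_fraction` parameter are hypothetical.

```python
import numpy as np


def token_activations(embeddings, concept_vector):
    """Project each token embedding onto the unit-normalized concept vector.

    embeddings: array of shape (n_tokens, d); concept_vector: array of shape (d,).
    """
    unit = concept_vector / np.linalg.norm(concept_vector)
    return embeddings @ unit


def fit_tail_threshold(in_concept_acts, tail_fraction=0.1):
    """Return the activation value that marks the top `tail_fraction` of
    in-concept token activations (hypothetical choice of cutoff)."""
    return float(np.quantile(in_concept_acts, 1.0 - tail_fraction))


def detect_concept(token_acts, threshold):
    """Flag a sample as containing the concept if any token activation
    reaches the tail threshold."""
    return bool(np.max(token_acts) >= threshold)


# Toy in-concept activations: 0.00, 0.01, ..., 0.99 (illustrative data only).
in_concept = np.arange(100) / 100.0
threshold = fit_tail_threshold(in_concept, tail_fraction=0.1)

# A sample with one strongly activating token is flagged; a sample whose
# tokens all sit in the overlapping bulk of the distribution is not.
has_concept = detect_concept(np.array([0.10, 0.95]), threshold)
no_concept = detect_concept(np.array([0.10, 0.50]), threshold)
```

Under these assumptions, only the highest-activating tokens decide the outcome, so the overlap between in-concept and out-of-concept activations in the bulk of the distribution does not affect detection.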
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Concept Attribution | iSarcasm | Avg F1 | 94 | 258 |
| Concept Attribution | COCO | Average Attribution F1 | 55 | 178 |
| Concept Attribution | CLEVR (test) | F1 Score | 0.85 | 160 |
| Concept Attribution | Pascal | F1 (Concept) | 71 | 80 |
| Concept Attribution | Sarcasm Dataset | Average F1 | 0.74 | 40 |
| Concept Attribution | GoEmotions v1.0 (test) | Average F1 | 42 | 38 |
| Attribution | OpenSurfaces | Avg Attribution F1 | 43 | 18 |
| Attribution | Pascal | Avg Attribution F1 | 50 | 18 |
| Concept Attribution | Sarcasm | Avg Attribution F1 | 42 | 18 |
| Concept Attribution | GoEmotions | Average Attribution F1 | 23 | 18 |