Deep Minds and Shallow Probes

About

Neural representations are not unique objects. Even when two systems realize the same downstream computation, their hidden coordinates may differ by reparameterization. A probe family intended to reveal structure already present in a representation should therefore be stable under the relevant representation symmetries rather than be tied to a particular basis. We study this group action in the tractable exact setting of the final readout layer, where equivalent realizations induce affine changes of hidden coordinates. The resulting symmetry principle singles out a unique hierarchy of shallow coordinate-stable probes, with linear probes as its degree-1 member. We also show that a natural object for cross-model probe transfer is a shared probe-visible quotient (the representation modulo directions invisible to the probe family) rather than the full hidden state. Experiments on synthetic and real-world tasks support both predictions, showing where degree-2 probes help beyond linear ones and how quotient-based transfer enables coverage-aware monitor portability across model families. These results point toward a broader geometric representation theory of neural probing, with coverage-aware monitor transfer as a concrete operational consequence.

Su Hyeong Lee, Risi Kondor • 2026
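
The coordinate-stability claim in the abstract can be checked numerically in a toy setting. The sketch below is not the authors' code; it assumes an invertible affine change of hidden coordinates h' = A h + b and shows that a degree-1 probe and a degree-2 probe on h can each be rewritten as a probe of the same degree on h' with identical outputs. All names (`transport_linear`, `transport_quadratic`, the toy dimensions) are hypothetical and chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 200

# Toy hidden states in one coordinate system (illustrative data, not from the paper).
H = rng.normal(size=(n, d))

# An invertible affine reparameterization h' = A h + b.
# A = Q @ diag(s) with Q orthogonal and s > 0, so A is invertible by construction.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
A = Q @ np.diag(np.linspace(0.5, 2.0, d))
b = rng.normal(size=d)
H_prime = H @ A.T + b


def transport_linear(w, c, A, b):
    """Rewrite f(h) = w.h + c as a linear probe on h' = A h + b."""
    w_new = np.linalg.solve(A.T, w)  # A^{-T} w
    return w_new, c - w_new @ b


def transport_quadratic(M, w, c, A, b):
    """Rewrite f(h) = h^T M h + w.h + c as a degree-2 probe on h' = A h + b."""
    B = np.linalg.inv(A)  # h = B (h' - b)
    M_new = B.T @ M @ B
    w_new = B.T @ w - (M_new + M_new.T) @ b
    c_new = c + b @ M_new @ b - (B.T @ w) @ b
    return M_new, w_new, c_new


def quad_probe(X, M, w, c):
    """Evaluate x^T M x + w.x + c for every row x of X."""
    return np.einsum("ni,ij,nj->n", X, M, X) + X @ w + c


# Arbitrary probe parameters in the original coordinates.
w, c = rng.normal(size=d), 0.3
M = rng.normal(size=(d, d))

# Degree-1: same outputs in both coordinate systems.
w2, c2 = transport_linear(w, c, A, b)
lin_orig = H @ w + c
lin_new = H_prime @ w2 + c2

# Degree-2: same outputs in both coordinate systems.
M3, w3, c3 = transport_quadratic(M, w, c, A, b)
quad_orig = quad_probe(H, M, w, c)
quad_new = quad_probe(H_prime, M3, w3, c3)

print(np.allclose(lin_orig, lin_new), np.allclose(quad_orig, quad_new))
```

The point of the sketch is only that polynomial probes of a fixed degree form a family closed under affine coordinate changes, which is the sense of stability the abstract describes. It does not illustrate the quotient construction or the transfer experiments.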

Related benchmarks

Task	Dataset	Metric	Result
Content Moderation	Safety Evaluation Set Moderation (held-out target labels)	AUROC	0.89	6
Harmful Content Detection	BeaverTails Harmful (held-out target labels)	AUROC	0.793	6
Jailbreaking Detection	Safety Evaluation Set Jailbreaking (held-out target labels)	AUROC	97.4	6
Sentiment Analysis	Safety Evaluation Set Sentiment (held-out target labels)	AUROC	97.5	6
Toxicity Detection	Safety Evaluation Set Toxicity (held-out target labels)	AUROC	97.6	6
Behavioral Reranking	461-prompts safety dataset (test)	Baseline HCR	39.5	2
