
Deep Minds and Shallow Probes

About

Neural representations are not unique objects. Even when two systems realize the same downstream computation, their hidden coordinates may differ by reparameterization. A probe family intended to reveal structure already present in a representation should therefore be stable under the relevant representation symmetries rather than be tied to a particular basis. We study this group action in the tractable exact setting of the final readout layer, where equivalent realizations induce affine changes of hidden coordinates. The resulting symmetry principle singles out a unique hierarchy of shallow coordinate-stable probes, with linear probes as its degree-1 member. We also show that a natural object for cross-model probe transfer is a shared probe-visible quotient (the representation modulo directions invisible to the probe family) rather than the full hidden state. Experiments on synthetic and real-world tasks support both predictions, showing where degree-2 probes help beyond linear ones and how quotient-based transfer enables coverage-aware monitor portability across model families. These results point toward a broader geometric representation theory of neural probing, with coverage-aware monitor transfer as a concrete operational consequence.
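
The coordinate-stability idea in the abstract can be illustrated with a minimal sketch. It assumes, as our reading rather than anything the listing confirms, that the degree-d member of the hierarchy can be modeled by least-squares probes on polynomial features of total degree at most d. The helper names (`poly_features`, `fit_predict`) and the synthetic data are hypothetical. Because an invertible affine map of hidden coordinates sends polynomials of degree at most d to polynomials of degree at most d, the fitted probe outputs should agree in both coordinate systems up to numerical error.

```python
import numpy as np
from itertools import combinations_with_replacement


def poly_features(H, degree):
    """Monomials of total degree <= degree in the columns of H, plus a constant."""
    n, d = H.shape
    cols = [np.ones(n)]
    for k in range(1, degree + 1):
        for idx in combinations_with_replacement(range(d), k):
            cols.append(np.prod(H[:, list(idx)], axis=1))
    return np.stack(cols, axis=1)


def fit_predict(H, y, degree):
    """Fit a least-squares probe on polynomial features; return in-sample outputs."""
    Phi = poly_features(H, degree)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return Phi @ w


rng = np.random.default_rng(0)
H = rng.normal(size=(200, 3))                        # synthetic hidden states
y = H[:, 0] * H[:, 1] + 0.1 * rng.normal(size=200)   # target with a quadratic term

# Hypothetical affine reparameterization h -> A h + b (A invertible here).
A = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)
b = rng.normal(size=3)
H_alt = H @ A.T + b

gaps, mse = {}, {}
for degree in (1, 2):
    pred = fit_predict(H, y, degree)
    pred_alt = fit_predict(H_alt, y, degree)
    gaps[degree] = float(np.max(np.abs(pred - pred_alt)))  # coordinate sensitivity
    mse[degree] = float(np.mean((pred - y) ** 2))          # probe fit quality

print("max prediction gap across coordinates:", gaps)
print("in-sample MSE by degree:", mse)
```

In this toy setting the gap should be at the level of floating-point noise for both degrees, while the degree-2 probe fits the quadratic target far better than the linear one. This mirrors, in a very simplified form, the abstract's claim about where degree-2 probes help beyond linear ones.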

Su Hyeong Lee, Risi Kondor • 2026

Related benchmarks

Task | Dataset | Metric | Value | Rank
Content Moderation | Safety Evaluation Set Moderation (held-out target labels) | AUROC | 0.89 | 6
Harmful Content Detection | BeaverTails Harmful (held-out target labels) | AUROC | 0.793 | 6
Jailbreaking Detection | Safety Evaluation Set Jailbreaking (held-out target labels) | AUROC | 97.4 | 6
Sentiment Analysis | Safety Evaluation Set Sentiment (held-out target labels) | AUROC | 97.5 | 6
Toxicity Detection | Safety Evaluation Set Toxicity (held-out target labels) | AUROC | 97.6 | 6
Behavioral Reranking | 461-prompts safety dataset (test) | Baseline HCR | 39.5 | 2