Linear Probe Accuracy Scales with Model Size and Benefits from Multi-Layer Ensembling

About

Linear probes can detect when language models produce outputs they "know" are wrong, a capability relevant to both deception and reward hacking. However, single-layer probes are fragile: the best layer varies across models and tasks, and probes fail entirely on some deception types. We show that combining probes from multiple layers into an ensemble recovers strong performance even where single-layer probes fail, improving AUROC by 29% on Insider Trading and 78% on Harm-Pressure Knowledge. Across 12 models (0.5B to 176B parameters), we find that probe accuracy improves with scale: roughly 5% AUROC per 10x increase in parameters (R = 0.81). Geometrically, deception directions rotate gradually across layers rather than appearing at one location, which explains both why single-layer probes are brittle and why multi-layer ensembles succeed.
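To make the single-layer versus multi-layer comparison concrete, the sketch below trains one linear probe per layer on synthetic data and then averages the standardized probe scores into an ensemble. Everything in it is an assumption for illustration: the abstract does not specify the probe type or the combination rule, so the difference-of-means probe, the z-score averaging, the toy activations (independent Gaussian noise with the signal placed along a different axis in each layer to mimic a rotating direction), and all function names are hypothetical rather than the authors' method.

```python
import random
from statistics import mean, pstdev

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def make_data(n, n_layers, dim, rng):
    """Toy activations (assumption): unit Gaussian noise plus a +/-0.5 shift
    along axis (layer % dim), so the informative direction differs per layer."""
    labels = [i % 2 for i in range(n)]
    acts = []
    for layer in range(n_layers):
        layer_acts = []
        for y in labels:
            x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            x[layer % dim] += 0.5 if y == 1 else -0.5
            layer_acts.append(x)
        acts.append(layer_acts)
    return acts, labels

def fit_probe(X, y):
    """Difference-of-means linear probe; also returns training-score mean and std."""
    dim = len(X[0])
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    w = [mean(x[j] for x in pos) - mean(x[j] for x in neg) for j in range(dim)]
    train_scores = [dot(w, x) for x in X]
    return w, mean(train_scores), pstdev(train_scores) or 1.0

def ensemble_scores(probes, acts):
    """Average the z-scored probe outputs across layers for each example."""
    n = len(acts[0])
    return [
        mean((dot(w, acts[layer][i]) - mu) / sd
             for layer, (w, mu, sd) in enumerate(probes))
        for i in range(n)
    ]

rng = random.Random(0)
N_LAYERS, DIM = 4, 8
train_acts, train_y = make_data(200, N_LAYERS, DIM, rng)
test_acts, test_y = make_data(200, N_LAYERS, DIM, rng)

probes = [fit_probe(layer_X, train_y) for layer_X in train_acts]
single_aucs = [
    auroc([dot(w, x) for x in test_acts[layer]], test_y)
    for layer, (w, _, _) in enumerate(probes)
]
ens_auc = auroc(ensemble_scores(probes, test_acts), test_y)

print("single-layer AUROC:", [round(a, 3) for a in single_aucs])
print("ensemble AUROC:", round(ens_auc, 3))
```

In this toy setup each layer carries only a weak, independently noisy signal, so averaging across layers tends to help. Real residual-stream activations are correlated across layers, so the size of any gain there is an empirical question that this sketch does not answer.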

Erik Nordby, Tasha Pais, Aviel Parrack • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Deception Detection | Liars' Bench Insider Trading (test) | AUROC 0.953 | 3 |
| Deception Detection | Liars' Bench Harm-Pressure Knowledge (test) | AUROC 0.91 | 3 |
| Deception Detection | Liars' Bench Instructed Deception (test) | AUROC 0.889 | 3 |
| Deception Detection | Liars' Bench Harm-Pressure Choice (test) | AUROC 0.909 | 3 |
| Deception Detection | Liars' Bench Convincing Game (test) | AUROC 1 | 3 |