Log-Likelihood, Simpson's Paradox, and the Detection of Machine-Generated Text

About

The ability to reliably distinguish human-written text from that generated by large language models is of profound societal importance. The dominant approach to this problem exploits the likelihood hypothesis: that machine-generated text should appear more probable to a detector language model than human-written text. However, we demonstrate that the token-level signal distinguishing human and machine text is non-uniform across the hidden space of the detector model, and naively averaging likelihood-based token scores across regions with fundamentally different statistical structure, as most detectors do, causes a form of Simpson's paradox: a strong local signal is destroyed by inappropriate aggregation. To correct for this, we introduce a learned local calibration step grounded in Bayesian decision theory. Rather than aggregating raw token scores, we first learn lightweight predictors of the score distributions conditioned on position in hidden space, and aggregate calibrated log-likelihood ratios instead. This single intervention dramatically and consistently improves detection performance across all baseline detectors and all datasets we consider. For example, our calibrated variant of Fast-DetectGPT improves AUROC from $0.63$ to $0.85$ on GPT-5.4 text, and a locally-calibrated DMAP detector we introduce achieves state-of-the-art performance across the board. That said, our central contribution is not a new detector, but a precise diagnosis of a significant cause of under-performance of existing detectors and a principled, modular remedy compatible with any token-averaging pipeline. This will serve as a foundation for the community to build upon, with natural avenues including richer distributional models, improved calibration strategies, and principled ensembling with hidden-space geometry signals via the full Bayes-optimal decision rule.
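The aggregation failure described in the abstract can be illustrated with a small toy simulation. The sketch below is not the authors' implementation: it assumes two discrete regions, "A" and "B", as a stand-in for position in hidden space, and uses per-region Gaussian fits as a stand-in for the learned lightweight predictors. All names and numbers (`REGIONS`, `SHIFT`, `sample_doc`, and so on) are hypothetical. Within each region machine tokens score higher than human tokens, but the regions have different baselines and the two classes visit them at different rates, so the plain average points the wrong way while the averaged per-region log-likelihood ratio does not.

```python
import math
import random
from statistics import mean, stdev

random.seed(0)

# Hypothetical toy setup: region -> (baseline score, P(region | human), P(region | machine)).
REGIONS = {"A": (0.0, 0.3, 0.7), "B": (5.0, 0.7, 0.3)}
SHIFT = 1.0  # within every region, machine tokens score higher by this amount


def sample_doc(is_machine, n_tokens=50):
    """Return a list of (region, score) pairs for one toy document."""
    p_a = REGIONS["A"][2] if is_machine else REGIONS["A"][1]
    doc = []
    for _ in range(n_tokens):
        region = "A" if random.random() < p_a else "B"
        mu = REGIONS[region][0] + (SHIFT if is_machine else 0.0)
        doc.append((region, random.gauss(mu, 1.0)))
    return doc


def fit_gaussians(docs):
    """Fit one (mean, std) per region from all tokens in the given documents."""
    params = {}
    for region in REGIONS:
        scores = [s for doc in docs for r, s in doc if r == region]
        params[region] = (mean(scores), stdev(scores))
    return params


def log_normal_pdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)


def naive_score(doc):
    """Plain average of raw token scores, ignoring region."""
    return mean(s for _, s in doc)


def calibrated_score(doc, machine_params, human_params):
    """Average of per-token log-likelihood ratios, conditioned on region."""
    return mean(
        log_normal_pdf(s, *machine_params[r]) - log_normal_pdf(s, *human_params[r])
        for r, s in doc
    )


def auroc(pos, neg):
    """Probability that a random positive outranks a random negative."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Fit the per-region score models on training documents, evaluate on fresh ones.
train_h = [sample_doc(False) for _ in range(200)]
train_m = [sample_doc(True) for _ in range(200)]
human_params, machine_params = fit_gaussians(train_h), fit_gaussians(train_m)

test_h = [sample_doc(False) for _ in range(200)]
test_m = [sample_doc(True) for _ in range(200)]

naive_auc = auroc([naive_score(d) for d in test_m], [naive_score(d) for d in test_h])
calibrated_auc = auroc(
    [calibrated_score(d, machine_params, human_params) for d in test_m],
    [calibrated_score(d, machine_params, human_params) for d in test_h],
)
print(f"naive AUROC: {naive_auc:.3f}, calibrated AUROC: {calibrated_auc:.3f}")
```

In this toy setup the naive average should land below chance, because human documents spend more tokens in the high-baseline region, while the calibrated score should separate the classes almost perfectly. The sketch deliberately ignores the region frequencies themselves, which correspond to the hidden-space geometry signal that the abstract leaves to the full Bayes-optimal rule.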

Tom Kempton, Viktor Drobnyi, Maeve Madigan, Stuart Burrell • 2026

Related benchmarks

Task	Dataset	Metric	Result
AI Text Detection	Modern RAID Claude generated 200 tokens	TPR@0.1%	37.34	14
AI-generated text detection	RAID GPT-4 classic (test)	TPR @ 0.1% Error	30.79	14
AI-generated text detection	Peer-Review GPT-4o generated	TPR @ 0.1% FPR	33.04	14
AI-generated text detection	Peer-Review Gemini generated	TPR @ FPR=0.1%	43.96	14
AI-generated text detection	Peer-Review Claude generated	TPR@FPR=0.1%	64.02	14
AI Text Detection	Modern RAID GPT-5 generated 200 tokens	TPR@0.1%	29.14	14
AI Text Detection	Modern RAID Gemini generated 200 tokens	TPR@0.1% FPR	23.13	14
AI-generated text detection	RAID ChatGPT Classic (test)	TPR@0.1%	67.62	14
AI Text Detection	Peer-Review	TPR@0.1%	67.36	6
AI Text Detection	Modern RAID	TPR @ FPR=0.1%	2.79	6

Showing 10 of 16 rows
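
The results above are reported as a true positive rate at a fixed low false positive rate. As a minimal sketch of how such a number can be computed, assuming the metric means TPR at 0.1% FPR as several rows state, the hypothetical helper below picks the strictest threshold that flags at most the allowed fraction of human texts and then measures how many machine texts exceed it. The function name and its arguments are illustrative and not taken from the paper or the benchmark.

```python
def tpr_at_fpr(human_scores, machine_scores, target_fpr=0.001):
    """TPR at a threshold whose FPR on human texts is at most target_fpr.

    A text is flagged as machine-generated when its score is strictly
    greater than the threshold.
    """
    ranked = sorted(human_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # human texts we may wrongly flag
    # The (allowed + 1)-th largest human score: at most `allowed` human
    # scores lie strictly above it.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    flagged = sum(s > threshold for s in machine_scores)
    return flagged / len(machine_scores)


# Tiny usage example with a 10% budget so the arithmetic is easy to follow:
# human scores 0..9, one false positive allowed, so the threshold is 8.
example = tpr_at_fpr(list(range(10)), [8.5, 12, 20, 3], target_fpr=0.1)
print(example)
```

With very few human texts the budget rounds down to zero false positives, so a 0.1% operating point only becomes meaningful with at least a thousand human examples.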