Log-Likelihood, Simpson's Paradox, and the Detection of Machine-Generated Text

About

The ability to reliably distinguish human-written text from that generated by large language models is of profound societal importance. The dominant approach to this problem exploits the likelihood hypothesis: that machine-generated text should appear more probable to a detector language model than human-written text. However, we demonstrate that the token-level signal distinguishing human and machine text is non-uniform across the hidden space of the detector model, and naively averaging likelihood-based token scores across regions with fundamentally different statistical structure, as most detectors do, causes a form of Simpson's paradox: a strong local signal is destroyed by inappropriate aggregation. To correct for this, we introduce a learned local calibration step grounded in Bayesian decision theory. Rather than aggregating raw token scores, we first learn lightweight predictors of the score distributions conditioned on position in hidden space, and aggregate calibrated log-likelihood ratios instead. This single intervention dramatically and consistently improves detection performance across all baseline detectors and all datasets we consider. For example, our calibrated variant of Fast-DetectGPT improves AUROC from $0.63$ to $0.85$ on GPT-5.4 text, and a locally-calibrated DMAP detector we introduce achieves state-of-the-art performance across the board. That said, our central contribution is not a new detector, but a precise diagnosis of a significant cause of under-performance of existing detectors and a principled, modular remedy compatible with any token-averaging pipeline. This will serve as a foundation for the community to build upon, with natural avenues including richer distributional models, improved calibration strategies, and principled ensembling with hidden-space geometry signals via the full Bayes-optimal decision rule.
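The aggregation failure described above can be illustrated with a small synthetic sketch. This is not the authors' implementation. It assumes two discrete "regions" as a stand-in for position in hidden space, unit-variance Gaussian token scores, and per-region Gaussian fits in place of the learned predictors mentioned in the abstract. All names (`REGIONS`, `sample_doc`, `fit_region_stats`, `calibrated_score`, `run_demo`) are hypothetical. The direction of the human/machine shift is deliberately flipped between the two regions, so the raw average cancels while the averaged log-likelihood ratio does not.

```python
import math
import random

# Hypothetical toy setup: two discrete "regions" stand in for positions in
# hidden space. Each maps to (human_mean, machine_mean) of a unit-variance
# token score. The direction of the signal flips between regions.
REGIONS = {0: (0.0, 1.0), 1: (5.0, 4.0)}


def sample_doc(is_machine, n_tokens, rng):
    """Draw one synthetic document as a list of (region, score) pairs."""
    doc = []
    for _ in range(n_tokens):
        region = rng.randrange(len(REGIONS))
        mean = REGIONS[region][1 if is_machine else 0]
        doc.append((region, rng.gauss(mean, 1.0)))
    return doc


def fit_region_stats(docs):
    """Fit a Gaussian (mean, variance) to the scores seen in each region."""
    stats = {}
    for region in REGIONS:
        xs = [s for doc in docs for r, s in doc if r == region]
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        stats[region] = (mu, var)
    return stats


def log_gauss(x, mu, var):
    """Log density of a Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


def raw_score(doc):
    """Baseline: average the raw token scores, ignoring region."""
    return sum(s for _, s in doc) / len(doc)


def calibrated_score(doc, human_stats, machine_stats):
    """Average per-token log-likelihood ratios (machine vs. human)."""
    llrs = [
        log_gauss(s, *machine_stats[r]) - log_gauss(s, *human_stats[r])
        for r, s in doc
    ]
    return sum(llrs) / len(llrs)


def auroc(pos, neg):
    """Probability that a positive outscores a negative (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def run_demo(seed=0, n_docs=200, n_tokens=50):
    """Return (raw AUROC, calibrated AUROC) on held-out synthetic documents."""
    rng = random.Random(seed)

    def make(is_machine):
        return [sample_doc(is_machine, n_tokens, rng) for _ in range(n_docs)]

    train_h, train_m = make(False), make(True)
    test_h, test_m = make(False), make(True)
    h_stats, m_stats = fit_region_stats(train_h), fit_region_stats(train_m)
    raw = auroc([raw_score(d) for d in test_m], [raw_score(d) for d in test_h])
    cal = auroc(
        [calibrated_score(d, h_stats, m_stats) for d in test_m],
        [calibrated_score(d, h_stats, m_stats) for d in test_h],
    )
    return raw, cal


if __name__ == "__main__":
    raw_auc, cal_auc = run_demo()
    print(f"raw averaging AUROC:      {raw_auc:.3f}")
    print(f"locally calibrated AUROC: {cal_auc:.3f}")
```

In this toy setting the raw average should sit near chance, because the expected token score is the same for both classes once the regions are mixed, while the region-conditioned log-likelihood ratio should separate the classes well. Real detectors operate over a continuous hidden space, so the discrete regions and Gaussian fits here are only a simplification for illustration.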

Tom Kempton, Viktor Drobnyi, Maeve Madigan, Stuart Burrell • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| AI Text Detection | Modern RAID Claude generated 200 tokens | TPR@0.1% | 37.34 | 14 |
| AI-generated text detection | RAID GPT-4 classic (test) | TPR @ 0.1% Error | 30.79 | 14 |
| AI-generated text detection | Peer-Review GPT-4o generated | TPR @ 0.1% FPR | 33.04 | 14 |
| AI-generated text detection | Peer-Review Gemini generated | TPR @ FPR=0.1% | 43.96 | 14 |
| AI-generated text detection | Peer-Review Claude generated | TPR@FPR=0.1% | 64.02 | 14 |
| AI Text Detection | Modern RAID GPT-5 generated 200 tokens | TPR@0.1% | 29.14 | 14 |
| AI Text Detection | Modern RAID Gemini generated 200 tokens | TPR@0.1% FPR | 23.13 | 14 |
| AI-generated text detection | RAID ChatGPT Classic (test) | TPR@0.1% | 67.62 | 14 |
| AI Text Detection | Peer-Review | TPR@0.1% | 67.36 | 6 |
| AI Text Detection | Modern RAID | TPR @ FPR=0.1% | 2.79 | 6 |

Showing 10 of 16 rows
