Detecting Subtle Differences between Human and Model Languages Using Spectrum of Relative Likelihood
About
Human and model-generated texts can be distinguished by examining the magnitude of likelihood in language. However, this is becoming increasingly difficult as language models' capability to generate human-like text keeps evolving. This study provides a new perspective by using relative likelihood values instead of absolute ones and extracting useful features from the spectrum view of likelihood for the human-model text detection task. We propose a detection procedure with two classification methods, supervised and heuristic-based, which achieves performance competitive with previous zero-shot detection methods and a new state of the art on short-text detection. Our method can also reveal subtle differences between human and model languages, which have theoretical roots in psycholinguistic studies. Our code is available at https://github.com/CLCS-SUSTech/FourierGPT
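The core idea, scoring a text by the shape of its likelihood sequence rather than its magnitude, can be illustrated with a minimal sketch. It assumes per-token log-probabilities have already been obtained from some language model (not shown), interprets "relative likelihood" as z-score normalization of that sequence, and uses a naive discrete Fourier transform instead of an FFT library to stay dependency-free. The function names (`relative_likelihood`, `power_spectrum`, `likelihood_spectrum`) are hypothetical and are not taken from the FourierGPT repository; the linked code remains the reference for the authors' actual procedure.

```python
import cmath
import math
from typing import List, Sequence


def relative_likelihood(token_logprobs: Sequence[float]) -> List[float]:
    """Z-score per-token log-likelihoods so only their relative shape remains.

    Assumption: "relative likelihood" is modeled here as z-score normalization.
    """
    n = len(token_logprobs)
    mean = sum(token_logprobs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in token_logprobs) / n)
    if std == 0.0:  # constant sequence: no relative variation to keep
        return [0.0] * n
    return [(x - mean) / std for x in token_logprobs]


def power_spectrum(signal: Sequence[float]) -> List[float]:
    """Naive O(n^2) DFT; returns |X_k|^2 for non-negative frequencies k = 0..n//2."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        x_k = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(x_k) ** 2)
    return spectrum


def likelihood_spectrum(token_logprobs: Sequence[float]) -> List[float]:
    """Hypothetical feature extractor: power spectrum of the relative likelihood sequence."""
    return power_spectrum(relative_likelihood(token_logprobs))


if __name__ == "__main__":
    # Made-up log-probabilities for an 8-token text (illustrative only).
    logprobs = [-2.1, -0.5, -3.3, -1.0, -0.2, -4.0, -1.7, -0.9]
    # The same text scored by a uniformly "more confident" model: shifted and rescaled.
    rescaled = [0.5 * x + 1.0 for x in logprobs]
    print([round(p, 3) for p in likelihood_spectrum(logprobs)])
    print([round(p, 3) for p in likelihood_spectrum(rescaled)])
```

Because the sequence is z-scored before the transform, any shift or positive rescaling of the log-probabilities leaves the spectrum unchanged; only the relative rise and fall of likelihood across tokens survives. A spectrum like this could then feed a supervised classifier or a hand-set threshold, neither of which this sketch attempts to reproduce.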
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Machine-generated text detection | TruthfulQA | TPR@FPR-1% (ChatGLM) | 98.15 | 54 |
| Machine-generated text detection | Essay (test) | GPT4All Score | 98.13 | 39 |
| AI-generated text detection | Essay | AUROC (GPT4All) | 99.68 | 35 |
| LLM-generated text detection | EvoBench | LLaMA3 Score | 63.99 | 26 |
| Machine-generated text detection | MAGE | AUROC (Avg) | 60.34 | 24 |
| LLM-generated text detection | Xsum, WritingPrompts, and SQuAD generated by GPT-5-Chat (test) | AUROC | 64.82 | 15 |
| LLM-generated text detection | Xsum, WritingPrompts, and SQuAD Gemini-1.5-Flash (test) | AUROC | 61.25 | 15 |
| LLM-generated text detection | Xsum, WritingPrompts, and SQuAD generated by GPT-4.1-mini (test) | AUROC | 63.05 | 15 |
| LLM-generated text detection | Xsum, WritingPrompts, and SQuAD Aggregated (test) | GPT2-XL | 54.72 | 15 |
| Machine-generated text detection | DetectRL Training Text: ChatGPT | -- | -- | 12 |
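Several rows report AUROC or TPR at a 1% false-positive rate. As a hedged illustration of what these metrics measure, the sketch below computes both from detector scores, assuming higher scores mean "more likely machine-generated" and treating human-written texts as negatives. The table values appear to be percentages, whereas these functions return fractions. The function names are hypothetical, and the benchmarks may use library routines with slightly different tie-handling or threshold conventions.

```python
from typing import Sequence


def auroc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """AUROC as P(score_machine > score_human), counting ties as 1/2 (Mann-Whitney form)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def tpr_at_fpr(pos: Sequence[float], neg: Sequence[float], max_fpr: float = 0.01) -> float:
    """Best TPR over thresholds t (predict 'machine' if score >= t) whose FPR <= max_fpr."""
    # Every observed score is a candidate threshold; +inf predicts nothing as machine.
    candidates = sorted(set(pos) | set(neg)) + [float("inf")]
    best = 0.0
    for t in candidates:
        fpr = sum(n >= t for n in neg) / len(neg)
        if fpr <= max_fpr:
            tpr = sum(p >= t for p in pos) / len(pos)
            best = max(best, tpr)
    return best


if __name__ == "__main__":
    machine_scores = [0.9, 0.8, 0.7]  # made-up detector scores
    human_scores = [0.1, 0.2, 0.75]
    print(auroc(machine_scores, human_scores))       # 8/9 of pairs ranked correctly
    print(tpr_at_fpr(machine_scores, human_scores))  # threshold must exceed 0.75
```

The strict 1% FPR cap forces the threshold above every human score in this tiny example, which is why TPR@FPR-1% can be much lower than AUROC when a few human texts receive high scores.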