MAGE: Machine-generated Text Detection in the Wild
About
Large language models (LLMs) have achieved human-level text generation, emphasizing the need for effective AI-generated text detection to mitigate risks like the spread of fake news and plagiarism. Existing research has been constrained by evaluating detection methods on specific domains or particular language models. In practical scenarios, however, the detector faces texts from various domains or LLMs without knowing their sources. To this end, we build a comprehensive testbed by gathering texts from diverse human writings and texts generated by different LLMs. Empirical results show that distinguishing machine-generated texts from human-authored ones is challenging across various scenarios, especially in out-of-distribution settings. These challenges are due to the decreasing linguistic distinctions between the two sources. Despite these challenges, the top-performing detector can identify 86.54% of out-of-domain texts generated by a new LLM, indicating feasibility for application scenarios. We release our resources at https://github.com/yafuly/MAGE.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| LLM-generated text detection | RAID Abstract | ROC AUC | 100 | 8 |
| LLM-generated text detection | RAID Reviews | ROC AUC | 1 | 8 |
| Detection of LLM generated text | MAGE Topic-based 3.5-turbo | Detection Accuracy | 99.98 | 8 |
| Detection of LLM generated text | MAGE News | ROC AUC @ FPR=1% | 0.00e+0 | 8 |
| Detection of LLM generated text | Xsum | Paraphrase (4o-mini) | 0.0155 | 8 |
| Detection of LLM generated text | MAGE News Topic-based 3.5-turbo | Detection Performance | 99.57 | 8 |
| LLM-generated text detection | MAGE News short text (<= 30 words) | AUROC | 89.44 | 8 |
| LLM-generated text detection | MAGE QA short text (<= 30 words) | AUROC | 0.9345 | 8 |
| LLM-generated text detection | RAID Poetry | ROC AUC | 95.88 | 8 |
| LLM-generated text detection | RAID Wikipedia-related samples | GPT-4 Performance Score | 97.05 | 8 |