RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors
About
Many commercial and open-source models claim to detect machine-generated text with extremely high accuracy (99% or more). However, very few of these detectors are evaluated on shared benchmark datasets, and even when they are, the datasets used for evaluation are insufficiently challenging: they lack variation in sampling strategies, adversarial attacks, and open-source generative models. In this work we present RAID, the largest and most challenging benchmark dataset for machine-generated text detection. RAID includes over 6 million generations spanning 11 models, 8 domains, 11 adversarial attacks, and 4 decoding strategies. Using RAID, we evaluate the out-of-domain and adversarial robustness of 8 open- and 4 closed-source detectors and find that current detectors are easily fooled by adversarial attacks, variations in sampling strategy, repetition penalties, and unseen generative models. We release our data along with a leaderboard to encourage future research.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| LLM-generated text detection | RAID News | ROC AUC | 100 | 8 |
| LLM-generated text detection | RAID Recipe | ROC AUC | 0.9973 | 8 |
| Detection of LLM-generated text | XSum Paraphrase 4o-mini | Detection Performance | 75.54 | 8 |
| Detection of LLM-generated text | XSum Paraphrase 4o | Detection Performance | 0.7069 | 8 |
| Detection of LLM-generated text | XSum Revise 4o-mini | Detection Performance | 77.28 | 8 |
| Detection of LLM-generated text | XSum Polish 4o-mini | Detection Performance | 76.65 | 8 |
| LLM-generated text detection | RAID Books | ROC AUC | 99.95 | 8 |
| Composite Text Detection | RAID Human and Paraphrase | ROC AUC | 0.5846 | 8 |
| Composite Text Detection | RAID Human and Revise | ROC AUC | 0.6082 | 8 |
| Composite Text Detection | RAID Paraphrase and Revise | ROC AUC | 66.71 | 8 |
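Most of the results above are reported as ROC AUC, which for a detector is the probability that a randomly chosen machine-generated text receives a higher detector score than a randomly chosen human-written one. The sketch below illustrates this pairwise definition in plain Python; the labels and scores are hypothetical examples, not RAID data or the official evaluation code.

```python
# Hedged sketch: ROC AUC as the probability that a detector scores a
# machine-generated text above a human-written one (ties count as 0.5).
# All labels and scores below are illustrative, not taken from RAID.

def roc_auc(labels, scores):
    """Compute ROC AUC by exhaustive pairwise comparison (fine for small examples)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]  # machine-generated
    neg = [s for l, s in zip(labels, scores) if l == 0]  # human-written
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 1 = machine-generated, 0 = human-written (hypothetical detector output)
labels = [1, 1, 1, 0, 0, 0]
scores = [0.92, 0.85, 0.40, 0.30, 0.10, 0.55]
print(roc_auc(labels, scores))  # 8 of 9 pairs ranked correctly, ~0.8889
```

For large evaluation sets one would use an O(n log n) implementation such as `sklearn.metrics.roc_auc_score` rather than this quadratic version, but the pairwise form makes the metric's meaning explicit.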