StealthRL: Reinforcement Learning Paraphrase Attacks for Multi-Detector Evasion of AI-Text Detectors
About
AI-text detectors face a critical robustness challenge: adversarial paraphrasing attacks that preserve semantics while evading detection. We introduce StealthRL, a reinforcement learning framework that stress-tests detector robustness under realistic adversarial conditions. StealthRL trains a paraphrase policy against a multi-detector ensemble using Group Relative Policy Optimization (GRPO) with LoRA adapters on Qwen3-4B, optimizing a composite reward that balances detector evasion with semantic preservation. We evaluate six attack settings (M0-M5) on the full filtered MAGE test pool (15,310 human / 14,656 AI) against four detectors: RoBERTa, Fast-DetectGPT, Binoculars, and MAGE. StealthRL achieves near-zero detection on three of the four detectors and a 0.024 mean TPR@1%FPR, reducing mean AUROC from 0.79 to 0.43 and attaining a 97.6% attack success rate. Critically, attacks transfer to two held-out detectors not seen during training, revealing shared architectural vulnerabilities rather than detector-specific brittleness. We additionally conduct LLM-based quality evaluation via Likert scoring on 500 matched samples per method, analyze detector score distributions to explain why evasion succeeds, and provide per-detector AUROC with bootstrap confidence intervals. Our results expose significant robustness gaps in current AI-text detection and establish StealthRL as a principled adversarial evaluation protocol. Code and evaluation pipeline are publicly available at https://github.com/suraj-ranganath/StealthRL.
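The headline metrics above (TPR@1%FPR and attack success rate) can be illustrated with a minimal sketch. It is not the authors' evaluation pipeline, and it rests on three assumptions: a higher detector score means "more likely AI-written"; the decision threshold is calibrated on human-written texts so that at most 1% of them are flagged; and the attack success rate is the complement of the TPR at that threshold, which is consistent with the reported 0.024 and 97.6% but is not confirmed by the abstract. The function names are hypothetical.

```python
import math


def threshold_at_fpr(human_scores, target_fpr=0.01):
    """Return a threshold tau such that at most `target_fpr` of the
    human-written texts score strictly above tau (assumed convention:
    higher score = more AI-like)."""
    ranked = sorted(human_scores, reverse=True)
    # Number of human texts we are allowed to flag by mistake.
    allowed = math.floor(target_fpr * len(ranked))
    # Guard against target_fpr == 1.0 indexing past the end.
    allowed = min(allowed, len(ranked) - 1)
    return ranked[allowed]


def tpr_at_fpr(human_scores, ai_scores, target_fpr=0.01):
    """Fraction of AI-written texts flagged at the calibrated threshold."""
    tau = threshold_at_fpr(human_scores, target_fpr)
    flagged = sum(1 for score in ai_scores if score > tau)
    return flagged / len(ai_scores)


def attack_success_rate(human_scores, attacked_scores, target_fpr=0.01):
    """Assumed definition: share of paraphrased AI texts that evade
    detection at the calibrated threshold, i.e. 1 - TPR."""
    return 1.0 - tpr_at_fpr(human_scores, attacked_scores, target_fpr)


if __name__ == "__main__":
    # Toy scores for one hypothetical detector, not data from the paper.
    human = [i / 100 for i in range(100)]
    attacked = [0.99, 0.995, 0.5, 0.2]
    print(tpr_at_fpr(human, attacked))
    print(attack_success_rate(human, attacked))
```

A mean TPR@1%FPR like the one reported would then be the average of `tpr_at_fpr` over the four detectors, each with its own threshold calibrated on the human pool.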
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| AI-text detector attack effectiveness | RAID (evaluation) | MAGE ASR | 3 | 22 |
| Detection Evasion | MAGE | ASR | 99.9 | 18 |
| Adversarial attack on AI-text detectors | Peer-review (evaluation set) | RoBERTa ASR | 40 | 12 |
| AI Detector Evasion | MAGE (evaluation set) | ASR (τ=0.5) | 2.1 | 12 |
| AI-text detector evasion | M4 evaluation set | MAGE ASR | 1 | 12 |
| Paraphrase Quality Assessment | MAGE shared subset (evaluation 300 AI-written samples) | PPL | 61.6 | 12 |
| AI-text detector evasion | RAID | ASR (τ=0.5) | 25.6 | 10 |