StealthRL: Reinforcement Learning Paraphrase Attacks for Multi-Detector Evasion of AI-Text Detectors

About

AI-text detectors face a critical robustness challenge: adversarial paraphrasing attacks that preserve semantics while evading detection. We introduce StealthRL, a reinforcement learning framework that stress-tests detector robustness under realistic adversarial conditions. StealthRL trains a paraphrase policy against a multi-detector ensemble using Group Relative Policy Optimization (GRPO) with LoRA adapters on Qwen3-4B, optimizing a composite reward that balances detector evasion with semantic preservation. We evaluate six attack settings (M0-M5) on the full filtered MAGE test pool (15,310 human / 14,656 AI) against four detectors: RoBERTa, Fast-DetectGPT, Binoculars, and MAGE. StealthRL achieves near-zero detection on three of the four detectors and a 0.024 mean TPR@1%FPR, reducing mean AUROC from 0.79 to 0.43 and attaining a 97.6% attack success rate. Critically, attacks transfer to two held-out detectors not seen during training, revealing shared architectural vulnerabilities rather than detector-specific brittleness. We additionally conduct LLM-based quality evaluation via Likert scoring on 500 matched samples per method, analyze detector score distributions to explain why evasion succeeds, and provide per-detector AUROC with bootstrap confidence intervals. Our results expose significant robustness gaps in current AI-text detection and establish StealthRL as a principled adversarial evaluation protocol. Code and evaluation pipeline are publicly available at https://github.com/suraj-ranganath/StealthRL.
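The headline metric in the abstract, TPR@1%FPR, can be illustrated with a minimal sketch. It assumes each detector emits a score where higher means "more likely AI". The threshold is calibrated on human-written texts so that at most 1% of them are flagged, and the true positive rate is then measured on AI (or paraphrased) texts. The function names `threshold_at_fpr` and `tpr_at_fpr` are hypothetical and do not come from the StealthRL codebase. Treating attack success rate as 1 − TPR is an assumption: it matches the reported 0.024 and 97.6% figures, but the abstract does not state the definition.

```python
import math
from typing import Sequence


def threshold_at_fpr(human_scores: Sequence[float], target_fpr: float = 0.01) -> float:
    """Return a threshold that flags at most `target_fpr` of human texts.

    A text is flagged when its score is strictly greater than the threshold.
    """
    ranked = sorted(human_scores, reverse=True)
    # Number of human texts we are allowed to flag (small epsilon guards
    # against floating-point products landing just below an integer).
    allowed = math.floor(target_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return float("-inf")
    # Flagging only scores strictly above the (allowed + 1)-th highest human
    # score keeps the false positive rate at or below the target.
    return ranked[allowed]


def tpr_at_fpr(
    human_scores: Sequence[float],
    ai_scores: Sequence[float],
    target_fpr: float = 0.01,
) -> float:
    """Fraction of AI texts flagged at the threshold calibrated on human texts."""
    tau = threshold_at_fpr(human_scores, target_fpr)
    flagged = sum(1 for score in ai_scores if score > tau)
    return flagged / len(ai_scores)


# Toy scores, invented purely for illustration (not data from the paper).
human = [i / 100 for i in range(100)]      # 100 human-written texts
original_ai = [0.995, 0.99, 0.985, 0.5]    # AI texts before paraphrasing
paraphrased_ai = [0.2, 0.3, 0.99, 0.1]     # the same texts after an attack

tpr_before = tpr_at_fpr(human, original_ai)
tpr_after = tpr_at_fpr(human, paraphrased_ai)
# Assumed definition: attack success rate = 1 - TPR at the fixed FPR.
attack_success_rate = 1.0 - tpr_after
```

A mean TPR@1%FPR, as reported above, would then be the average of `tpr_at_fpr` across the detectors in the ensemble, each with its own calibrated threshold.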

Suraj Ranganath, Atharv Ramesh • 2026

Related benchmarks

Task	Dataset	Result
AI-text detector attack effectiveness	RAID (evaluation)	MAGE ASR 3	22
Detection Evasion	MAGE	ASR 99.9	18
Adversarial attack on AI-text detectors	Peer-review (evaluation set)	RoBERTa ASR 40	12
AI Detector Evasion	MAGE (evaluation set)	ASR (τ=0.5) 2.1	12
AI-text detector evasion	M4 evaluation set	MAGE ASR 1	12
Paraphrase Quality Assessment	MAGE shared subset (evaluation 300 AI-written samples)	PPL 61.6	12
AI-text detector evasion	RAID	ASR (τ=0.5) 25.6	10

