Detector-Evasive LLM Paraphrasing via Constrained Policy Optimization

About

AI-text detectors are vulnerable to paraphrasing and detector-guided paraphrasing attacks, but existing detector-evasion methods often lack precise control over semantic preservation. In particular, optimizing directly for detector evasion can degrade fine-grained semantics, whereas scalarized reward designs provide only indirect, weight-sensitive control over the evasion-semantics trade-off. We address this limitation by formulating detector-evasive LLM paraphrasing as a Constrained Markov Decision Process, where detector evasion is the primary objective and semantic preservation is enforced as an explicit constraint. We propose Detector Evasion Policy Optimization (DEPO), a Lagrangian primal-dual reinforcement learning algorithm with a novel GRPO-style group-based policy update. DEPO adaptively balances semantic preservation and detector evasion during training, enabling the policy to improve attack success within a prescribed semantic-preservation region. Experiments on MAGE, M4, RAID, and peer-review datasets, evaluated against MAGE, RoBERTa, RADAR, Binoculars, and Fast-DetectGPT detectors, show that DEPO achieves strong detector evasion while precisely satisfying the semantic preservation constraint. DEPO also exhibits cross-domain, cross-detector, and prompt-level robustness.
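The abstract names two ingredients: a Lagrangian primal-dual scheme and a GRPO-style group-based update. It does not give the update rules. As a rough illustration of how such pieces commonly fit together, the sketch below shows a generic version and should not be read as the authors' DEPO algorithm. The function names, the reward and constraint scores, the threshold, and the learning rate are all hypothetical. It assumes each sampled paraphrase in a group has an evasion score and a semantic-similarity score, both where higher is better.

```python
import statistics


def group_advantages(evasion, semantic, lam):
    """Group-normalized advantages for a Lagrangian-combined reward.

    Assumption: the per-sample signal is evasion + lam * semantic, and
    advantages are standardized within the sampled group (GRPO-style).
    """
    combined = [e + lam * s for e, s in zip(evasion, semantic)]
    mean = statistics.fmean(combined)
    std = statistics.pstdev(combined)
    # The small epsilon avoids division by zero when all samples tie.
    return [(c - mean) / (std + 1e-8) for c in combined]


def dual_update(lam, semantic, threshold, lr):
    """Projected gradient step on the Lagrange multiplier.

    The multiplier grows when mean semantic similarity falls below the
    threshold and shrinks otherwise; it is clipped to stay non-negative.
    """
    violation = threshold - statistics.fmean(semantic)
    return max(0.0, lam + lr * violation)


if __name__ == "__main__":
    # Hypothetical scores for a group of four sampled paraphrases.
    evasion = [0.9, 0.4, 0.7, 0.2]
    semantic = [0.6, 0.95, 0.8, 0.9]
    lam = 1.0
    print(group_advantages(evasion, semantic, lam))
    print(dual_update(lam, semantic, threshold=0.85, lr=0.5))
```

In a sketch like this, the multiplier plays the role that a hand-tuned weight plays in a scalarized reward, except that it is adjusted during training according to how far the semantic constraint is from being met.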

Mingyi Wang, Zhuoer Shen, Yuheng Bu, Shaofeng Zou • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| AI-text detector attack effectiveness | RAID (evaluation) | MAGE ASR 86 | 22 |
| Detection Evasion | MAGE | ASR 73 | 18 |
| AI-text detector evasion | M4 evaluation set | MAGE ASR 70 | 12 |
| AI Detector Evasion | MAGE (evaluation set) | ASR (τ=0.5) 72.5 | 12 |
| Adversarial attack on AI-text detectors | Peer-review (evaluation set) | RoBERTa ASR 41 | 12 |
| Paraphrase Quality Assessment | MAGE shared subset (evaluation 300 AI-written samples) | PPL 20.6 | 12 |
| AI-text detector evasion | RAID | ASR (τ=0.5) 95.2 | 10 |
