Certified Robustness to Text Adversarial Attacks by Randomized [MASK]
About
Recently, a few certified defense methods have been developed to provably guarantee the robustness of a text classifier against adversarial synonym substitutions. However, all existing certified defense methods assume that the defenders know how the adversaries generate synonyms, which is not a realistic scenario. In this paper, we propose a certifiably robust defense method that randomly masks a certain proportion of the words in an input text, making the above unrealistic assumption unnecessary. The proposed method can defend against not only word substitution-based attacks, but also character-level perturbations. We can certify the classifications of over 50% of texts to be robust to any perturbation of 5 words on AGNEWS and 2 words on the SST2 dataset. The experimental results show that our randomized smoothing method significantly outperforms recently proposed defense methods across multiple datasets.
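The core idea above — classify many randomly masked copies of the input and take a majority vote — can be sketched as follows. This is an illustrative simplification, not the paper's implementation; the function names, mask rate, and sample count are assumptions, and the base classifier is a stand-in for any text classifier (e.g. a BERT model fine-tuned on masked inputs).

```python
import random
from collections import Counter


def random_mask(words, rate, mask_token="[MASK]"):
    """Replace a randomly chosen `rate` fraction of words with the mask token."""
    n_mask = int(round(len(words) * rate))
    masked_idx = set(random.sample(range(len(words)), n_mask))
    return [mask_token if i in masked_idx else w for i, w in enumerate(words)]


def smoothed_predict(classify, text, rate=0.5, n_samples=100):
    """Majority vote of the base classifier over randomly masked copies.

    `classify` is any function mapping a text string to a label; it is a
    placeholder for the underlying model, which the paper does not fix here.
    """
    words = text.split()
    votes = Counter(
        classify(" ".join(random_mask(words, rate))) for _ in range(n_samples)
    )
    return votes.most_common(1)[0][0]
```

Because an adversary who perturbs only a few words is unlikely to have those words survive masking in most sampled copies, the vote distribution changes little, which is what makes the certification argument possible.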
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Sentiment Classification | SST2 (test) | Accuracy: 81.6 | 214 |
| Text Classification | IMDB (test) | CA: 93.2 | 79 |
| Sentiment Analysis | IMDB (test) | Clean Accuracy (%): 94.33 | 37 |
| AI-generated text detection | Cross-genre (test) | OA: 87 | 32 |
| AIGT detection | HC3, PWWS attack, AI to Human (in-domain) | Overall Accuracy: 100 | 28 |
| AI-generated text detection | Mixed-source AI -> Human (GPT-2, GPT-Neo, GPT-J, LLaMa, GPT-3) | Overall Accuracy: 94 | 26 |
| AI-generated text detection | HC3 (test) | F1 (Overall): 95.67 | 18 |
| AIGT detection | HC3, Deep-Word-Bug attack, Overall (in-domain) | OA: 100 | 14 |
| AIGT detection | HC3, Pruthi attack, Overall (in-domain) | Overall Accuracy: 100 | 14 |
| AI-generated text detection | SeqXGPT-Bench, cross-genre | Precision (AI): 89.84 | 14 |