Towards Robustness Against Natural Language Word Substitutions
About
Robustness against word substitutions has a well-defined and widely accepted form, i.e., using semantically similar words as substitutions, and is therefore considered a fundamental stepping stone towards broader robustness in natural language processing. Previous defense methods capture word substitutions in vector space using either an $\ell_2$-ball or a hyper-rectangle, which yields perturbation sets that are either not inclusive enough or unnecessarily large, and thus impedes the mimicry of worst cases for robust training. In this paper, we introduce a novel Adversarial Sparse Convex Combination (ASCC) method. We model the word substitution attack space as a convex hull and leverage a regularization term to push the perturbation towards an actual substitution, thus aligning our modeling better with the discrete textual space. Based on the ASCC method, we further propose ASCC-defense, which leverages ASCC to generate worst-case perturbations and incorporates adversarial training towards robustness. Experiments show that ASCC-defense outperforms the current state-of-the-art methods in terms of robustness on two prevailing NLP tasks, i.e., sentiment analysis and natural language inference, under several attacks and across multiple model architectures. In addition, we envision a new class of defense towards robustness in NLP, in which our robustly trained word vectors can be plugged into a normally trained model and enforce its robustness without applying any other defense techniques.
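The convex-hull modeling described above can be illustrated with a minimal sketch. It assumes that the perturbed embedding is a softmax-weighted combination of the substitution embeddings and that sparsity is measured by the entropy of the weights; the abstract does not specify the regularizer, so the entropy term, the function names, and the toy vectors below are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax: maps logits to non-negative weights summing to 1."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def convex_combination(sub_vectors, logits):
    """Return a point inside the convex hull of sub_vectors, plus its weights.

    sub_vectors: (k, d) array, one embedding per candidate substitution.
    logits:      (k,) array of unconstrained parameters (hypothetical name).
    """
    w = softmax(logits)
    return w @ sub_vectors, w


def entropy(w, eps=1e-12):
    """Entropy of the weights. Assumed sparsity measure: low entropy means the
    combination sits close to a single actual substitution."""
    return float(-(w * np.log(w + eps)).sum())


# Toy data: 4 candidate substitutions with 8-dimensional embeddings.
rng = np.random.default_rng(0)
subs = rng.normal(size=(4, 8))

# Uniform weights give the centroid of the hull (far from any real word).
p_uniform, w_uniform = convex_combination(subs, np.zeros(4))

# Peaked weights give a point close to the first substitution.
p_peaked, w_peaked = convex_combination(subs, np.array([10.0, 0.0, 0.0, 0.0]))

print("entropy (uniform):", entropy(w_uniform))
print("entropy (peaked): ", entropy(w_peaked))
```

In an adversarial training loop, the logits would be optimized to maximize the model loss while a penalty on the entropy keeps the weights sparse; that optimization step is omitted here.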
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Natural Language Inference | SNLI (test) | Accuracy | 87.1 | 681 |
| Natural Language Inference | SNLI | Accuracy | 87.1 | 174 |
| Text Classification | Yahoo! Answers (test) | Clean Accuracy | 70.7 | 133 |
| Text Classification | IMDB (test) | CA | 87.8 | 79 |
| Sentiment Classification | IMDB | Accuracy | 80.1 | 41 |
| Sentiment Analysis | IMDB (test) | Clean Accuracy (%) | 92.62 | 37 |
| Rumor Detection | Pheme | DeepWordBug ASR | 44.53 | 16 |
| Harmful Content Detection | PHEME New Attacks: ExplainDrive (test) | Accuracy | 79.88 | 15 |
| Sentiment Analysis | IMDB (test) | Genetic Score | 74.8 | 10 |
| Harmful Content Detection | PHEME Known Attacks: DeepWordBug, TFAdjusted, TREPAT (test) | Accuracy | 81.15 | 10 |