MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification
About
This paper presents MixText, a semi-supervised learning method for text classification, which uses our newly designed data augmentation method called TMix. TMix creates a large number of augmented training samples by interpolating text in hidden space. Moreover, we leverage recent advances in data augmentation to guess low-entropy labels for unlabeled data, hence making them as easy to use as labeled data. By mixing labeled, unlabeled and augmented data, MixText significantly outperformed current pre-trained and fine-tuned models and other state-of-the-art semi-supervised learning methods on several text classification benchmarks. The improvement is especially prominent when supervision is extremely limited. We have publicly released our code at https://github.com/GT-SALT/MixText.
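As a rough illustration of what "interpolating text in hidden space" and "guessing low-entropy labels" can look like, here is a minimal pure-Python sketch. It is not taken from the released code. The function names (`mix_hidden`, `sample_lambda`, `sharpen`), the Beta-distributed mixing weight, the `alpha` value, and the temperature-based sharpening are all assumptions borrowed from generic mixup-style practice; the abstract does not specify them.

```python
import random


def mix_hidden(h_a, h_b, lam):
    """Linearly interpolate two hidden vectors (or two label vectors).

    Hypothetical helper: lam weights h_a, (1 - lam) weights h_b.
    """
    if len(h_a) != len(h_b):
        raise ValueError("vectors must have the same length")
    return [lam * a + (1.0 - lam) * b for a, b in zip(h_a, h_b)]


def sample_lambda(alpha=0.75, rng=random):
    """Draw a mixing weight from Beta(alpha, alpha).

    Assumption: the weight is folded into [0.5, 1] so the mixed sample
    stays closer to the first input. alpha=0.75 is an illustrative value.
    """
    lam = rng.betavariate(alpha, alpha)
    return max(lam, 1.0 - lam)


def sharpen(probs, temperature=0.5):
    """Lower the entropy of a predicted distribution.

    Assumption: sharpening raises each probability to 1/temperature
    and renormalizes; temperature < 1 makes the guess more peaked.
    """
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


# Toy usage: mix two 4-dimensional "hidden states" and their labels.
h_a, y_a = [0.2, 0.4, 0.1, 0.9], [1.0, 0.0]                # labeled sample
h_b, y_b = [0.8, 0.0, 0.5, 0.3], sharpen([0.6, 0.4])       # guessed label
lam = sample_lambda()
h_mix = mix_hidden(h_a, h_b, lam)  # would be fed to the remaining layers
y_mix = mix_hidden(y_a, y_b, lam)  # soft target for the mixed sample
```

In a real model the hidden vectors would come from an intermediate encoder layer and the mixed vector would pass through the remaining layers before classification; the sketch only shows the arithmetic of the interpolation and of the label sharpening.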
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text Classification | AG News (test) | Accuracy 91.51 | 210 |
| Text Classification | Yahoo! Answers (test) | Clean Accuracy 74.1 | 133 |
| Ontology Classification | DBPedia (test) | Accuracy 99.2 | 53 |
| Sentence Classification | Amazon Review (test) | Accuracy 92.79 | 15 |