Efficient Vector Representation for Documents through Corruption
About
We present an efficient document representation learning framework, Document Vector through Corruption (Doc2VecC). Doc2VecC represents each document as a simple average of word embeddings, and the learning objective ensures that a representation generated this way captures the semantic meaning of the document. A corruption model is included, which introduces a data-dependent regularization that favors informative or rare words while forcing the embeddings of common and non-discriminative ones to be close to zero. Doc2VecC produces significantly better word embeddings than Word2Vec. We compare Doc2VecC with several state-of-the-art document representation learning algorithms. The simple model architecture introduced by Doc2VecC matches or outperforms the state-of-the-art in generating high-quality document representations for sentiment analysis and document classification, as well as for semantic relatedness tasks. The simplicity of the model enables training on billions of words per hour on a single machine. At the same time, the model is very efficient in generating representations of unseen documents at test time.
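The core mechanism described above can be sketched in a few lines: a document vector is the average of its word embeddings, and during training each word is independently dropped with some probability q, with the survivors rescaled by 1/(1-q) so the corrupted average is unbiased. The toy vocabulary, embedding dimension, and function name below are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary with randomly initialized embeddings; these stand in for
# embeddings learned by Doc2VecC (dimension 8 chosen only for illustration).
vocab = {"the": 0, "movie": 1, "was": 2, "great": 3}
embeddings = rng.normal(size=(len(vocab), 8))

def doc_vector(tokens, corruption_q=0.0):
    """Average of word embeddings. With corruption_q > 0, each word is
    independently dropped with probability q and the survivors are
    rescaled by 1/(1-q), so the corrupted average is an unbiased
    estimate of the clean one."""
    ids = np.array([vocab[t] for t in tokens])
    keep = rng.random(len(ids)) >= corruption_q  # corruption mask
    if not keep.any():                           # keep at least one word
        keep[rng.integers(len(ids))] = True
    kept = embeddings[ids[keep]] / (1.0 - corruption_q)
    return kept.sum(axis=0) / len(ids)

doc = ["the", "movie", "was", "great"]
clean = doc_vector(doc)                    # test time: plain average
noisy = doc_vector(doc, corruption_q=0.5)  # training time: corrupted average
print(clean.shape)  # (8,)
```

At test time the representation is just the plain average (q = 0), which is why generating vectors for unseen documents is so cheap: a single embedding lookup and mean per document.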
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Sentiment Classification | IMDB (test) | Error Rate | 10.48 | 144 |
| Semantic Relatedness | SICK 2014 (test) | Pearson's r | 0.8381 | 56 |
| Document Classification | Wikipedia (test) | Classification Error | 30.24 | 24 |
| Multi-class Classification | 20NewsGroup | Accuracy | 84 | 24 |
| Multi-label Text Classification | Reuters-21578 | Precision@1 | 93.45 | 11 |
| Document Representation Learning | IMDB (train, test) | Learning Time | 430 | 5 |
| Word Analogy | Semantic-Syntactic Word Relationship (most frequent 30k words) | Capital (Common Countries) | 81.82 | 2 |