Efficient Vector Representation for Documents through Corruption
About
We present an efficient document representation learning framework, Document Vector through Corruption (Doc2VecC). Doc2VecC represents each document as a simple average of word embeddings, and the learning objective ensures that a representation generated this way captures the semantic meaning of the document. A corruption model is included, which introduces a data-dependent regularization that favors informative or rare words while forcing the embeddings of common and non-discriminative ones to be close to zero. Doc2VecC produces significantly better word embeddings than Word2Vec. We compare Doc2VecC with several state-of-the-art document representation learning algorithms. The simple model architecture introduced by Doc2VecC matches or outperforms the state-of-the-art in generating high-quality document representations for sentiment analysis and document classification, as well as for semantic relatedness tasks. The simplicity of the model enables training on billions of words per hour on a single machine. At the same time, the model is very efficient in generating representations of unseen documents at test time.
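The core mechanism described above can be sketched in a few lines: a document vector is the average of its word embeddings, and during training each word is independently dropped with some probability q, with the survivors rescaled by 1/(1-q) so the corrupted average is unbiased. The toy vocabulary, embedding dimension, and function name below are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary with randomly initialized embeddings; these stand in for
# embeddings learned by Doc2VecC (dimension 8 chosen only for illustration).
vocab = {"the": 0, "movie": 1, "was": 2, "great": 3}
embeddings = rng.normal(size=(len(vocab), 8))

def doc_vector(tokens, corruption_q=0.0):
    """Average of word embeddings. With corruption_q > 0, each word is
    independently dropped with probability q and the survivors are
    rescaled by 1/(1-q), so the corrupted average is an unbiased
    estimate of the clean one."""
    ids = np.array([vocab[t] for t in tokens])
    keep = rng.random(len(ids)) >= corruption_q  # corruption mask
    if not keep.any():                           # keep at least one word
        keep[rng.integers(len(ids))] = True
    kept = embeddings[ids[keep]] / (1.0 - corruption_q)
    return kept.sum(axis=0) / len(ids)

doc = ["the", "movie", "was", "great"]
clean = doc_vector(doc)                    # test time: plain average
noisy = doc_vector(doc, corruption_q=0.5)  # training time: corrupted average
print(clean.shape)  # (8,)
```

At test time the representation is just the plain average (q = 0), which is why generating vectors for unseen documents is so cheap: a single embedding lookup and mean per document.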
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Sentiment Classification | IMDB (test) | Error Rate | 10.48 | 144 |
| Semantic Relatedness | SICK 2014 (test) | Pearson's r | 0.8381 | 56 |
| Document Classification | Wikipedia (test) | Classification Error | 30.24 | 24 |
| Multi-class Classification | 20NewsGroup | Accuracy | 84 | 24 |
| Multi-label Text Classification | Reuters-21578 | Precision@1 | 93.45 | 11 |
| Document Representation Learning | IMDB (train, test) | Learning Time | 430 | 5 |
| Word Analogy | Semantic-Syntactic Word Relationship (most frequent 30k words) | Capital (Common Countries) | 81.82 | 2 |