Distributed Representations of Sentences and Documents

About

Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, "powerful," "strong" and "Paris" are equally distant. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors outperform bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.
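The two bag-of-words weaknesses named in the abstract can be illustrated with a minimal sketch. It is not the paper's method. It only builds plain count vectors with the standard library, and the helper names (`bag_of_words`, `euclidean`) and the example sentences are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Return the count vector of `tokens` over a fixed vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Weakness 1: semantics are ignored. Each single word maps to a one-hot
# vector, so every pair of distinct words is the same distance apart.
vocabulary = ["powerful", "strong", "Paris"]
one_hot = {word: bag_of_words([word], vocabulary) for word in vocabulary}

d_powerful_strong = euclidean(one_hot["powerful"], one_hot["strong"])
d_powerful_paris = euclidean(one_hot["powerful"], one_hot["Paris"])

# Weakness 2: word order is lost. Two sentences with the same words in a
# different order (hypothetical examples) get identical count vectors.
sentence_a = "dog bites man".split()
sentence_b = "man bites dog".split()
order_vocab = sorted(set(sentence_a) | set(sentence_b))
same_vector = (
    bag_of_words(sentence_a, order_vocab) == bag_of_words(sentence_b, order_vocab)
)

print(d_powerful_strong == d_powerful_paris)  # distances are equal
print(same_vector)                            # vectors are identical
```

Under this representation "powerful" is no closer to "strong" than to "Paris", which is the gap a learned dense vector is meant to close.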

Quoc V. Le, Tomas Mikolov • 2014

Related benchmarks

Task	Dataset	Result
Subjectivity Classification	Subj	Accuracy 90.5	343
Sentiment Analysis	IMDB (test)	--	306
Text Classification	AG News (test)	--	293
Text Classification	TREC	Accuracy 91.8	281
Question Classification	TREC	Accuracy 91.8	262
Sentiment Classification	SST-2	Accuracy 87.8	190
Text Classification	SST-2 (test)	Accuracy 87.8	185
Sentiment Analysis	SST-5 (test)	Accuracy 48.7	177
Text Classification	MR	Accuracy 74.8	174
Opinion Polarity Detection	MPQA	Accuracy 74.2	158

Showing 10 of 83 rows

