
Cold-start Active Learning through Self-supervised Language Modeling

About

Active learning strives to reduce annotation costs by choosing the most critical examples to label. Typically, the active learning strategy is contingent on the classification model. For instance, uncertainty sampling depends on poorly calibrated model confidence scores. In the cold-start setting, active learning is impractical because of model instability and data scarcity. Fortunately, modern NLP provides an additional source of information: pre-trained language models. The pre-training loss can find examples that surprise the model and should be labeled for efficient fine-tuning. Therefore, we treat the language modeling loss as a proxy for classification uncertainty. With BERT, we develop a simple strategy based on the masked language modeling loss that minimizes labeling costs for text classification. Compared to other baselines, our approach reaches higher accuracy in fewer sampling iterations and less computation time.
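The core idea — score each unlabeled example by its masked language modeling loss and label the most "surprising" ones first — can be sketched in a few lines. This is a minimal illustration, not the paper's full method: the per-token losses below are hypothetical placeholders, where in practice they would come from a pre-trained model such as BERT before any fine-tuning.

```python
def mlm_surprisal(token_losses):
    """Average masked-LM loss over a sentence's masked tokens."""
    return sum(token_losses) / len(token_losses)

def select_cold_start(pool, k):
    """Pick the k unlabeled examples with the highest LM loss.

    `pool` maps example ids to lists of per-token MLM losses
    (hypothetical values standing in for a pre-trained model's output).
    High loss = the language model is surprised = a good candidate
    for labeling when no classifier exists yet.
    """
    scored = sorted(pool.items(),
                    key=lambda kv: mlm_surprisal(kv[1]),
                    reverse=True)
    return [ex_id for ex_id, _ in scored[:k]]

pool = {
    "a": [0.2, 0.3],   # low surprisal: model already predicts these tokens well
    "b": [2.5, 3.1],   # high surprisal: strong candidate for annotation
    "c": [1.0, 1.2],
}
print(select_cold_start(pool, 2))  # → ['b', 'c']
```

Because the scores come from the pre-trained LM rather than the (not-yet-trained) classifier, this selection works even before the first labeling round, which is what makes it a cold-start strategy.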

Michelle Yuan, Hsuan-Tien Lin, Jordan Boyd-Graber • 2020

Related benchmarks

Task                 Dataset     Result            Rank
Text Classification  AG-News     Accuracy 83.56    248
Text Classification  IMDB        Accuracy 84.26    107
Text Classification  AMAZON      Accuracy 89.87    37
Text Classification  MNLI        Accuracy 60.12    32
Text Classification  Yelp        Accuracy 92.48    21
Text Classification  SemEval     --                17
Text Classification  GoEmotions  Accuracy 26.21    9
Text Classification  MIMIC-III   Accuracy 83.41    9
