ReGen: Zero-Shot Text Classification via Training Data Generation with Progressive Dense Retrieval
About
With the development of large language models (LLMs), zero-shot learning has attracted much attention for various NLP tasks. Unlike prior works that generate training data with billion-scale natural language generation (NLG) models, we propose a retrieval-enhanced framework that creates training data from a general-domain unlabeled corpus. To realize this, we first conduct contrastive pretraining to learn an unsupervised dense retriever that extracts the most relevant documents using class-descriptive verbalizers. We then propose two simple strategies, namely Verbalizer Augmentation with Demonstrations and Self-consistency Guided Filtering, to improve the topic coverage of the dataset while removing noisy examples. Experiments on nine datasets demonstrate that ReGen achieves a 4.3% gain over the strongest baselines and saves around 70% of the time compared to baselines that use large NLG models. Moreover, ReGen can be naturally integrated with recently proposed large language models to further boost performance.
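To make the core retrieval step concrete, below is a minimal sketch of how class-descriptive verbalizers can select pseudo-labeled training documents from an unlabeled corpus. The embeddings, class names, and document IDs here are toy placeholders, not the paper's actual retriever: in ReGen the vectors would come from the contrastively pretrained dense encoder, and retrieval would run over a large corpus rather than a three-document dictionary.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings standing in for a contrastively pretrained dense encoder.
# Each verbalizer vector encodes a class-descriptive prompt (hypothetical values).
verbalizers = {
    "sports":   [0.9, 0.1, 0.0],
    "business": [0.1, 0.9, 0.2],
}
corpus = {
    "doc1": [0.8, 0.2, 0.1],
    "doc2": [0.2, 0.85, 0.15],
    "doc3": [0.5, 0.5, 0.5],
}

def retrieve(class_vec, corpus, k=1):
    """Return the k corpus documents most similar to the class verbalizer."""
    ranked = sorted(corpus.items(),
                    key=lambda kv: cosine(class_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Retrieved documents become pseudo-labeled training data for each class.
pseudo_labeled = {label: retrieve(vec, corpus, k=1)
                  for label, vec in verbalizers.items()}
print(pseudo_labeled)  # → {'sports': ['doc1'], 'business': ['doc2']}
```

In the full framework, this retrieval pass would be followed by the two refinement strategies from the abstract: augmenting each verbalizer with retrieved demonstrations to broaden topic coverage, and filtering out examples whose predicted labels are inconsistent across views.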
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Sentiment Classification | SST2 (test) | Accuracy | 87.84 | 214 |
| Sentiment Analysis | SST-2 | Accuracy | 85.32 | 156 |
| Sentiment Classification | IMDB (test) | -- | -- | 144 |
| Topic Classification | AG News (test) | Accuracy | 85 | 98 |
| Topic Classification | DBPedia (test) | Accuracy | 87.6 | 64 |
| Sentiment Analysis | IMDB | Accuracy | 87.84 | 57 |
| Sentiment Classification | Yelp (test) | Accuracy | 93 | 46 |
| Topic Classification | Yahoo (test) | Accuracy | 59.4 | 36 |
| Sentiment Analysis | Yelp | Accuracy | 89 | 30 |
| Sentiment Analysis | Rotten Tomatoes | Accuracy | 81.42 | 25 |