
Farewell to Aimless Large-scale Pretraining: Influential Subset Selection for Language Model

About

Pretrained language models have achieved remarkable success in various natural language processing tasks. However, pretraining has recently shifted toward larger models and larger data, resulting in significant computational and energy costs. In this paper, we propose Influential Subset Selection (ISS) for language models, which explicitly utilizes end-task knowledge to select a tiny subset of the pretraining corpus. Specifically, ISS selects the samples that will provide the most positive influence on the performance of the end task. Furthermore, we design a gradient-matching-based influence estimation method, which drastically reduces the computation time of influence. With only 0.45% of the data and a three-orders-of-magnitude lower computational cost, ISS outperformed pretrained models (e.g., RoBERTa) on eight datasets covering four domains.
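The gradient-matching idea can be sketched in a few lines of PyTorch. This is a minimal illustration under assumed inputs (a `model`, a `loss_fn`, and iterables of (input, label) pairs), not the authors' released implementation: each pretraining sample is scored by how well its loss gradient aligns with the end-task gradient, and the highest-scoring fraction is kept.

```python
import torch

def influence_scores(model, loss_fn, pretrain_samples, end_task_batch):
    """Score each pretraining sample by gradient alignment with the end task.

    A minimal sketch of gradient-matching influence estimation, not the
    paper's implementation: a sample's influence is approximated by the
    dot product between its loss gradient and the end-task loss gradient.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient of the end-task loss on one representative batch.
    x_t, y_t = end_task_batch
    g_task = torch.autograd.grad(loss_fn(model(x_t), y_t), params)
    g_task = torch.cat([g.flatten() for g in g_task])

    scores = []
    for x, y in pretrain_samples:  # one sample (or micro-batch) at a time
        g = torch.autograd.grad(loss_fn(model(x), y), params)
        g = torch.cat([gi.flatten() for gi in g])
        # Alignment with the end-task gradient serves as the influence proxy.
        scores.append(torch.dot(g, g_task).item())
    return scores

def select_subset(scores, fraction=0.0045):
    """Return indices of the top `fraction` of samples (e.g., 0.45%)."""
    k = max(1, int(len(scores) * fraction))
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

In practice, computing full-model gradients per sample is expensive; restricting `params` to, say, the final layers is one common way to keep such an estimate cheap, presumably in the spirit of the paper's reported cost reduction.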

Xiao Wang, Weikang Zhou, Qi Zhang, Jie Zhou, Songyang Gao, Junzhe Wang, Menghan Zhang, Xiang Gao, Yunwen Chen, Tao Gui · 2023

Related benchmarks

Task                              Dataset        Result           Rank
Text Classification               AGNews         --               119
Sentiment Classification         IMDB           --               41
Relation Extraction               ChemProt       Micro F1: 83.42  40
Relation Extraction               SciERC         --               28
Text Classification               HyperPartisan  F1 Score: 93.53  19
Citation Intent Classification    ACL-ARC        Macro F1: 74.53  13
Text Classification               Helpfulness    F1 Score: 72.27  13
Abstract Sentence Classification  RCT            Micro-F1: 87.41  13

Other info

Code
