Revisiting Few-sample BERT Fine-tuning

About

This paper is a study of fine-tuning of BERT contextual representations, with a focus on commonly observed instabilities in few-sample scenarios. We identify several factors that cause this instability: the common use of a non-standard optimization method with biased gradient estimation; the limited applicability of significant parts of the BERT network for downstream tasks; and the prevalent practice of using a pre-determined and small number of training iterations. We empirically test the impact of these factors and identify alternative practices that resolve the commonly observed instability of the process. In light of these observations, we revisit recently proposed methods to improve few-sample fine-tuning with BERT and re-evaluate their effectiveness. Generally, we observe that the impact of these methods diminishes significantly with our modified process.

Tianyi Zhang, Felix Wu, Arzoo Katiyar, Kilian Q. Weinberger, Yoav Artzi • 2020
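
The first factor named in the abstract, an optimizer with biased gradient estimation, can be illustrated with a small sketch. The abstract does not name the optimizer, so this assumes it refers to an Adam variant that omits the bias-correction step. The function name `adam_step_sizes`, the constant gradient, and the hyperparameter values are hypothetical choices for illustration and are not taken from the paper.

```python
import math


def adam_step_sizes(grad, lr=2e-5, beta1=0.9, beta2=0.999, eps=1e-6,
                    steps=5, bias_correction=True):
    """Return the Adam update magnitude at each step for a constant gradient.

    Illustrative only: a scalar parameter and a fixed gradient value.
    """
    m = 0.0  # running estimate of the first moment (mean of gradients)
    v = 0.0  # running estimate of the second moment (mean of squared gradients)
    updates = []
    for t in range(1, steps + 1):
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad * grad
        if bias_correction:
            # Standard Adam: rescale the zero-initialized moment estimates.
            m_hat = m / (1 - beta1 ** t)
            v_hat = v / (1 - beta2 ** t)
        else:
            # Assumed non-standard variant: use the raw, biased estimates.
            m_hat, v_hat = m, v
        updates.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates


corrected = adam_step_sizes(1.0, bias_correction=True)
uncorrected = adam_step_sizes(1.0, bias_correction=False)
for t, (c, u) in enumerate(zip(corrected, uncorrected), start=1):
    print(f"step {t}: corrected={c:.3e} uncorrected={u:.3e} ratio={u / c:.2f}")
```

In this toy setting the uncorrected variant takes early steps several times larger than the nominal learning rate, which is one plausible way a missing correction could destabilize the first iterations on a small dataset.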

Related benchmarks

Task	Dataset	Result
Natural Language Understanding	GLUE (dev)	SST-2 (Acc) 96.9	529
Natural Language Understanding	GLUE (val)	--	201
Image Classification	CIFAR-10 Long-Tailed	Accuracy 96.46	71
Image Classification	CIFAR-100 Long-Tailed	Accuracy 85.53	71
Sequence Classification	GLUE & SuperGLUE (MultiRC, COPA, RTE, BoolQ, MRPC, CoLA)	MultiRC Accuracy 74.05	24
Multi-task Classification	GLUE MRPC, RTE, CoLA (test, val)	MRPC Accuracy 85.22	12
Image Classification	CIFAR-10 step	Accuracy 96.09	12
Image Classification	CIFAR-100 step	Accuracy 84.3	12
Semantic segmentation	Pascal Semantic Segmentation ID Clean (test)	mIoU (Clean) 72.09	9
Semantic segmentation	Pascal Semantic Segmentation OOD Corrupted (test)	mIoU (Fog) 0.6813	9

Showing 10 of 16 rows
