Assessing Generalization of SGD via Disagreement

About

We empirically show that the test error of deep networks can be estimated by simply training the same architecture on the same training set with a different run of Stochastic Gradient Descent (SGD), and measuring the disagreement rate between the two networks on unlabeled test data. This builds on, and strengthens, the observation in Nakkiran & Bansal '20, which requires the second run to use an altogether fresh training set. We further show theoretically that this peculiar phenomenon arises from the well-calibrated nature of ensembles of SGD-trained models. This finding not only provides a simple empirical measure to directly predict the test error using unlabeled test data, but also establishes a new conceptual connection between generalization and calibration.
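A minimal sketch of the procedure the abstract describes, in PyTorch-style Python. It assumes two classifiers of the same architecture trained on the same data with independent SGD runs (e.g. different seeds and data orderings), whose forward pass returns class logits; the function name, signature, and loader handling are illustrative assumptions, not the authors' released code.

```python
import torch

@torch.no_grad()
def disagreement_rate(model_a, model_b, unlabeled_loader, device="cpu"):
    # Fraction of unlabeled test inputs on which the two SGD runs predict
    # different classes; per the paper, this rate tracks the test error.
    model_a.eval()
    model_b.eval()
    disagree, total = 0, 0
    for batch in unlabeled_loader:
        # Unlabeled loaders may yield bare tensors or (tensor, ...) tuples.
        x = batch[0] if isinstance(batch, (list, tuple)) else batch
        x = x.to(device)
        pred_a = model_a(x).argmax(dim=1)  # hard top-1 labels from run 1
        pred_b = model_b(x).argmax(dim=1)  # hard top-1 labels from run 2
        disagree += (pred_a != pred_b).sum().item()
        total += x.size(0)
    return disagree / total

# Usage (hypothetical names): train two runs of the same architecture on the
# same training set, then estimate test error without labels:
#   err_estimate = disagreement_rate(net_seed0, net_seed1, unlabeled_test_loader)
```

Note that the estimate uses hard top-1 predictions, matching the disagreement rate the paper measures between pairs of runs.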

Yiding Jiang, Vaishnavh Nagarajan, Christina Baek, J. Zico Kolter • 2021

Related benchmarks

Task                              Dataset                            Metric     Result   Rank
Image Classification              ImageNet Matched Frequency V2      --         --       92
Accuracy Estimation               PACS                               R²         0.613    50
Unsupervised Accuracy Estimation  RR1-WILDS                          R²         0.946    36
Unsupervised Accuracy Estimation  DomainNet                          R²         0.455    36
Accuracy Estimation               Entity-30 Subpopulation Shift      R²         0.914    36
Accuracy Estimation               Entity-13 Subpopulation Shift      R²         0.901    36
Unsupervised Accuracy Estimation  Office-Home                        R²         0.132    36
Accuracy Estimation               Living-17 Subpopulation Shift      R²         0.652    36
Accuracy Estimation               Nonliving-26 Subpopulation Shift   R²         0.676    36
Text Classification               SNLI                               MAPE (%)   2.5      6

Showing 10 of 22 rows.
