Better Aggregation in Test-Time Augmentation

About

Test-time augmentation -- the aggregation of predictions across transformed versions of a test input -- is a common practice in image classification. Traditionally, predictions are combined using a simple average. In this paper, we present 1) experimental analyses that shed light on cases in which the simple average is suboptimal and 2) a method to address these shortcomings. A key finding is that even when test-time augmentation produces a net improvement in accuracy, it can change many correct predictions into incorrect predictions. We delve into when and why test-time augmentation changes a prediction from being correct to incorrect and vice versa. Building on these insights, we present a learning-based method for aggregating test-time augmentations. Experiments across a diverse set of models, datasets, and augmentations show that our method delivers consistent improvements over existing approaches.

Divya Shanmugam, Davis Blalock, Guha Balakrishnan, John Guttag • 2020
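The aggregation step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names (`average_tta`, `weighted_tta`, `count_flips`), the array layout, and the use of one scalar weight per augmentation are all assumptions made for illustration. It shows the simple average, a weighted alternative whose weights could in principle be learned on held-out data, and a helper that counts how many predictions aggregation turns from correct to incorrect and vice versa.

```python
import numpy as np

def average_tta(probs):
    """Simple average over augmentations.

    probs: array of shape (n_augmentations, n_examples, n_classes),
    where probs[0] is assumed to hold predictions on the original input.
    """
    return probs.mean(axis=0)

def weighted_tta(probs, weights):
    """Weighted average with one weight per augmentation (hypothetical scheme).

    The weights are normalized to sum to 1; how they are obtained
    (e.g. fitted on a validation set) is outside this sketch.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    # Contract the augmentation axis: result has shape (n_examples, n_classes).
    return np.tensordot(w, probs, axes=(0, 0))

def count_flips(base_probs, agg_probs, labels):
    """Return (corrupted, corrected) prediction counts.

    corrupted: correct without augmentation, incorrect after aggregation.
    corrected: incorrect without augmentation, correct after aggregation.
    """
    labels = np.asarray(labels)
    base_ok = base_probs.argmax(axis=1) == labels
    agg_ok = agg_probs.argmax(axis=1) == labels
    corrupted = int(np.sum(base_ok & ~agg_ok))
    corrected = int(np.sum(~base_ok & agg_ok))
    return corrupted, corrected

# Toy data: 3 augmentations, 3 examples, 2 classes; every true label is 0.
probs = np.array([
    [[0.6, 0.4], [0.4, 0.6], [0.9, 0.1]],  # original input
    [[0.3, 0.7], [0.7, 0.3], [0.8, 0.2]],  # e.g. horizontal flip
    [[0.4, 0.6], [0.6, 0.4], [0.7, 0.3]],  # e.g. crop
])
labels = np.array([0, 0, 0])

avg = average_tta(probs)
flips = count_flips(probs[0], avg, labels)
```

On this toy data, averaging leaves overall accuracy unchanged but flips one prediction in each direction, which is the kind of effect the abstract refers to.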

Related benchmarks

Task	Dataset	Result
Image Classification	ImageNet-C Severity 5 (test)	Mean Error Rate (Severity 5) 55.6	216
Image Captioning Evaluation	Flickr8K Expert (test)	Kendall tau_c 51.9	76
Image Captioning Evaluation	Flickr8K-CF (test)	Kendall tau_b 34.7	65
Image Captioning Evaluation	THumb (test)	tau_c 20.7	18
Skin lesion classification	Derm7pt	--	15
Brain Tumor Segmentation	BraTS-MEN	Dice 0.132	7
Brain Tumor Segmentation	BraTS-PED	Dice Score 86.95	7

