Better Aggregation in Test-Time Augmentation
About
Test-time augmentation -- the aggregation of predictions across transformed versions of a test input -- is a common practice in image classification. Traditionally, predictions are combined using a simple average. In this paper, we present 1) experimental analyses that shed light on cases in which the simple average is suboptimal and 2) a method to address these shortcomings. A key finding is that even when test-time augmentation produces a net improvement in accuracy, it can change many correct predictions into incorrect predictions. We delve into when and why test-time augmentation changes a prediction from being correct to incorrect and vice versa. Building on these insights, we present a learning-based method for aggregating test-time augmentations. Experiments across a diverse set of models, datasets, and augmentations show that our method delivers consistent improvements over existing approaches.
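The contrast between simple averaging and a weighted alternative, along with the count of predictions that flip in each direction, can be illustrated with a small sketch. This is not the paper's method. The function names (`aggregate_mean`, `aggregate_weighted`, `flip_counts`), the toy probabilities, and the hand-picked weights are all assumptions made for illustration; a learning-based aggregator would presumably fit its weights on held-out data rather than set them by hand.

```python
import numpy as np

def aggregate_mean(probs):
    """Simple average over augmentations.

    probs has shape (n_augmentations, n_samples, n_classes).
    """
    return probs.mean(axis=0)

def aggregate_weighted(probs, weights):
    """Weighted average with one non-negative weight per augmentation."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the output rows still sum to 1
    return np.tensordot(w, probs, axes=(0, 0))

def flip_counts(original_probs, aggregated_probs, labels):
    """Count predictions that aggregation fixes and predictions it breaks."""
    before = original_probs.argmax(axis=1) == labels
    after = aggregated_probs.argmax(axis=1) == labels
    corrections = int(np.sum(~before & after))  # incorrect -> correct
    corruptions = int(np.sum(before & ~after))  # correct -> incorrect
    return corrections, corruptions

# Toy data (hypothetical): 3 augmentations x 3 samples x 2 classes.
# Augmentation 0 is assumed to be the untransformed input.
probs = np.array([
    [[0.60, 0.40], [0.40, 0.60], [0.55, 0.45]],
    [[0.30, 0.70], [0.70, 0.30], [0.20, 0.80]],
    [[0.40, 0.60], [0.80, 0.20], [0.30, 0.70]],
])
labels = np.array([0, 0, 1])

mean_flips = flip_counts(probs[0], aggregate_mean(probs), labels)
# Hand-picked weights that favor the untransformed input (illustrative only).
weighted_flips = flip_counts(
    probs[0], aggregate_weighted(probs, [2.0, 0.5, 0.5]), labels
)
print("mean     (corrections, corruptions):", mean_flips)
print("weighted (corrections, corruptions):", weighted_flips)
```

In this toy case the simple average raises accuracy from one of three to two of three, yet it still turns one correct prediction into an incorrect one, while the weighted variant avoids that corruption. The numbers are constructed to show the effect and carry no empirical weight.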
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Captioning Evaluation | Flickr8K Expert (test) | Kendall tau_c | 51.9 | 76 |
| Image Captioning Evaluation | Flickr8K-CF (test) | Kendall tau_b | 34.7 | 65 |
| Image Captioning Evaluation | THumb (test) | tau_c | 20.7 | 18 |
| Brain Tumor Segmentation | BraTS-MEN | Dice | 0.132 | 7 |
| Brain Tumor Segmentation | BraTS-PED | Dice Score | 86.95 | 7 |