
Uncertainty Quantification and Deep Ensembles

About

Deep learning methods are known to suffer from calibration issues: they typically produce over-confident estimates. These problems are exacerbated in the low-data regime. Although the calibration of probabilistic models is well studied, calibrating extremely over-parametrized models in the low-data regime presents unique challenges. We show that deep ensembles do not necessarily lead to improved calibration properties. In fact, we show that standard ensembling methods, when used in conjunction with modern techniques such as mixup regularization, can lead to less calibrated models. This text examines the interplay between three of the simplest and most commonly used approaches to leveraging deep learning when data is scarce: data augmentation, ensembling, and post-processing calibration methods. Although standard ensembling techniques certainly help boost accuracy, we demonstrate that the calibration of deep ensembles relies on subtle trade-offs. We also find that calibration methods such as temperature scaling need to be slightly tweaked when used with deep ensembles and, crucially, need to be executed after the averaging process. Our simulations indicate that this simple strategy can halve the Expected Calibration Error (ECE) on a range of benchmark classification problems compared to standard deep ensembles in the low-data regime.
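The ordering described in the abstract (average the ensemble members first, then apply temperature scaling to the pooled prediction) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, members are assumed to be pooled by averaging their softmax probabilities, ECE uses 15 equal-width confidence bins, and the temperature is chosen by a simple grid search on validation NLL rather than by whatever fitting procedure the paper uses.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def expected_calibration_error(probs, labels, n_bins=15):
    """Binned ECE: sum over bins of (bin fraction) * |accuracy - confidence|."""
    confidences = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)


def pool_then_scale(member_logits, temperature=1.0):
    """Average member probabilities, then apply a single temperature.

    member_logits has shape (n_members, n_samples, n_classes).
    Assumption: pooling is done in probability space.
    """
    pooled = softmax(member_logits, axis=-1).mean(axis=0)
    log_pooled = np.log(np.clip(pooled, 1e-12, None))
    return softmax(log_pooled / temperature, axis=-1)


def negative_log_likelihood(probs, labels):
    """Mean negative log-probability assigned to the true labels."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(np.clip(picked, 1e-12, None)).mean())


def fit_temperature(member_logits, labels, grid=None):
    """Grid-search the temperature that minimises NLL of the pooled prediction.

    Assumption: a grid search stands in for a gradient-based fit.
    """
    if grid is None:
        grid = np.linspace(0.25, 5.0, 96)
    losses = [
        negative_log_likelihood(pool_then_scale(member_logits, t), labels)
        for t in grid
    ]
    return float(grid[int(np.argmin(losses))])
```

In use, `fit_temperature` would be called on held-out validation logits stacked across members, and the resulting temperature passed to `pool_then_scale` at test time. With `temperature=1.0` the function returns the plain ensemble average, which makes it easy to compare ECE before and after scaling.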

Rahul Rahaman, Alexandre H. Thiery • 2020

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Breast Cancer Subtype Prediction | BRACS → BACH | F1 Score | 0.7084 | 16 |
| Breast Cancer Subtype Prediction | BRACS original | F1 Score | 58.36 | 13 |
| Breast Cancer Subtype Prediction | BRACS 512x512 | F1 Score | 62.7 | 13 |
| Text Classification | Tweet FTC-metadataset mini 10% (full dataset 100%) | NLL | 0.4979 | 8 |
| Text Classification | SST-2 FTC-metadataset mini 10% 1.0 (test) | AURAC | 98.22 | 8 |
| Text Classification | IMDB FTC-metadataset mini 10% 1.0 (test) | AURAC Score | 0.9821 | 8 |
| Text Classification | DBpedia FTC-metadataset mini 10% | AUROC | 0.9998 | 8 |
| Text Classification | SST-2 FTC-metadataset mini (10%) (full dataset 100%) | NLL | 0.132 | 8 |
| Text Classification | IMDB FTC-metadataset mini 10% | NLL | 0.1255 | 8 |
| Text Classification | IMDB FTC-metadataset full | Avg Prediction Set Size | 1.1003 | 8 |
Showing 10 of 33 rows
