Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control

About

We introduce a framework for calibrating machine learning models so that their predictions satisfy explicit, finite-sample statistical guarantees. Our calibration algorithms work with any underlying model and (unknown) data-generating distribution and do not require model refitting. The framework addresses, among other examples, false discovery rate control in multi-label classification, intersection-over-union control in instance segmentation, and the simultaneous control of the type-1 error of outlier detection and confidence set coverage in classification or regression. Our main insight is to reframe the risk-control problem as multiple hypothesis testing, enabling techniques and mathematical arguments different from those in the previous literature. We use the framework to provide new calibration methods for several core machine learning tasks, with detailed worked examples in computer vision and tabular medical data.
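To make the multiple-testing reframing concrete, here is a minimal sketch under stated assumptions, not the authors' reference implementation. It assumes a finite grid of candidate parameter values, losses bounded in [0, 1], and a held-out calibration set. Each candidate gets a null hypothesis "the true risk exceeds alpha". The p-value comes from Hoeffding's inequality and the multiplicity correction is Bonferroni; both are simple valid choices picked for illustration, and tighter ones exist. The names `hoeffding_p_value` and `learn_then_test` are hypothetical.

```python
import numpy as np


def hoeffding_p_value(emp_risk, n, alpha):
    """P-value for the null 'true risk > alpha', assuming losses in [0, 1].

    Derived from Hoeffding's inequality; valid in finite samples but
    conservative (an illustrative choice, not the only one).
    """
    gap = max(alpha - emp_risk, 0.0)
    return float(np.exp(-2.0 * n * gap ** 2))


def learn_then_test(losses, alpha, delta):
    """Return indices of candidates certified to have risk <= alpha.

    losses: array of shape (n, m); losses[i, j] is the loss of
            calibration example i under candidate parameter j.
    alpha:  target risk level.
    delta:  allowed probability that any returned candidate
            actually violates the risk level.
    """
    losses = np.asarray(losses, dtype=float)
    n, m = losses.shape
    emp_risk = losses.mean(axis=0)  # empirical risk per candidate
    p_values = np.array([hoeffding_p_value(r, n, alpha) for r in emp_risk])
    # Bonferroni: reject a null only if its p-value is below delta / m,
    # which controls the family-wise error rate at level delta.
    return np.flatnonzero(p_values <= delta / m)
```

Under these assumptions, every returned candidate has risk at most alpha simultaneously with probability at least 1 - delta over the calibration data. The underlying model is never refit; only the per-example losses on the calibration set are needed.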

Anastasios N. Angelopoulos, Stephen Bates, Emmanuel J. Candès, Michael I. Jordan, Lihua Lei • 2021

Related benchmarks

Task	Dataset	Result
Reinforcement Learning from Verifiable Rewards	HEAD-QA	AR: 37.8	30
Distribution Shift Robustness	Sixteen Adversarial Cells MedQA + GSM8K (eval)	Violations: 4	10
Expert-Iteration RLVR	MedQA, HEAD-QA, ARC-C, and CaseHOLD	Pathwise Clean Score: 4	10
Natural Language Inference	medNLI	AR (%): 66.6	10
Mathematical Reasoning	GSM8K	AR (%): 9	10
Selective Prediction	NyayaBench v2	Guaranteed Test Coverage (alpha=0.20): 26	9
Question Answering	MedQA	AR (%): 24.3	9
Question Answering	CaseHOLD	AR (%): 15	9
Selective Prediction	MASSIVE (test)	Guaranteed Test Coverage (alpha=0.10): 94	8
Selective Prediction	CLINC-150 v1 (test)	Performance (α=0.10): 94.3	7

Showing 10 of 19 rows
