Learning to Defer to Multiple Experts: Consistent Surrogate Losses, Confidence Calibration, and Conformal Ensembles

About

We study the statistical properties of learning to defer (L2D) to multiple experts. In particular, we address the open problems of deriving a consistent surrogate loss, confidence calibration, and principled ensembling of experts. First, we derive two consistent surrogates (one based on a softmax parameterization, the other on a one-vs-all (OvA) parameterization) that are analogous to the single-expert losses proposed by Mozannar and Sontag (2020) and Verma and Nalisnick (2022), respectively. We then study the frameworks' ability to estimate P(m_j = y | x), the probability that the j-th expert will correctly predict the label for x. Theory shows that the softmax-based loss causes mis-calibration to propagate between the estimates, while the OvA-based loss does not (though in practice we find there are trade-offs). Lastly, we propose a conformal inference technique that chooses a subset of experts to query when the system defers. We perform empirical validation on tasks for galaxy, skin lesion, and hate speech classification.
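As a rough illustration of the two parameterizations, the sketch below computes estimates of P(m_j = y | x) from one vector of model outputs. It rests on assumptions that are not stated in the abstract: the model emits K class logits followed by one deferral logit per expert; the softmax-style estimate has the form p_j / (1 - sum of all deferral probabilities), extending the single-expert formulation; and the OvA-style estimate is a sigmoid of each expert's own deferral logit. The function names are hypothetical, and the paper's exact definitions may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_expert_confidences(logits, num_classes):
    """Softmax-style estimates (assumed form, see lead-in).

    logits: K class logits followed by J deferral logits.
    Every expert's estimate is divided by the same quantity,
    1 - (total deferral probability), so the estimates are coupled
    through the shared normalization and are not bounded above by 1.
    """
    probs = softmax(logits)
    defer_probs = probs[num_classes:]
    denom = 1.0 - sum(defer_probs)  # equals the total class probability
    return [p / denom for p in defer_probs]


def ova_expert_confidences(logits, num_classes):
    """OvA-style estimates (assumed form): an independent sigmoid per expert."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits[num_classes:]]


if __name__ == "__main__":
    # Hypothetical outputs: 3 classes, 2 experts.
    example_logits = [2.0, 0.5, -1.0, 1.0, -0.5]
    print(softmax_expert_confidences(example_logits, num_classes=3))
    print(ova_expert_confidences(example_logits, num_classes=3))
```

Under these assumptions, the OvA-style value for each expert always lies in (0, 1) and depends only on that expert's own logit, whereas the softmax-style value is a ratio of softmax outputs that share a single normalizer.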

Rajeev Verma, Daniel Barrejón, Eric Nalisnick • 2022

Related benchmarks

Task	Dataset	Result
Learning to Defer	CIFAR-10H (test)	Coverage 37.5	25
Classification with expert deferral	CIFAR-10 redundant expert suite (val)	System Accuracy 90.1	21
Learning to Defer	CIFAR-10 with redundant synthetic experts	System Accuracy 90.1	21
Learning to Defer	ImageNet Overlapped Dog Expert	Error Rate 41.56	20
Learning to Defer	CIFAR-100 Overlapped Animal Expert	Error Rate 16.1	20
Learning to Defer	CIFAR-100 Animal Expert	Error Rate 18.48	20
Learning to Defer	CIFAR-100 varying-accuracy synthetic expert (test)	Error Rate 18.96	20
Learning to Defer	ImageNet Dog Expert	Error Rate 42.74	20
Learning to Defer	ImageNet varying-accuracy synthetic expert (val)	Error 42.58	20
Learning to Defer	CIFAR-10H	System Accuracy 95.9	18

Showing 10 of 17 rows
