Batch Calibration: Rethinking Calibration for In-Context Learning and Prompt Engineering

About

Prompting and in-context learning (ICL) have become efficient learning paradigms for large language models (LLMs). However, LLMs suffer from prompt brittleness and various bias factors in the prompt, including but not limited to the formatting, the choice of verbalizers, and the ICL examples. To address this problem, which results in unexpected performance degradation, calibration methods have been developed to mitigate the effects of these biases while recovering LLM performance. In this work, we first conduct a systematic analysis of the existing calibration methods, in which we both provide a unified view and reveal the failure cases. Inspired by these analyses, we propose Batch Calibration (BC), a simple yet intuitive method that controls the contextual bias from the batched input, unifies various prior approaches, and effectively addresses the aforementioned issues. BC is zero-shot, inference-only, and incurs negligible additional costs. In the few-shot setup, we further extend BC to allow it to learn the contextual bias from labeled data. We validate the effectiveness of BC with PaLM 2-(S, M, L) and CLIP models and demonstrate state-of-the-art performance over previous calibration baselines across more than 10 natural language understanding and image classification tasks.
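
The abstract does not spell out how BC "controls the contextual bias from the batched input". A minimal sketch of one plausible reading follows, assuming the bias is estimated as the per-class mean of the model's log-probabilities over an unlabeled batch and then subtracted from each input's scores. The function names `batch_calibrate` and `predict` and the toy numbers are hypothetical, chosen only for illustration; they are not taken from the paper.

```python
from typing import List


def batch_calibrate(log_probs: List[List[float]]) -> List[List[float]]:
    """Subtract the per-class mean score over the batch (hypothetical helper).

    log_probs[i][j] is the model's log-probability of class j for input i.
    Assumption: the contextual bias for class j is the batch mean of column j.
    """
    if not log_probs:
        return []
    n = len(log_probs)
    k = len(log_probs[0])
    # Estimated contextual bias: one value per class, averaged over the batch.
    bias = [sum(row[j] for row in log_probs) / n for j in range(k)]
    # Calibrated scores: raw score minus the estimated bias for that class.
    return [[row[j] - bias[j] for j in range(k)] for row in log_probs]


def predict(scores: List[List[float]]) -> List[int]:
    """Return the index of the highest-scoring class for each input."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy batch (made-up numbers): raw scores lean toward class 0 for every input.
batch = [
    [-0.2, -1.8],
    [-0.3, -1.4],
    [-0.6, -0.9],
    [-0.5, -1.0],
]
raw_preds = predict(batch)
calibrated_preds = predict(batch_calibrate(batch))
```

In this toy batch every raw prediction is class 0, while the calibrated predictions split between the two classes, because the shared offset toward class 0 is removed. No labels are used, which is consistent with the zero-shot, inference-only description above.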

Han Zhou, Xingchen Wan, Lev Proleev, Diana Mincu, Jilin Chen, Katherine Heller, Subhrajit Roy • 2023

Related benchmarks

Task	Dataset	Result
Subjectivity Classification	Subj	Accuracy 76.8	343
Text Classification	TREC	Accuracy 73.59	311
Question Classification	TREC	Accuracy 73.91	274
Topic Classification	AG-News	Accuracy 78.05	228
Text Classification	AGNews	Accuracy 78.28	161
Sentiment Analysis	SST-5	Accuracy 34.69	123
Text Classification	SST-5	Accuracy 34.92	119
Text Classification	Subj	CA (%) 77.03	94
Text Classification	SST2	Accuracy 94.3	71
Sentiment Analysis	FPB	Accuracy 83.52	65

Showing 10 of 61 rows
