Regression-aware Inference with LLMs

About

Large language models (LLMs) have shown strong results on a range of applications, including regression and scoring tasks. Typically, one obtains outputs from an LLM via autoregressive sampling from the model's output distribution. We show that this inference strategy can be sub-optimal for common regression and scoring evaluation metrics. As a remedy, we build on prior work on Minimum Bayes Risk decoding, and propose alternate inference strategies that estimate the Bayes-optimal solution for regression and scoring metrics in closed-form from sampled responses. We show that our proposal significantly improves over baselines across datasets and models.

Michal Lukasik, Harikrishna Narasimhan, Aditya Krishna Menon, Felix Yu, Sanjiv Kumar• 2024

Related benchmarks

Task	Dataset	Result
Reward Modeling	RewardBench v1.0 (test)	Average Score0.479	89
LLM-as-a-judge evaluation	MT-Bench	Pearson's r0.547	36
LLM-as-a-judge evaluation	FB Bench (Feedback Bench)	Pearson's r0.683	36
LLM-as-a-judge evaluation	FLASK	Pearson's r0.412	36
LLM-as-a-judge evaluation	Vicuna-bench	Pearson Correlation (r)0.485	16
Feedback Evaluation Alignment	MT-Bench	Kendall's Tau0.398	11
Regression	Macro-average SICKR-STS, STS-B, WMT_RU_EN, WMT_EN_ZH, WMT_SI_EN (test)	Pearson Correlation (r)61.6	11
Feedback Evaluation Alignment	Feedback Bench	Kendall's Tau13	6
Feedback Evaluation Alignment	FLASK	Kendall's Tau0.109	6
Feedback Evaluation Alignment	Vicuna-bench	Kendall's Tau0.122	6

Showing 10 of 13 rows

Other info

Follow for update

@wizwand_team Discord