
Regression-aware Inference with LLMs

About

Large language models (LLMs) have shown strong results on a range of applications, including regression and scoring tasks. Typically, one obtains outputs from an LLM via autoregressive sampling from the model's output distribution. We show that this inference strategy can be sub-optimal for common regression and scoring evaluation metrics. As a remedy, we build on prior work on Minimum Bayes Risk decoding, and propose alternate inference strategies that estimate the Bayes-optimal solution for regression and scoring metrics in closed-form from sampled responses. We show that our proposal significantly improves over baselines across datasets and models.
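The core idea can be illustrated with a minimal sketch. For squared error the Bayes-optimal point prediction is the mean of the predictive distribution, and for absolute error it is the median, so both can be estimated in closed form from sampled responses. The helper `mbr_prediction` below is a hypothetical illustration (not the authors' code), assuming the LLM's sampled responses have already been parsed into numeric scores:

```python
import statistics

def mbr_prediction(scores, metric="mse"):
    """Closed-form Minimum Bayes Risk estimate from sampled scores.

    For squared error (MSE) the risk minimizer is the sample mean;
    for absolute error (MAE) it is the sample median.
    """
    if metric == "mse":
        return statistics.fmean(scores)
    elif metric == "mae":
        return statistics.median(scores)
    raise ValueError(f"unsupported metric: {metric}")

# Example: numeric scores parsed from k = 5 sampled LLM responses.
sampled = [4.0, 5.0, 4.0, 3.0, 4.0]
print(mbr_prediction(sampled, "mse"))  # sample mean -> 4.0
print(mbr_prediction(sampled, "mae"))  # sample median -> 4.0
```

In contrast, standard autoregressive decoding would return a single sampled score (or the mode), which need not minimize the expected loss under the target metric.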

Michal Lukasik, Harikrishna Narasimhan, Aditya Krishna Menon, Felix Yu, Sanjiv Kumar • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Reward Modeling | RewardBench v1.0 (test) | Chat Score | 0.595 | 27 |
| LLM-as-a-judge evaluation | MT-Bench | Pearson's r | 0.547 | 16 |
| LLM-as-a-judge evaluation | Vicuna-bench | Pearson's r | 0.485 | 16 |
| LLM-as-a-judge evaluation | FLASK | Pearson's r | 0.412 | 16 |
| LLM-as-a-judge evaluation | FB Bench (Feedback Bench) | Pearson's r | 0.683 | 16 |
| Feedback Evaluation Alignment | MT-Bench | Kendall's Tau | 0.398 | 11 |
| Feedback Evaluation Alignment | Feedback Bench | Kendall's Tau | 13 | 6 |
| Feedback Evaluation Alignment | FLASK | Kendall's Tau | 0.109 | 6 |
| Feedback Evaluation Alignment | Vicuna-bench | Kendall's Tau | 0.122 | 6 |
| Feedback Evaluation | Vicuna Bench (test) | Kendall's Tau | 0.36 | 5 |

Showing 10 of 12 rows.
