Test-time reward-guided alignment of language models by importance sampling on pre-logit space

About

Test-time alignment of large language models (LLMs) has attracted attention because fine-tuning LLMs is computationally expensive. In this paper, we propose a new test-time reward-guided alignment method called adaptive importance sampling on pre-logits (AISP), based on sampling-based model predictive control with stochastic control inputs. AISP applies a Gaussian perturbation to the pre-logits, which are the outputs of the penultimate layer, and maximizes the expected reward with respect to the mean of the perturbation. We show that the optimal mean is obtained by importance sampling with sampled rewards. AISP outperforms best-of-n sampling in terms of reward for a given number of samples and achieves higher rewards than other reward-based test-time alignment methods.
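
The core idea in the abstract (perturb the pre-logits with Gaussian noise, score each sample with a reward, and re-estimate the perturbation mean by importance sampling) can be illustrated with a toy sketch. This is not the authors' implementation. The exponential reward weighting with temperature `lam`, the single-token setting, the hand-written output matrix `weights`, and the stand-in `toy_reward` (log-probability of a target token) are all assumptions made for illustration; the paper applies the idea to full generated responses scored by a reward model.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def logits_from(prelogits, weights):
    """Map pre-logits (penultimate-layer output) to logits with a linear head."""
    return [sum(w * h for w, h in zip(row, prelogits)) for row in weights]


def toy_reward(prelogits, weights, target):
    """Hypothetical reward: log-probability of a target token (stand-in for a reward model)."""
    return math.log(softmax(logits_from(prelogits, weights))[target])


def importance_sampling_mean_update(prelogits, weights, target,
                                    n_samples=32, n_iters=10,
                                    sigma=0.5, lam=0.1, seed=0):
    """Adapt the mean of a Gaussian perturbation on the pre-logits.

    Each iteration samples perturbations around the current mean, scores them,
    and sets the new mean to the reward-weighted average of the samples
    (weights proportional to exp(reward / lam); an assumed weighting).
    """
    rng = random.Random(seed)
    dim = len(prelogits)
    mean = [0.0] * dim
    for _ in range(n_iters):
        noises = [[rng.gauss(mean[j], sigma) for j in range(dim)]
                  for _ in range(n_samples)]
        rewards = [
            toy_reward([h + v for h, v in zip(prelogits, noise)], weights, target)
            for noise in noises
        ]
        # Subtract the max reward before exponentiating for numerical stability.
        r_max = max(rewards)
        ws = [math.exp((r - r_max) / lam) for r in rewards]
        total = sum(ws)
        mean = [sum(w * noise[j] for w, noise in zip(ws, noises)) / total
                for j in range(dim)]
    return mean


if __name__ == "__main__":
    prelogits = [0.2, -0.1]
    weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # toy 3-token vocabulary
    mean = importance_sampling_mean_update(prelogits, weights, target=2)
    before = toy_reward(prelogits, weights, 2)
    after = toy_reward([h + m for h, m in zip(prelogits, mean)], weights, 2)
    print(f"reward before: {before:.3f}, after: {after:.3f}")
```

In this toy setting, best-of-n would keep only the single highest-reward sample, whereas the weighted update reuses every sample to shift the sampling distribution for the next round.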

Sekitoshi Kanai, Tsukasa Yoshida, Hiroshi Takahashi, Haru Kuroki, Kazumune Hashimoto • 2025

Related benchmarks

Task	Dataset	Result
Code Generation	HumanEval	Pass@1 41.4	1043
Mathematical Reasoning	GSM8K (test)	Accuracy 67.5	954
Instruction Following	AlpacaEval 2.0	Win Rate 2.86	722
Reward Maximization	SHP	Win Rate 0.53	12
Reward model verification	HH-RLHF	Win Rate 47.3	12
Question Answering	TruthfulQA	BLEU Accuracy 42.6	2

