GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment

About

Large Language Models (LLMs) exhibit impressive capabilities but require careful alignment with human preferences. Traditional training-time methods finetune LLMs using human preference datasets but incur significant training costs and require repeated training to handle diverse user preferences. Test-time alignment methods address this by using reward models (RMs) to guide frozen LLMs without retraining. However, existing test-time approaches rely on trajectory-level RMs which are designed to evaluate complete responses, making them unsuitable for autoregressive text generation that requires computing next-token rewards from partial responses. To address this, we introduce GenARM, a test-time alignment approach that leverages the Autoregressive Reward Model--a novel reward parametrization designed to predict next-token rewards for efficient and effective autoregressive generation. Theoretically, we demonstrate that this parametrization can provably guide frozen LLMs toward any distribution achievable by traditional RMs within the KL-regularized reinforcement learning framework. Experimental results show that GenARM significantly outperforms prior test-time alignment baselines and matches the performance of training-time methods. Additionally, GenARM enables efficient weak-to-strong guidance, aligning larger LLMs with smaller RMs without the high costs of training larger models. Furthermore, GenARM supports multi-objective alignment, allowing real-time trade-offs between preference dimensions and catering to diverse user preferences without retraining. Our project page is available at: https://genarm.github.io.
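The guidance idea described above can be illustrated with a toy, single-step sketch. It assumes that guidance is applied by adding weighted next-token reward log-probabilities to the frozen model's next-token log-probabilities and renormalizing, a form consistent with KL-regularized RL but not spelled out in this abstract. The function names, the toy three-token vocabulary, and the weights are all hypothetical, and setting a weight to zero stands in for trading off one preference dimension against another at test time.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def guided_next_token_logprobs(base_logits, reward_logits_list, weights):
    """Combine a frozen LM's next-token logits with next-token reward scores.

    Assumption (not taken from the abstract): each reward model emits
    per-token scores that are normalized into log-probabilities, scaled
    by a weight, and added to the base log-probabilities.
    """
    combined = log_softmax(base_logits)
    for reward_logits, weight in zip(reward_logits_list, weights):
        reward_logprobs = log_softmax(reward_logits)
        combined = [c + weight * r for c, r in zip(combined, reward_logprobs)]
    # Renormalize so the result is a valid next-token distribution.
    return log_softmax(combined)


# Toy vocabulary of three tokens (hypothetical values).
base_logits = [2.0, 1.0, 0.0]          # frozen LM prefers token 0
helpful_rm_logits = [0.0, 2.0, 0.0]    # "helpfulness" scores favor token 1
harmless_rm_logits = [0.0, 0.0, 2.0]   # "harmlessness" scores favor token 2

unguided = guided_next_token_logprobs(base_logits, [], [])
guided = guided_next_token_logprobs(
    base_logits,
    [helpful_rm_logits, harmless_rm_logits],
    [1.0, 0.0],  # weight only the first preference dimension
)
```

Changing the weights changes the resulting distribution without touching the base model's parameters, which is the sense in which the trade-off happens at test time in this sketch.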

Yuancheng Xu, Udari Madhushani Sehwag, Alec Koppel, Sicheng Zhu, Bang An, Furong Huang, Sumitra Ganesh • 2024

Related benchmarks

Task	Dataset	Result
Machine Translation	WMT literary translation (zh→en) 24	SEGALE Comet Score 61.18	13
Machine Translation	WMT literary translation (zh→ru) 24	SEGALE_comet 55.67	13
Machine Translation	WMT literary translation (zh→de) 24	SEGALE-COMET Score 60.96	13
Helpful Assistant	HH-RLHF	HV Score 7.27	10
Safety Alignment	Alpaca 7B (test)	HV Score 0.9768	5
Safety Alignment	PKU-SafeRLHF 10K	HV 110.7	4
Safety Alignment	PKU-SafeRLHF-10K (test)	HV Score 262.2	3
Helpful assistant task	Tulu-2 13B	HV Score 0.9737	3
Safety Alignment	Alpaca-65B Weak-to-strong Safety Alignment (test)	HV Score 111.7	3
