PARM: Multi-Objective Test-Time Alignment via Preference-Aware Autoregressive Reward Model

About

Multi-objective test-time alignment aims to adapt large language models (LLMs) to diverse multi-dimensional user preferences during inference while keeping the LLMs frozen. Recently, GenARM (Xu et al., 2025) independently trains an Autoregressive Reward Model (ARM) for each preference dimension, without awareness of the others, and then combines their outputs according to user-specific preference vectors during inference to achieve multi-objective test-time alignment. This leads to two key limitations: the need for multiple ARMs increases the inference cost, and the separate training of the ARMs causes misalignment between the guided generation and the user preferences. To address these issues, we propose Preference-aware ARM (PARM), a single unified ARM trained across all preference dimensions. PARM uses our proposed Preference-Aware Bilinear Low-Rank Adaptation (PBLoRA), which employs a bilinear form to condition the ARM on preference vectors, enabling precise control over preference trade-offs during inference. Experiments demonstrate that PARM reduces inference costs and achieves better alignment with preference vectors than existing methods. Additionally, PARM enables weak-to-strong guidance, allowing a smaller PARM to guide a larger frozen LLM without expensive training, which makes multi-objective alignment accessible with limited computing resources. The code is available at https://github.com/Baijiong-Lin/PARM.
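
The abstract does not give the exact form of PBLoRA, so the following is only a minimal sketch of one way to read "a bilinear low-rank update conditioned on a preference vector". It assumes the update has the form B · M(α) · A, where the r × r core M(α) is a weighted sum of per-dimension core matrices. The names `preference_core`, `adapted_weight`, and `cores`, and the linear mixing rule itself, are hypothetical illustrations and are not taken from the paper or its repository.

```python
import numpy as np

def preference_core(alpha, cores):
    """Mix K hypothetical r x r core matrices using preference weights alpha.

    ASSUMPTION: the core is linear in alpha; the source does not specify this.
    """
    alpha = np.asarray(alpha, dtype=float)
    # Contract the preference axis: sum_k alpha[k] * cores[k] -> (r, r)
    return np.tensordot(alpha, cores, axes=1)

def adapted_weight(w0, b, a, alpha, cores):
    """Return the frozen weight plus the low-rank update B @ M(alpha) @ A."""
    return w0 + b @ preference_core(alpha, cores) @ a

# Toy sizes: 6 x 5 weight, rank 2, two preference dimensions.
rng = np.random.default_rng(0)
d_out, d_in, r, k = 6, 5, 2, 2
w0 = rng.normal(size=(d_out, d_in))   # frozen base weight
b = rng.normal(size=(d_out, r))       # shared low-rank factor
a = rng.normal(size=(r, d_in))        # shared low-rank factor
cores = rng.normal(size=(k, r, r))    # one hypothetical core per dimension

w_first = adapted_weight(w0, b, a, [1.0, 0.0], cores)   # only dimension 1
w_second = adapted_weight(w0, b, a, [0.0, 1.0], cores)  # only dimension 2
w_mix = adapted_weight(w0, b, a, [0.3, 0.7], cores)     # a trade-off
```

Under this assumed form, a single set of factors serves every preference vector, and only the small r × r core changes with α. This illustrates how one conditioned model could replace several independently trained ones, though the actual PBLoRA parameterization may differ.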

Baijiong Lin, Weisen Jiang, Yuancheng Xu, Hao Chen, Ying-Cong Chen • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Reddit Summary Alignment | Reddit Summary normalized rewards (test) | Faithfulness Reward: 0.55 | 60 |
| Helpful Assistant Alignment | Helpful Assistant normalized rewards (test) | Helpfulness Reward (r1): 48 | 60 |
| Assistant Response Alignment (Helpfulness and Harmlessness) | HH-RLHF (test) | Helpfulness Win Rate: 31 | 31 |
| Helpfulness | Alpaca Eval | Alpaca Eval (%): 14.71 | 22 |
| Prosocial Alignment | PKUSafeRLHF (test) | MIP: 69.2 | 14 |
| Prosocial Alignment | NicheHazardQA (test) | MIP: 64.1 | 14 |
| Prosocial Alignment | HEX-PHI (test) | MIP: 57.6 | 14 |
| Prosocial Alignment | ProsocialBench (test) | MIP: 64.4 | 14 |
| Prosocial Alignment | HarmEval (test) | MIP: 57.6 | 14 |
| Safety Alignment | Alpaca 7B (test) | HV Score: 1.0895 | 5 |
Showing 10 of 13 rows
