Test-Time Alignment via Hypothesis Reweighting

About

Reward models trained on aggregate preferences often fail to capture individual users' values, but existing adaptation methods such as fine-tuning or long-context conditioning are too costly for real-time personalization. We propose Hypothesis Reweighting (HyRe), which enables real-time personalization by reweighting ensemble members using just 1-5 labeled examples from the target user or domain. Our method builds on the empirical observation that when different heads capture different valid interpretations of preference data, reweighting them can substantially outperform uniform averaging. HyRe trains a single network with multiple prediction heads that capture different valid interpretations of preference data, then uses a Bayesian update to upweight the heads that best match the target user's preferences. This requires only a single forward pass with negligible (<1%) computational overhead, making it practical for inference-time personalization. We evaluate HyRe across diverse target preference distributions. With as few as five preference pairs per target distribution, HyRe surpasses state-of-the-art reward models on RewardBench at 2B and 8B scale and improves reward model accuracy by 20% across 32 personalization tasks.
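The reweighting step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each head outputs a probability that the first response in a pair is preferred, and it reads the "Bayesian update" as a posterior over heads proportional to a uniform prior times each head's likelihood on the few labeled pairs. The function names `reweight_heads` and `ensemble_predict` and the toy numbers are hypothetical.

```python
import math

def reweight_heads(head_probs, labels, prior=None):
    """Return posterior weights over heads (assumed form, not the paper's code).

    head_probs[k][i]: head k's probability that the first response in
                      labeled pair i is preferred.
    labels[i]:        1 if the target user preferred the first response, else 0.
    """
    num_heads = len(head_probs)
    if prior is None:
        prior = [1.0 / num_heads] * num_heads  # uniform prior = plain averaging
    eps = 1e-12  # clamp probabilities to avoid log(0)
    log_post = []
    for k in range(num_heads):
        total = math.log(prior[k])
        for p, y in zip(head_probs[k], labels):
            p = min(max(p, eps), 1.0 - eps)
            total += math.log(p if y == 1 else 1.0 - p)
        log_post.append(total)
    # Normalize in log space for numerical stability.
    top = max(log_post)
    unnorm = [math.exp(v - top) for v in log_post]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def ensemble_predict(weights, head_probs_new):
    """Weighted average of the heads' predictions on a new pair."""
    return sum(w * p for w, p in zip(weights, head_probs_new))

# Toy example: three heads, three labeled pairs from the target user.
head_probs = [
    [0.9, 0.8, 0.7],  # head 0 agrees with the user
    [0.5, 0.5, 0.5],  # head 1 is uninformative
    [0.2, 0.3, 0.1],  # head 2 disagrees with the user
]
labels = [1, 1, 1]
weights = reweight_heads(head_probs, labels)

new_pair = [0.8, 0.5, 0.2]  # each head's prediction on an unseen pair
uniform = ensemble_predict([1 / 3] * 3, new_pair)
adapted = ensemble_predict(weights, new_pair)
```

In this toy case the head that matches the user's labels receives most of the weight, so the adapted prediction moves toward that head's output and away from the uniform average. The head outputs are computed once and only the weights change, which is consistent with the low adaptation overhead the abstract reports.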

Yoonho Lee, Jonathan Williams, Henrik Marklund, Archit Sharma, Eric Mitchell, Anikait Singh, Chelsea Finn • 2024

Related benchmarks

Task	Dataset	Metric	Result
Reward Modeling	RewardBench	Safety Score	96.7	284
Reward Modeling	RewardBench (full)	Chat Score	99.2	41
Out-of-distribution classification	Camelyon17 WILDS OOD (test)	Accuracy	75.2	16
Satellite Image Classification	FMOW-WILDS	Worst-Region Accuracy	32.8	11
Toxicity Classification	CivilComments WILDS	Worst-Group Accuracy	58.1	11
Reward Modeling	RPR n=5 (test)	User Friendliness	67.4	10
Species Classification	iWildCam-WILDS	Macro F1	0.31	9
Sentiment Classification	Amazon WILDS	10th Percentile Accuracy	54.2	8
LLM Personalization	PersonalLLM	Score (User 1)	95.5	5
Personalized Preference Modeling	RPR (test)	Linguistic Creativity	37.5	5
