
T-POP: Test-Time Personalization with Online Preference Feedback

About

Personalizing large language models (LLMs) to individual user preferences is a critical step beyond generating generically helpful responses. However, current personalization methods are ill-suited for new users, as they typically require either slow, resource-intensive fine-tuning or a substantial amount of pre-existing user data, creating a significant cold-start problem. To address this challenge, we introduce a new paradigm for real-time personalization by learning from online pairwise preference feedback collected during text generation. We propose T-POP (Test-Time Personalization with Online Preference Feedback), a novel algorithm that combines test-time alignment with dueling bandits. Without updating the LLM parameters, T-POP steers the decoding process of a frozen LLM by learning a reward function online that captures user preferences. By leveraging dueling bandits, T-POP chooses which queries to pose to the user so as to balance exploring their preferences with exploiting the learned knowledge to generate personalized text. Extensive experiments demonstrate that T-POP achieves rapid and data-efficient personalization, significantly outperforming existing baselines and showing consistent improvement with more user interactions.
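The loop described in the abstract (pick two candidate continuations, ask the user which one they prefer, update a reward model, repeat) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a linear Bradley-Terry reward over hypothetical feature vectors, a diagonal uncertainty estimate for the exploration bonus, and a simulated user with a hidden preference vector. The names `DuelingPreferenceLearner`, `select_pair`, `update`, and `simulate` are made up for illustration, and no LLM is involved; in a real system the feature vectors would come from candidate text segments produced by the frozen model.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class DuelingPreferenceLearner:
    """Toy linear reward model trained from pairwise preferences (assumed design)."""

    def __init__(self, dim, lr=0.5, explore=1.0, reg=1.0):
        self.theta = [0.0] * dim      # reward parameters, learned online
        self.v_diag = [reg] * dim     # diagonal stand-in for a design matrix
        self.lr = lr
        self.explore = explore

    def reward(self, x):
        return dot(self.theta, x)

    def select_pair(self, candidates):
        # Exploit: the first arm is the candidate with the highest estimated reward.
        first = max(range(len(candidates)), key=lambda i: self.reward(candidates[i]))
        x1 = candidates[first]

        # Explore: the second arm adds a bonus for being uncertain relative to the first.
        def score(i):
            diff = [a - b for a, b in zip(candidates[i], x1)]
            bonus = math.sqrt(sum(d * d / v for d, v in zip(diff, self.v_diag)))
            return self.reward(candidates[i]) + self.explore * bonus

        others = [i for i in range(len(candidates)) if i != first]
        second = max(others, key=score)
        return first, second

    def update(self, winner, loser):
        # One gradient-ascent step on the Bradley-Terry log-likelihood.
        diff = [a - b for a, b in zip(winner, loser)]
        p_win = 1.0 / (1.0 + math.exp(-dot(self.theta, diff)))
        self.theta = [t + self.lr * (1.0 - p_win) * d for t, d in zip(self.theta, diff)]
        self.v_diag = [v + d * d for v, d in zip(self.v_diag, diff)]


def simulate(rounds=50, n_candidates=4, seed=0):
    rng = random.Random(seed)
    w_true = [1.0, -1.0, 0.5]  # hidden preference of the simulated user (assumption)
    learner = DuelingPreferenceLearner(dim=3)
    for _ in range(rounds):
        # Stand-ins for features of candidate continuations from a frozen LLM.
        candidates = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(n_candidates)]
        i, j = learner.select_pair(candidates)
        # The simulated user always prefers the candidate with higher true reward.
        if dot(w_true, candidates[i]) >= dot(w_true, candidates[j]):
            learner.update(candidates[i], candidates[j])
        else:
            learner.update(candidates[j], candidates[i])
    return learner, w_true
```

Calling `simulate()` returns the learner and the hidden preference vector. Because every update moves `theta` toward the preferred candidate, the learned parameters end up positively aligned with the simulated user's preference, and the model weights of any underlying LLM would never need to change.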

Zikun Qu, Min Zhang, Mingze Kong, Xiang Li, Zhiwei Shang, Zhiyong Wang, Yikun Ban, Shuang Qiu, Yao Shu, Zhongxiang Dai • 2025

Related benchmarks

Task                        Dataset       Result                           Rank
Personalization             Personal      Creative Score (ArmoRM) 0.991    33
Personalization             HelpSteer     Creative ArmoRM Score 0.51       18
Personalization             Truthful QA   Creative Score (ArmoRM) 53       18
Personalization             Ultra Chat    Creative ArmoRM Score 50         18
Test-Time Personalization   HelpSteer     Creative Win Rate 99.5           15
Test-Time Personalization   Truthful QA   Creative Win Rate 99.6           15
