Test-Time Personalization: A Diagnostic Framework and Probabilistic Fix for Scaling Failures
About
Existing approaches to LLM personalization focus on constructing better personalized models or inputs, while treating inference as a single-shot process. In this work, we study Test-Time Personalization (TTP) along an unexplored axis: scaling inference-time computation by sampling N candidates from a personalized policy model and selecting the best with a personalized reward model. We prove that oracle selection yields expected utility that grows logarithmically with the number of sampled candidates, establishing a theoretical ceiling for test-time scaling. However, standard reward models fail to realize this potential. To diagnose why, we derive a unified scaling law that decomposes any reward model's Best-of-N curve into four measurable quantities and reveals two failure modes: user-level collapse (near-constant predictions for some users) and query-level reward hacking (negative correlation with true quality for some queries). Guided by this law, we propose a probabilistic personalized reward model whose learned variance mitigates both failure modes. Experiments confirm both elements of our framework: TTP delivers consistent scaling across multiple policy models and personalized text generation tasks, and our scaling law closely matches observed scaling curves across reward-model variants.
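The Best-of-N procedure described above (sample N candidates, score them, keep the top-scoring one) can be illustrated with a small simulation. This is only a toy sketch, not the paper's experimental setup: the Gaussian "true utilities", the additive-noise stand-in for a reward model, and the names `best_of_n` and `simulate` are all assumptions introduced here for illustration. It contrasts oracle selection (scoring with the true utility) against selection by a noisy reward model.

```python
import random
from typing import Dict, Sequence, Tuple


def best_of_n(utilities: Sequence[float], scores: Sequence[float]) -> float:
    """Return the true utility of the candidate with the highest score."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return utilities[best_index]


def simulate(
    ns: Sequence[int],
    trials: int = 2000,
    noise_std: float = 1.0,
    seed: int = 0,
) -> Dict[int, Tuple[float, float]]:
    """Average selected utility per N as (oracle, noisy reward model).

    Assumption: candidate utilities are i.i.d. standard normal and the
    reward model sees utility plus Gaussian noise. Both are toy choices.
    """
    rng = random.Random(seed)
    max_n = max(ns)
    totals = {n: [0.0, 0.0] for n in ns}
    for _ in range(trials):
        # One pool of candidates per trial; each N uses the first N of them.
        utilities = [rng.gauss(0.0, 1.0) for _ in range(max_n)]
        scores = [u + rng.gauss(0.0, noise_std) for u in utilities]
        for n in ns:
            # Oracle selection: score candidates by their true utility.
            totals[n][0] += best_of_n(utilities[:n], utilities[:n])
            # Reward-model selection: score candidates by the noisy estimate.
            totals[n][1] += best_of_n(utilities[:n], scores[:n])
    return {n: (o / trials, r / trials) for n, (o, r) in totals.items()}


if __name__ == "__main__":
    for n, (oracle, noisy) in simulate([1, 2, 4, 8, 16, 32]).items():
        print(f"N={n:>2}  oracle={oracle:.3f}  reward_model={noisy:.3f}")
```

Because every N reuses a prefix of the same candidate pool, the oracle curve in this sketch can never decrease with N, and the noisy selector can never beat the oracle. The gap between the two curves is the kind of shortfall that the scaling law in the abstract is meant to diagnose.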
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Personalized Text Generation | LaMP-4 | ROUGE | 17 | 5 |
| Personalized Text Generation | LaMP-5 | ROUGE | 43 | 5 |
| Personalized Text Generation | LongLaMP Abstract | ROUGE | 26.1 | 5 |
| Personalized Text Generation | LongLaMP Topic | ROUGE | 21.3 | 5 |
| Personalized Text Generation | LongLaMP Product | ROUGE | 19.4 | 5 |
| Personalized Reward Modeling | LaMP-4 | ROUGE@N=30 | 17.9 | 4 |