How Does Personalized Memory Shape LLM Behavior? Benchmarking Rational Preference Utilization in Personalized Assistants
About
Large language model (LLM)-powered assistants have recently integrated memory mechanisms that record user preferences, leading to more personalized and user-aligned responses. However, irrelevant personalized memories are often introduced into the context, interfering with the LLM's intent understanding. To comprehensively investigate the dual effects of personalization, we develop RPEval, a benchmark comprising a personalized intent reasoning dataset and a multi-granularity evaluation protocol. RPEval reveals the widespread phenomenon of irrational personalization in existing LLMs and, through error pattern analysis, illustrates its negative impact on user experience. Finally, we introduce RP-Reasoner, which treats memory utilization as a pragmatic reasoning process, enabling the selective integration of personalized information. Experimental results demonstrate that our method significantly outperforms carefully designed baselines on RPEval, and resolves 80% of the bad cases observed in a large-scale commercial personalized assistant, highlighting the potential of pragmatic reasoning to mitigate irrational personalization. Our benchmark is publicly available at https://github.com/XueyangFeng/RPEval.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Discriminative Reasoning | RPEval Multi-MACRO, Explicit Memory | IA Score | 0.4 | 16 |
| Discriminative Reasoning | RPEval Multi-MICRO, Explicit Memory | IA Score | 0.71 | 16 |
| Discriminative Task | RPEval Implicit Memory, Multi-Preference | IA (Macro) | 0.44 | 16 |
| Generative task evaluation | Generative tasks single-preference setting | Accuracy (Ignored) | 78 | 16 |
| Generative tasks | Generative tasks multi-preference setting | Macro Acc (IA) | 63 | 16 |
| Discriminative Reasoning | RPEval Single, Explicit Memory | Ignorance Score | 0.5 | 16 |
| Discriminative Task | RPEval Implicit Memory, Single-Preference | Ignorance Score | 0.54 | 16 |
| Intent Alignment and Over-personalization Detection | Personalization Dataset Discriminative Setting | Macro IA | 38 | 8 |
| Personalized Response Generation | RPEval | Macro Accuracy | 24 | 4 |
| Personalized Response Generation | Real-world failure cases from large-scale commercial PA | Macro Accuracy | 73.4 | 4 |