
How Does Personalized Memory Shape LLM Behavior? Benchmarking Rational Preference Utilization in Personalized Assistants

About

Large language model (LLM)-powered assistants have recently integrated memory mechanisms that record user preferences, leading to more personalized and user-aligned responses. However, irrelevant personalized memories are often introduced into the context, interfering with the LLM's intent understanding. To comprehensively investigate the dual effects of personalization, we develop RPEval, a benchmark comprising a personalized intent reasoning dataset and a multi-granularity evaluation protocol. RPEval reveals the widespread phenomenon of irrational personalization in existing LLMs and, through error pattern analysis, illustrates its negative impact on user experience. Finally, we introduce RP-Reasoner, which treats memory utilization as a pragmatic reasoning process, enabling the selective integration of personalized information. Experimental results demonstrate that our method significantly outperforms carefully designed baselines on RPEval, and resolves 80% of the bad cases observed in a large-scale commercial personalized assistant, highlighting the potential of pragmatic reasoning to mitigate irrational personalization. Our benchmark is publicly available at https://github.com/XueyangFeng/RPEval.
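The "selective integration of personalized information" described above can be sketched as a filtering step before prompt construction: each stored preference is only injected into the context if it appears relevant to the current query. This is a minimal illustrative sketch, not the paper's RP-Reasoner; the `relevance` function here is a hypothetical stand-in (toy keyword overlap) for the pragmatic reasoning step, and all names and thresholds are assumptions.

```python
def relevance(query: str, memory: str) -> float:
    """Toy relevance score: fraction of memory words that appear in the query.
    A stand-in for the pragmatic reasoning the paper describes."""
    q = set(query.lower().split())
    m = set(memory.lower().split())
    return len(q & m) / len(m) if m else 0.0

def build_prompt(query: str, memories: list[str], threshold: float = 0.3) -> str:
    """Keep only memories judged relevant to the query, then prepend them,
    so irrelevant preferences never reach the LLM's context."""
    kept = [m for m in memories if relevance(query, m) >= threshold]
    if not kept:
        return f"Query: {query}"
    context = "\n".join(f"- {m}" for m in kept)
    return f"User preferences:\n{context}\nQuery: {query}"

memories = [
    "prefers vegetarian restaurants",   # relevant to a lunch query
    "writes python code daily",         # irrelevant; should be filtered out
]
prompt = build_prompt("recommend a vegetarian lunch spot", memories)
```

With this setup, the food preference survives the filter while the coding habit is dropped, avoiding the kind of irrational personalization the benchmark measures.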

Xueyang Feng, Weinan Gan, Xu Chen, Quanyu Dai, Yong Liu• 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Discriminative Reasoning | RPEval Multi-Preference, Explicit Memory | IA Score (Macro): 0.4 | 16 |
| Discriminative Reasoning | RPEval Multi-Preference, Explicit Memory | IA Score (Micro): 0.71 | 16 |
| Discriminative Task | RPEval Implicit Memory, Multi-Preference | IA (Macro): 0.44 | 16 |
| Generative Task Evaluation | Generative tasks, single-preference setting | Accuracy (Ignored): 78 | 16 |
| Generative Tasks | Generative tasks, multi-preference setting | Macro Acc (IA): 63 | 16 |
| Discriminative Reasoning | RPEval Single-Preference, Explicit Memory | Ignorance Score: 0.5 | 16 |
| Discriminative Task | RPEval Implicit Memory, Single-Preference | Ignorance Score: 0.54 | 16 |
| Intent Alignment and Over-personalization Detection | Personalization Dataset, Discriminative Setting | Macro IA: 38 | 8 |
| Personalized Response Generation | RPEval | Macro Accuracy: 24 | 4 |
| Personalized Response Generation | Real-world failure cases from a large-scale commercial personalized assistant | Macro Accuracy: 73.4 | 4 |