Latent Preference Modeling for Cross-Session Personalized Tool Calling
About
Users often omit essential details in their requests to LLM-based agents, resulting in under-specified inputs for tool use. This poses a fundamental challenge for tool-augmented agents, as API execution typically requires complete arguments, highlighting the need for personalized tool calling. To study this problem, we introduce MPT, a benchmark comprising 265 multi-session dialogues that cover three challenges: Preference Recall, Preference Induction, and Preference Transfer. We also propose PRefine, a test-time memory-augmented method that represents user preferences as evolving hypotheses. Through a generate-verify-refine loop, it extracts reusable constraints from history and improves tool-calling accuracy while using only 1.24% of the tokens required by full-history prompting. These results indicate that robust personalization in agentic systems depends on memory that captures the reasons behind user choices, not just the choices themselves.
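To make the idea of preferences as evolving hypotheses concrete, here is a minimal toy sketch of a generate, verify, and refine cycle over session history. It is not the PRefine implementation: the abstract does not specify how hypotheses are represented or checked, so the `Hypothesis` class, the slot-value representation, the function names, and the example slots (`cuisine`, `price`, `location`) are all hypothetical. A real system would presumably use an LLM for each step and store richer constraints than exact slot values.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Hypothesis:
    """A candidate preference (assumed form): 'slot should take value'."""
    slot: str
    value: str
    support: int = 0  # number of sessions that explicitly agreed with it


def generate(session: dict[str, str], memory: dict[str, Hypothesis]) -> None:
    """Propose a hypothesis for every observed slot not yet in memory."""
    for slot, value in session.items():
        if slot not in memory:
            memory[slot] = Hypothesis(slot, value)


def verify(hyp: Hypothesis, session: dict[str, str]) -> bool:
    """A hypothesis survives unless the session explicitly contradicts it."""
    return session.get(hyp.slot, hyp.value) == hyp.value


def refine(hyp: Hypothesis, session: dict[str, str]) -> Hypothesis:
    """Replace a contradicted hypothesis with the newly observed value."""
    return Hypothesis(hyp.slot, session[hyp.slot], support=1)


def update_memory(memory: dict[str, Hypothesis], session: dict[str, str]) -> None:
    """Run one generate, verify, refine pass for a single session."""
    generate(session, memory)
    for slot, hyp in list(memory.items()):
        if verify(hyp, session):
            if slot in session:
                hyp.support += 1
        else:
            memory[slot] = refine(hyp, session)


def fill_arguments(request: dict[str, str], memory: dict[str, Hypothesis],
                   required: list[str]) -> dict[str, str]:
    """Complete an under-specified tool call from the compact memory."""
    filled = dict(request)
    for slot in required:
        if slot not in filled and slot in memory:
            filled[slot] = memory[slot].value
    return filled


# Hypothetical history: the arguments the user settled on in three past sessions.
history = [
    {"cuisine": "vegan", "price": "cheap"},
    {"cuisine": "vegan"},
    {"cuisine": "vegan", "price": "moderate"},
]
memory: dict[str, Hypothesis] = {}
for past_session in history:
    update_memory(memory, past_session)

# A new request that omits two required arguments.
call = fill_arguments({"location": "Seoul"}, memory,
                      required=["location", "cuisine", "price"])
```

In this toy version the memory holds one small record per slot instead of the full dialogue history, which is the intuition behind the token savings reported above. It only tracks which values were chosen, so it does not capture the reasons behind those choices that the abstract identifies as essential.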
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Latent Preference Modeling | MPT Context-Free, Preference Recall | Precision | 76.1 | 19 |
| Latent Preference Modeling | MPT Context-Free Preference Induction | Precision | 0.5487 | 19 |
| Latent Preference Modeling | MPT Context-Free, Preference Transfer | Precision | 30.92 | 19 |
| Latent Preference Modeling | MPT Context-Free Average | F1 Score | 58.5 | 19 |
| Preference-driven Tool Calling | MPT Context-Guided, Preference Recall | P-EM | 64.88 | 19 |
| Preference-driven Tool Calling | MPT Context-Guided, Preference Induction | P-EM | 37.95 | 19 |
| Preference-driven Tool Calling | MPT Context-Guided, Preference Transfer | P-EM | 26.19 | 19 |
| Preference-driven Tool Calling | MPT Context-Guided Average | OA-F1 | 67.18 | 19 |