
SDPO: Segment-Level Direct Preference Optimization for Social Agents

About

Social agents powered by large language models (LLMs) can simulate human social behaviors but fall short in handling complex social dialogues. Direct Preference Optimization (DPO) has proven effective in aligning LLM behavior with human preferences across various agent tasks. However, standard DPO focuses solely on individual turns, which limits its effectiveness in multi-turn social interactions. Several DPO-based multi-turn alignment methods with session-level data have shown potential in addressing this problem. While these methods consider multiple turns across entire sessions, they are often overly coarse-grained, introducing training noise, and lack robust theoretical support. To resolve these limitations, we propose Segment-Level Direct Preference Optimization (SDPO), which dynamically selects key segments within interactions to optimize multi-turn agent behavior. SDPO minimizes training noise and is grounded in a rigorous theoretical framework. Evaluations on the SOTOPIA benchmark demonstrate that SDPO-tuned agents consistently outperform both existing DPO-based methods and proprietary LLMs like GPT-4o, underscoring SDPO's potential to advance the social intelligence of LLM-based agents. We release our code and data at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/SDPO.
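The abstract contrasts turn-level and session-level DPO with SDPO's segment-level variant. As a rough illustration of the idea (not the paper's implementation), the sketch below applies the standard DPO objective while restricting the log-probability sums to a selected key segment of the dialogue via a token mask; the function name, inputs, and masking scheme are all assumptions for illustration only.

```python
import math

def segment_dpo_loss(policy_logps, ref_logps, segment_mask, beta=0.1):
    """Illustrative sketch of a segment-level DPO loss (not the paper's code).

    policy_logps / ref_logps: dicts with "chosen" and "rejected" lists of
    per-token log-probabilities under the policy and reference models.
    segment_mask: matching dicts of 0/1 flags selecting the key segment;
    tokens outside the segment contribute nothing to the loss.
    """
    def masked_sum(logps, mask):
        # Sum log-probs only over tokens inside the selected segment
        return sum(lp for lp, m in zip(logps, mask) if m)

    # Policy/reference log-ratio, restricted to the key segment
    chosen = (masked_sum(policy_logps["chosen"], segment_mask["chosen"])
              - masked_sum(ref_logps["chosen"], segment_mask["chosen"]))
    rejected = (masked_sum(policy_logps["rejected"], segment_mask["rejected"])
                - masked_sum(ref_logps["rejected"], segment_mask["rejected"]))

    # Standard DPO objective: -log sigmoid(beta * (chosen - rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * (chosen - rejected))))
```

A larger log-ratio margin between the chosen and rejected segments drives the loss toward zero, exactly as in turn- or session-level DPO; only the set of tokens entering the sums changes.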

Aobo Kong, Wentao Ma, Shiwan Zhao, Yongbin Li, Yuchuan Wu, Ke Wang, Xiaoqian Liu, Qicheng Li, Yong Qin, Fei Huang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Social Dialogue | SOTOPIA Self-Chat | GOAL | 8.56 | 28 |
| Social Dialogue | SOTOPIA Interaction with GPT-4o | Goal Score | 8.14 | 28 |
| Social Dialogue | SOTOPIA Overall (AVG) | AVG Score | 5.63 | 11 |
| Social Dialogue | SOTOPIA Interaction with GPT-4o-mini | GOAL Score | 7.53 | 11 |
| Next-item prediction | Amazon Review Industrial (test) | HR@3 | 0.1032 | 11 |
| Next-item prediction | Amazon Review Office (test) | HR@3 | 11.69 | 11 |

Other info

Code: https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/SDPO