Breaking the Impasse: Dual-Scale Evolutionary Policy Training for Social Language Agents

About

While Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for closed-ended tasks, extending it to open-ended social language games via self-play reveals a critical issue: evolution impasse. Due to the vast strategy space, language agents frequently converge to homogenized behaviors, leading to deterministic match outcomes that eliminate the gradient signals necessary for policy evolution. To tackle this issue, we propose Dual-scale Evolutionary Policy Training (DEPT) for social language games. DEPT introduces a time-scaled evolutionary perception mechanism that detects impasse by quantifying dual-scale value baseline divergence alongside match entropy. Upon perceiving the collapse, it then activates asymmetric advantage reshaping to dynamically modulate the optimization landscape for intervention. Thus, our method effectively restores gradient signals and enforces sustained strategic exploration. Extensive experiments on multiple social language games demonstrate that DEPT outperforms strong baselines, avoiding policy degeneration and driving the continuous evolution of social language agents.
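The two perception signals named in the abstract, match entropy and the divergence between value baselines tracked at two time scales, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formulas, so the Shannon-entropy definition, the exponential moving averages, and every name and default below (`match_entropy`, `DualScaleBaseline`, `fast_rate`, `slow_rate`) are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def match_entropy(outcomes):
    """Shannon entropy (bits) of the empirical match-outcome distribution.

    Assumed definition: an entropy of 0 means every match ended the same
    way, i.e. the deterministic outcomes the abstract associates with impasse.
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    total = len(outcomes)
    counts = Counter(outcomes)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


class DualScaleBaseline:
    """Tracks a reward baseline at a fast and a slow time scale.

    Hypothetical construction: two exponential moving averages with
    different update rates; the paper may define its baselines differently.
    """

    def __init__(self, fast_rate=0.5, slow_rate=0.05):
        self.fast_rate = fast_rate
        self.slow_rate = slow_rate
        self.fast = None
        self.slow = None

    def update(self, reward):
        """Fold in one reward and return the absolute baseline divergence."""
        if self.fast is None:
            # First observation initializes both baselines identically.
            self.fast = self.slow = float(reward)
        else:
            self.fast += self.fast_rate * (reward - self.fast)
            self.slow += self.slow_rate * (reward - self.slow)
        return abs(self.fast - self.slow)


if __name__ == "__main__":
    print(match_entropy(["win"] * 8))          # homogenized outcomes
    print(match_entropy(["win", "loss"] * 4))  # balanced outcomes
    baseline = DualScaleBaseline()
    for r in [1.0, 1.0, 0.0, 1.0]:
        print(baseline.update(r))
```

How these two signals are combined into an impasse decision, and how the asymmetric advantage reshaping then acts on it, is not specified in the abstract, so the sketch stops at computing the signals.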

Minzheng Wang, Run Luo, Yanbo Wang, Zichen Liu, Yuqiao Tan, Tao Tan, Xu Nan, Yinhe Zheng, Wenji Mao • 2026

Related benchmarks

Task	Dataset	Metric	Result
Scientific Reasoning	GPQA D	Accuracy (%)	38.72	77
General Knowledge Reasoning	MMLU-Pro	Accuracy	57.6	64
Adversarial Game Playing	Don’t Say It	GPT-5.1 Performance	63.02	12
Adversarial Game Playing	Negotiation	GPT-5.1 Score	17.84	12
Adversarial Game Playing	Two Dollar	GPT-5.1 Score	40.62	12
Strategic Reasoning	HardCore Don'tSayIt OOD (held-out variant)	Win Rate	22.92	12
Strategic Reasoning	RandomValue Negotiation OOD (held-out variant)	Win Rate	17.08	12
Strategic Reasoning	VariableSum Dollar OOD (held-out variant)	Win Rate	30.47	12
