Reinforcing Human Behavior Simulation via Verbal Feedback

About

Humans learn social norms and behaviors from verbal feedback (e.g., a parent saying "that was rude" or a friend explaining "here's why that hurt"). Yet, learning from feedback for LLMs has largely focused on domains like code and math, where RL rewards are directly verifiable and condensed into scalar values. As LLMs are increasingly used to simulate human behavior, e.g., standing in for users, patients, students, and other personas, there is a pressing need to make them more human-like, which requires embracing a fundamentally different kind of signal: feedback that is verbal, subjective, and multi-faceted. We present DITTO, a model trained by treating verbal feedback as a first-class signal in reinforcement learning. After each rollout, DITTO receives verbal feedback and generates a feedback-conditioned improved rollout; both outputs are jointly optimized with GRPO, distilling verbal guidance into the base policy without requiring feedback at test time. We also introduce SOUL (Simulation gym Of hUman-Like behavior), a unified benchmark and training data suite spanning 10 tasks across six categories: Theory of Mind, character role play, social skill, learner simulation, user simulation, and persona simulation. DITTO achieves an average 36% improvement over the base model and exceeds GPT-5.4 on 6 of 10 SOUL benchmarks, demonstrating that RL with verbal feedback is a promising direction for training LLMs to simulate human behavior.
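The training loop described above can be illustrated with a small sketch. This is not the authors' implementation: the names `build_joint_group`, `toy_policy`, `toy_critic`, and `toy_reward` are hypothetical, the reward is a stand-in, and the choice to credit the revised rollout to the original feedback-free prompt is an assumption about how "distilling into the base policy" might be realized. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO advantage.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List


@dataclass
class Sample:
    context: str      # context the policy would be trained on (assumption: no feedback in it)
    response: str
    reward: float
    advantage: float = 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: normalize each reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def build_joint_group(
    prompt: str,
    policy: Callable[[str], str],
    critic: Callable[[str, str], str],
    reward_fn: Callable[[str, str], float],
    n_rollouts: int = 2,
) -> List[Sample]:
    """Collect first-attempt and feedback-conditioned rollouts into one group."""
    group: List[Sample] = []
    for _ in range(n_rollouts):
        first = policy(prompt)                        # initial rollout
        feedback = critic(prompt, first)              # verbal feedback on it
        revised = policy(f"{prompt}\n[feedback] {feedback}")  # improved rollout
        # Assumption: both samples are stored against the original prompt,
        # so the policy update never depends on feedback being present.
        group.append(Sample(prompt, first, reward_fn(prompt, first)))
        group.append(Sample(prompt, revised, reward_fn(prompt, revised)))
    advantages = group_relative_advantages([s.reward for s in group])
    for sample, adv in zip(group, advantages):
        sample.advantage = adv
    return group


# Toy stand-ins (hypothetical) so the sketch runs without any model.
def toy_policy(context: str) -> str:
    return "polite reply" if "[feedback]" in context else "rude reply"


def toy_critic(prompt: str, response: str) -> str:
    return "that was rude" if "rude" in response else "that was fine"


def toy_reward(prompt: str, response: str) -> float:
    return 1.0 if "polite" in response else 0.0


if __name__ == "__main__":
    for s in build_joint_group("Decline a friend's invitation.",
                               toy_policy, toy_critic, toy_reward):
        print(f"{s.response!r}: reward={s.reward}, advantage={s.advantage:.3f}")
```

In this toy setup the revised rollouts receive positive advantages and the first attempts negative ones, which is the direction of pressure the abstract describes. A real system would replace the stubs with a language model, a feedback source, and a task-specific reward, and would feed the advantages into a policy-gradient update.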

Weiwei Sun, Xuhui Zhou, Jiarui Liu, Weihua Du, Haojia Sun, Yiqing Xie, Qianou Ma, Sihao Chen, Mengting Wan, Longqi Yang, Pei Zhou, Sherry Wu, Sean Welleck, Graham Neubig, Yiming Yang, Maarten Sap • 2026

Related benchmarks

Task | Dataset | Result | Rank
Theory of Mind | HiToM | Accuracy 78 | 64
Theory of Mind | ToMi | Accuracy 93 | 55
Role-Play Evaluation | CoSER | -- | 17
Theory of Mind | FanToM | Accuracy 95 | 14
Human behavior simulation | SOUL (Social Understanding and Learning) (test) | FanToM 95 | 9
Social Skill | Sotopia | Primary Metric 47 | 6
User Simulation | MirrorBench | Realism Score (LLM-judge) 0.713 | 6
User Simulation | UserLLM | Primary Metric Score 93 | 6
Persona Simulation | TwinVoice | Accuracy 61 | 6
Role-Play | LifeChoices | Primary Score 80 | 6
Showing 10 of 11 rows
