HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing
About
LLM role-playing, i.e., using LLMs to simulate specific personas, has emerged as a key capability in various applications, such as companionship, content creation and digital games. While current models effectively capture character tones and knowledge, simulating the inner thoughts behind their behaviors remains a challenge. Towards cognitive simulation in LLM role-play, previous efforts mainly suffer from two deficiencies: lacking data with high-quality reasoning traces, and lacking reliable reward signals aligned with human preferences. In this paper, we propose HER, a unified framework for cognitive-level persona simulation. HER introduces dual-layer thinking, which distinguishes characters' first-person thinking from LLMs' third-person thinking. To bridge these gaps, we curate reasoning-augmented role-playing data via reverse engineering, and construct human-aligned principles and reward models. Leveraging these resources, we train HER models based on Qwen3-32B via supervised and reinforcement learning. Extensive experiments validate the effectiveness of our approach. Notably, our models significantly outperform the Qwen3-32B baseline, achieving a 30.26 improvement on the CoSER benchmark and a 14.97% gain on the Minimax Role-Play Bench. Our datasets, principles, and models are released to facilitate future research.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Theory of Mind | HiToM | Accuracy | 56 | 64 |
| Role-Play | Minimax Role-Play Bench Full 17-Model Leaderboard 1.0 | Overall Score | 84.65 | 17 |
| Role-Play Evaluation | Minimax Role-Play Bench | Average Score | 84.65 | 17 |
| Role-Play Evaluation | CoSER | Avg Score | 57.3 | 17 |
| Role-playing | DynSess-Eval Human Evaluation 1.0 (test) | Average Score | 2.99 | 10 |
| Human behavior simulation | SOUL (Social Understanding and Learning) (test) | FanToM | 55 | 9 |
| Role-playing evaluation | DynSess-Eval | Average Performance (Auto) | 3.23 | 8 |
| Role-Play | LifeChoices | Primary Score | 75 | 6 |
| User Simulation | UserLLM | Primary Metric Score | 53.7 | 6 |
| Persona Simulation | TwinVoice | Accuracy | 39 | 6 |