CoSER: A Comprehensive Literary Dataset and Framework for Training and Evaluating LLM Role-Playing and Persona Simulation
About
Role-playing language agents (RPLAs) have emerged as promising applications of large language models (LLMs). However, simulating established characters presents a challenging task for RPLAs, due to the lack of authentic character datasets and nuanced evaluation methods using such data. In this paper, we present CoSER, a collection of a high-quality dataset, open models, and an evaluation protocol towards effective RPLAs of established characters. The CoSER dataset covers 17,966 characters from 771 renowned books. It provides authentic dialogues with real-world intricacies, as well as diverse data types such as conversation setups, character experiences and internal thoughts. Drawing from acting methodology, we introduce given-circumstance acting for training and evaluating role-playing LLMs, where LLMs sequentially portray multiple characters in book scenes. Using our dataset, we develop CoSER 8B and CoSER 70B, i.e., advanced open role-playing LLMs built on LLaMA-3.1 models. Extensive experiments demonstrate the value of the CoSER dataset for RPLA training, evaluation and retrieval. Moreover, CoSER 70B exhibits state-of-the-art performance surpassing or matching GPT-4o on our evaluation and three existing benchmarks, i.e., achieving 75.80% and 93.47% accuracy on the InCharacter and LifeChoice benchmarks respectively.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Role-playing | Role-playing evaluation (Main characters) | ROUGE-L (Haruhi)83.19 | 12 | |
| Role-playing performance evaluation | Fandom (test) | Haruhi Adherence Score60.54 | 8 | |
| Role-playing performance evaluation | Bandori (test) | PoPiPa Score75.29 | 8 | |
| Role-playing performance | Role-playing Artifacts Minor characters 1.0 | Score (K-On!)82.34 | 7 | |
| Role-playing | Role-playing evaluation (Minor characters) | K-On! ROUGE-L20.19 | 5 |