RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models
About
The advent of Large Language Models (LLMs) has paved the way for complex tasks such as role-playing, which enhances user interactions by enabling models to imitate various characters. However, the closed-source nature of state-of-the-art LLMs and their general-purpose training limit role-playing optimization. In this paper, we introduce RoleLLM, a framework to benchmark, elicit, and enhance role-playing abilities in LLMs. RoleLLM comprises four stages: (1) Role Profile Construction for 100 roles; (2) Context-Based Instruction Generation (Context-Instruct) for role-specific knowledge extraction; (3) Role Prompting using GPT (RoleGPT) for speaking style imitation; and (4) Role-Conditioned Instruction Tuning (RoCIT) for fine-tuning open-source models along with role customization. By Context-Instruct and RoleGPT, we create RoleBench, the first systematic and fine-grained character-level benchmark dataset for role-playing with 168,093 samples. Moreover, RoCIT on RoleBench yields RoleLLaMA (English) and RoleGLM (Chinese), significantly enhancing role-playing abilities and even achieving comparable results with RoleGPT (using GPT-4).
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Role-playing | RPGBench Character Shift (Generalization) | Deviation Score (Literature)-0.579 | 18 | |
| Role-playing | RPGBench Dialogue Shift (Generalization) | Turn Composition-0.415 | 18 | |
| Role-playing | RPGBench In-distribution | R-EMI-0.065 | 18 | |
| Role-playing | RPGBench User Shift Generalization | RP Score (German)-0.221 | 18 | |
| Role-playing | RPGBench Aggregate (Overall) | Avg Score-0.265 | 18 | |
| Instruction Generalization | RoleBench instruction generalization | CUS Score57.6 | 10 | |
| Instruction Generalization | RoleBench Chinese instruction generalization 1.0 | ROUGE-L (CUS)53.7 | 7 | |
| Role Generalization | RoleBench English 1.0 (Role Generalization) | CUS Score60.2 | 7 | |
| Instruction Generalization | RoleBench instruction generalization | GPT-4 Win Rate55.8 | 5 | |
| Role-playing | RoleBench Chinese (instruction generalization) | Win Rate (vs GPT-4)36.4 | 4 |