RoleBench

Benchmarks

Task Name	Dataset Name	SOTA Result
Role-playing evaluation	RoleBench	LLM-as-a-Judge Score: 85.67	44
Role-Playing	RoleBench (test)	LLM-as-a-Judge Score: 88.82	42
Role Fidelity	RoleBench (test)	RAW Score: 36.4	10
Instruction Generalization	RoleBench Instruction Generalization	CUS Score: 57.6	10
Role Generalization	RoleBench English 1.0 (Role Generalization)	CUS Score: 60.2	7
Instruction Generalization	RoleBench Chinese instruction generalization 1.0	ROUGE-L (CUS): 53.7	7
Instruction Generalization	RoleBench instruction generalization	GPT-4 Win Rate: 55.8	5
Role-playing	RoleBench Chinese (instruction generalization)	Win Rate (vs GPT-4): 36.4	4
Role-playing Instruction Following	RoleBench English Role Generalization	Win Rate (GPT-4): 64.5	4
