Role-playing

Benchmarks

Dataset Name	SOTA Method	Metric	Value	Count	Last Updated
Alpaca-P	Llama-3.1-8B-Instruct	LLM-as-a-Judge Score	86.66	91	1mo ago
CharacterBench	CRPO	Overall Average Score	4.043	70	29d ago
CharacterBench latest (full)	CRPO	Overall Score	4.525	47	2mo ago
RoleBench (test)	Llama-3.1-8B-Instruct	LLM-as-a-Judge Score	88.82	42	1mo ago
RoleLLM		GPT-4o Win Rate	94.46	28	4mo ago
CharacterBench 1.0 (test)	Qwen2.5-7B-Instruct + Character-R1	MC	4.444	28	4mo ago
RPGBench Aggregate (Overall)	CoRL	Avg Score	-0.026	18	4mo ago
RPGBench Dialogue Shift (Generalization)		Turn Composition	-0.956	18	4mo ago
RPGBench Character Shift (Generalization)		Deviation Score (Literature)	-0.8	18	4mo ago
RPGBench User Shift Generalization	CoRL	RP Score (German)	-0.016	18	4mo ago
RPGBench In-distribution	RL	R-EMI	-0.034	18	4mo ago
Role-playing evaluation (Main characters)	Codified FSM	ROUGE-L (Haruhi)	83.88	12	4mo ago
DynSess-Eval Human Evaluation 1.0 (test)	Doubao-1.5-pro-character	Average Score	3.38	10	1mo ago
RoleMRC	MOA (GRPO)	KR	0.67	7	4mo ago
PersonaGym		Engagement (EA)	4.98	7	4mo ago
Alpaca-P 50P x 100Q scale (test)	Persona-Pruner	Alpaca-P Score	81.97	5	1mo ago
Alpaca-P CoSER evaluation framework (test 20 randomly sampled instances)	SliceGPT	Anthropomorphism	4.38	5	1mo ago
Role-playing evaluation (Minor characters)	Codified Profile	K-On! ROUGE-L	21.21	5	4mo ago
Alpaca-P and Character Fidelity CoSER criterion (test)		Alpaca-P Score	83.35	4	1mo ago
CharacterEval unseen roles transfer setting	Baichuan2-7B + PCL	CC	2.749	4	4mo ago
Role-playing human assessment set (test)	Baichuan2-7B+PCL	Win Count	602	4	4mo ago
RoleBench Chinese (instruction generalization)	RoleGLM	Win Rate (vs GPT-4)	36.4	4	4mo ago
