Role-playing Dialogue Evaluation

Benchmarks

Dataset Name	SOTA Method	Metric	Score	Results	Last Updated
FURINA-Bench English	(not listed)	Context Reliance	42.99	15	3mo ago
FURINA-Bench Chinese	Qwen3-32B	Context Reliance	71.39	12	3mo ago
RoleChat gold-standard evaluation set	GPT-4.1	Logical Coherence	96.6	10	3mo ago
