Role-playing Agent Evaluation on PersonaGym
[Chart: Action Justification score by model; highest shown is GPT-4.1 at 4.13 (May 16, 2026). Metrics: Action Justification, Expected Action Score, Linguistic Habits Score, Persona Consistency, Toxicity Control Score, Persona Score.]
Updated 15d ago
Evaluation Results
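| Method       | Details                   | Date    | Action Justification | Expected Action Score | Linguistic Habits Score | Persona Consistency | Toxicity Control Score | Persona Score |
|--------------|---------------------------|---------|----------------------|-----------------------|-------------------------|---------------------|------------------------|---------------|
| GPT-4.1      |                           | 2026.05 | 4.13                 | 4.13                  | 4                       | 4.25                | 4.88                   | 4.28          |
| DPO-Qwen3-8B | training=Direct Prefer... | 2026.05 | 3.88                 | 3.63                  | 3.75                    | 4.25                | 4.92                   | 4.09          |
| SFT-Qwen3-8B | training=Supervised Fi... | 2026.05 | 3.5                  | 3.63                  | 3.5                     | 3.88                | 4.93                   | 3.88          |
| Qwen3-8B     | status=base model         | 2026.05 | 3.38                 | 3.13                  | 3.13                    | 3.75                | 4.91                   | 3.66          |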
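The reported Persona Score values look consistent with an unweighted mean of the five task scores in each row. The page does not state this; it is an observation from the listed numbers, and the sketch below only checks it. The names `ROWS`, `mean_of_tasks`, and `check_rows`, and the tolerance of 0.015 (chosen to absorb rounding in the displayed two-decimal values), are assumptions introduced for illustration.

```python
# Scores copied from the results listing, in this order:
# Action Justification, Expected Action, Linguistic Habits,
# Persona Consistency, Toxicity Control; then the reported Persona Score.
ROWS = {
    "GPT-4.1":      ([4.13, 4.13, 4.00, 4.25, 4.88], 4.28),
    "DPO-Qwen3-8B": ([3.88, 3.63, 3.75, 4.25, 4.92], 4.09),
    "SFT-Qwen3-8B": ([3.50, 3.63, 3.50, 3.88, 4.93], 3.88),
    "Qwen3-8B":     ([3.38, 3.13, 3.13, 3.75, 4.91], 3.66),
}


def mean_of_tasks(scores):
    """Unweighted mean of the per-task scores."""
    return sum(scores) / len(scores)


def check_rows(rows, tol=0.015):
    """Compare each row's task mean with its reported Persona Score.

    Returns {model: (task_mean, reported, within_tolerance)}.
    The tolerance is an assumption: the displayed inputs are already
    rounded, so the recomputed mean can drift by about 0.01.
    """
    result = {}
    for model, (scores, reported) in rows.items():
        mean = mean_of_tasks(scores)
        result[model] = (mean, reported, abs(mean - reported) <= tol)
    return result


if __name__ == "__main__":
    for model, (mean, reported, ok) in check_rows(ROWS).items():
        print(f"{model:14s} mean={mean:.3f} reported={reported:.2f} match={ok}")
```

Every row agrees within the tolerance, so reading Persona Score as a simple average of the five task scores is a reasonable working assumption when comparing models here.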