Review Generation on Board Game Playtesting Dataset
[Chart: best Factuality 99.46 (GPT-5.1), Jan 12, 2026. Selectable metrics: Factuality, Diversity (Dist-2), Divergence (Div.)]
Evaluation Results
| Method | Date | Factuality | Diversity (Dist-2) | Divergence (Div.) |
|---|---|---|---|---|
| GPT-5.1 | 2026.01 | 99.46 | 0.6934 | 4.26 |
| Qwen3-235B | 2026.01 | 98.95 | 0.6572 | 3.56 |
| MeepleLM | 2026.01 | 98.86 | 0.7117 | 4.34 |
| Gemini3-Pro | 2026.01 | 98.28 | 0.648 | 3.98 |
| Qwen3-8B | 2026.01 | 97.88 | 0.5936 | 1.58 |
| MeepleLM (Ablation=w/o Persona) | 2026.01 | 92.13 | 0.6771 | 3.56 |
| MeepleLM (Ablation=w/o MDA) | 2026.01 | 91.56 | 0.685 | 3.7 |
| MeepleLM (Ablation=w/o Rulebook) | 2026.01 | 59.87 | 0.697 | 3.3 |