Chess Reasoning Quality Evaluation on Lichess Puzzle Database (held-out positions)
[Chart: WR MAE by method. Top entry: VPS, WR MAE 0.218 (Apr 3, 2026). Selectable metrics: WR MAE, PV Overlap, Consistency.]
Updated 20d ago
Evaluation Results
| Method | Backbone | Date | WR MAE | PV Overlap | Consistency |
| --- | --- | --- | --- | --- | --- |
| VPS | Qwen3-8B | 2026.04 | 0.218 | 37.6 | 97.8 |
| VPS | DeepSeek-R1-D... | 2026.04 | 0.286 | 36.2 | 98.5 |
| SFT only | Qwen3-8B | 2026.04 | 0.31 | 32.3 | 96.1 |
| SFT only | DeepSeek-R1-D... | 2026.04 | 0.346 | 34.1 | 97.8 |
| SFT + GRPO | Qwen3-8B | 2026.04 | 0.452 | 18.3 | 72.1 |
| SFT + GRPO | DeepSeek-R1-D... | 2026.04 | 0.732 | 33 | 31.2 |
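The page does not define WR MAE. As a rough illustration only, the sketch below assumes it is the mean absolute error between a model's predicted win rate and a reference win rate for each position, so lower values are better. The function name and the sample values are hypothetical and are not taken from the benchmark.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Average of |predicted - reference| over paired values."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        raise ValueError("at least one value is required")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Hypothetical win rates in [0, 1] for three positions (assumed data, not benchmark output).
predicted_win_rates = [0.50, 0.75, 0.25]
reference_win_rates = [0.75, 0.75, 0.50]

wr_mae = mean_absolute_error(predicted_win_rates, reference_win_rates)
print(f"WR MAE (assumed definition): {wr_mae:.3f}")
```

Under this assumed definition, a score of 0.218 would mean the predicted win rate is off by about 0.218 on average across the held-out positions.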