Held-out Skill Evaluation on Multi-family n=225 (holdout)
[Figure: Solve Rate chart. Highlighted result: GRPO from SFT init, Solve Rate 43.5556, May 27, 2026]
Updated 5d ago
Evaluation Results
Method               Details                    Date     Solve Rate
GRPO from SFT init   batches=40, initializa...  2026.05  43.5556
GRPO from base init  batches=40, initializa...  2026.05  27.5556
Base Qwen3-8B        -                          2026.05  11.5556
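
As a sanity check on the figures above, the short sketch below converts each reported Solve Rate back into a count of solved tasks. It assumes the Solve Rate is a percentage over the n=225 held-out tasks named in the title; the page does not state this explicitly, and the names `N_TASKS`, `reported`, and `implied_solved` are illustrative only.

```python
# Assumption: Solve Rate is a percentage of the 225 held-out tasks (n=225 in the title).
N_TASKS = 225

# Solve Rates as reported in the results table.
reported = {
    "GRPO from SFT init": 43.5556,
    "GRPO from base init": 27.5556,
    "Base Qwen3-8B": 11.5556,
}


def implied_solved(rate_percent: float, n: int = N_TASKS) -> int:
    """Return the nearest whole number of solved tasks for a percentage rate."""
    return round(rate_percent / 100 * n)


counts = {method: implied_solved(rate) for method, rate in reported.items()}

for method, solved in counts.items():
    # Recompute the percentage from the integer count to compare with the table.
    print(f"{method}: {solved}/{N_TASKS} = {100 * solved / N_TASKS:.4f}%")
```

Under that assumption the reported rates correspond to 98, 62, and 26 solved tasks out of 225, and recomputing the percentages from those counts reproduces the table values to four decimal places.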