Multi-step reasoning on GAIA (test)
[Chart: Level 1 Accuracy. Highlighted point: Skill-R1 (GRPO, Qwen3-4b), 31, May 10, 2026. Selectable metrics: Level 1 Accuracy, Level 2 Accuracy, Level 3 Accuracy, Overall Accuracy.]
Updated 22d ago
Evaluation Results

Method                          Configuration              Date     Level 1 Accuracy  Level 2 Accuracy  Level 3 Accuracy  Overall Accuracy
Skill-R1 (GRPO, Qwen3-4b)       Optimization Protocol=...  2026.05  31                28                10                69
Skill-R1 (Inference, Qwen3-4b)  Optimization Protocol=...  2026.05  22                21                8                 51
Vanilla GRPO (Qwen3-4b)         Optimization Protocol=...  2026.05  21                24                4                 49
GPT-4o-mini (no skills)         Optimization Protocol=...  2026.05  4                 6                 0                 10