Agentic Task Evaluation on VitaBench In-store
Top result: 34.62 Avg@2 (Snapshot)
[Chart: Avg@2 and Pass@2; data point dated May 12, 2026]
Updated 21d ago
Evaluation Results
| Method        | Backbone      | Date    | Avg@2 | Pass@2 |
|---------------|---------------|---------|-------|--------|
| Snapshot      | Qwen3-30B-A3B | 2026.05 | 34.62 | 50     |
| PPO-EWMA      | Qwen3-30B-A3B | 2026.05 | 33.41 | 48     |
| Linear_prox   | Qwen3-30B-A3B | 2026.05 | 31.47 | 47     |
| Snapshot      | Qwen3-4B      | 2026.05 | 28.89 | 47     |
| PPO-EWMA      | Qwen3-4B      | 2026.05 | 25    | 50     |
| Linear_prox   | Qwen3-4B      | 2026.05 | 22.37 | 40     |
| Decoupled PPO | Qwen3-4B      | 2026.05 | 19.83 | 37     |
| Decoupled PPO | Qwen3-30B-A3B | 2026.05 | 18.28 | 32     |
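The page does not define Avg@2 or Pass@2. As a minimal sketch, assuming the usual reading (Avg@k is the mean success rate over k independent runs per task, and Pass@k is the share of tasks solved in at least one of the k runs), the two figures could be computed as follows. The function names, the boolean pass/fail outcome format, and the sample data are all hypothetical and are not taken from the benchmark's own tooling, which may score runs differently.

```python
from typing import Dict, List


def avg_at_k(results: Dict[str, List[bool]]) -> float:
    """Mean per-task success rate across all runs, as a percentage.

    Assumption: each task maps to a list of k pass/fail outcomes.
    """
    per_task = [sum(trials) / len(trials) for trials in results.values()]
    return 100.0 * sum(per_task) / len(per_task)


def pass_at_k(results: Dict[str, List[bool]]) -> float:
    """Percentage of tasks with at least one successful run."""
    solved = [any(trials) for trials in results.values()]
    return 100.0 * sum(solved) / len(solved)


# Hypothetical outcomes for four tasks with k = 2 runs each.
results = {
    "task_a": [True, False],
    "task_b": [False, False],
    "task_c": [True, True],
    "task_d": [False, True],
}

print(f"Avg@2  = {avg_at_k(results):.2f}")
print(f"Pass@2 = {pass_at_k(results):.2f}")
```

Under this reading Pass@k can never be lower than Avg@k, which matches every row reported above.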