Multi-Armed Bandit (MAB) Horizon Generalization T=100
[Chart: Average Regret over time. Highlighted result: Iterative RMFT, Average Regret 22.37 (Nov 6, 2025).]
Updated 2d ago
Evaluation Results
Method          Configuration                Date      Average Regret
Iterative RMFT  Backbone=Qwen3-8B, Tea...    2025.11   22.37
AD_2IML         Backbone=Qwen3-8B, Tea...    2025.11   32.97
AD_IML          Backbone=Qwen3-8B, Tea...    2025.11   33.44
GRPO_step       Backbone=Qwen3-8B, Tea...    2025.11   35.88
GRPO_regret     Backbone=Qwen3-8B, Tea...    2025.11   40.09
Base            Backbone=Qwen3-8B, Tea...    2025.11   52.1
AD_IMk          Backbone=Qwen3-8B, Tea...    2025.11   58.72