Multi-Armed Bandit (MAB) Horizon Generalization T=100

Average Regret: 22.37

Iterative RMFT

Updated 1mo ago

Evaluation Results

Method	Date	Average Regret
Iterative RMFT	2025.11	22.37
AD_2IML	2025.11	32.97
AD_IML	2025.11	33.44
GRPO_step	2025.11	35.88
GRPO_regret	2025.11	40.09
Base	2025.11	52.1
AD_IMk	2025.11	58.72