
LLMs Are In-Context Bandit Reinforcement Learners

About

Large Language Models (LLMs) excel at in-context learning (ICL), a supervised learning technique that relies on adding annotated examples to the model context. We investigate a contextual bandit version of in-context reinforcement learning (ICRL), where models learn in-context, online, from external reward rather than supervised data. We show that LLMs effectively demonstrate such learning, and provide a detailed study of the phenomenon, experimenting with challenging classification tasks and models ranging from 500M to 70B parameters. This includes identifying and addressing the instability of the process, demonstrating learning with both semantic and abstract labels, and showing scaling trends. Our findings highlight ICRL capabilities in LLMs, while also underscoring fundamental limitations in their implicit reasoning about errors.
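The ICRL setting described above can be sketched as a simple online loop: the model predicts a label, receives a scalar reward instead of a gold annotation, and past episodes are appended to its context. Below is a minimal, hypothetical sketch of that loop; `query_llm` is a stand-in stub (not the paper's implementation), and the stochastic filtering of zero-reward episodes is an assumed simplification of how instability might be addressed.

```python
import random

LABELS = ["A", "B"]

def query_llm(prompt: str) -> str:
    # Stub: a real implementation would call an LLM with `prompt`
    # and parse a predicted label from its completion.
    return random.choice(LABELS)

def icrl_run(stream, keep_negative_prob=0.1):
    """One online ICRL pass over a stream of (input, gold_label) pairs.

    The context grows with past (input, action, reward) episodes.
    Keeping zero-reward episodes only with small probability is an
    assumed stabilization heuristic, not the paper's exact method.
    """
    context = ""
    rewards = []
    for x, gold in stream:
        prompt = context + f"Input: {x}\nLabel:"
        action = query_llm(prompt)
        reward = 1 if action == gold else 0  # external bandit feedback
        rewards.append(reward)
        if reward == 1 or random.random() < keep_negative_prob:
            context += f"Input: {x}\nLabel: {action}\nReward: {reward}\n\n"
    return rewards

if __name__ == "__main__":
    data = [(f"example-{i}", random.choice(LABELS)) for i in range(20)]
    rs = icrl_run(data)
    print(f"{sum(rs)}/{len(rs)} correct")
```

Note the contrast with standard ICL: no gold labels ever enter the context, only the model's own actions and the rewards they earned.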

Giovanni Monea, Antoine Bosselut, Kianté Brantley, Yoav Artzi • 2024

Related benchmarks

Task                  Dataset       Result (Cumulative Regret)  Rank
Bandit Optimization   fnum nonlin   302.7                       8
Bandit Optimization   fLLM          36                          8
Bandit Optimization   nonlin2       43.7                        8
Bandit Optimization   fextract      11.5                        8
Bandit Optimization   fnum lin      190                         8
Bandit Optimization   nonlin1       346.6                       8
