On Kernelized Multi-armed Bandits
About
We consider the stochastic bandit problem over a continuous set of arms, where the expected reward function over the arms is fixed but unknown. We provide two new Gaussian-process-based algorithms for continuous bandit optimization: Improved GP-UCB (IGP-UCB) and GP-Thompson Sampling (GP-TS), and derive corresponding regret bounds. Specifically, the bounds hold when the expected reward function belongs to the reproducing kernel Hilbert space (RKHS) that naturally corresponds to the Gaussian process kernel used as input by the algorithms. Along the way, we derive a new self-normalized concentration inequality for vector-valued martingales of arbitrary, possibly infinite, dimension. Finally, experimental evaluation and comparisons to existing algorithms on synthetic and real-world environments highlight the favorable performance of the proposed strategies in many cases.
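To make the setup concrete, here is a minimal sketch of a GP-UCB-style bandit loop: maintain a Gaussian process posterior over the unknown reward function and, at each round, pull the arm maximizing an upper confidence bound (posterior mean plus a width term times the posterior standard deviation). This is not the authors' implementation; the RBF kernel, the noise level, the simplified confidence-width schedule `beta`, and the test function are all illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.2):
    """Squared-exponential kernel between arm sets A (n,d) and B (m,d)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * lengthscale ** 2))

def gp_posterior(X, y, Xstar, noise=0.1):
    """GP posterior mean and standard deviation at candidate arms Xstar."""
    K = rbf_kernel(X, X) + noise ** 2 * np.eye(len(X))
    Ks = rbf_kernel(Xstar, X)
    Kinv = np.linalg.inv(K)
    mu = Ks @ Kinv @ y
    # Prior variance of the RBF kernel is 1 at every arm.
    var = 1.0 - np.einsum('ij,jk,ik->i', Ks, Kinv, Ks)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def gp_ucb(f, arms, T=50, noise=0.1, seed=0):
    """Run T rounds, each pulling the arm maximizing mean + sqrt(beta) * std."""
    rng = np.random.default_rng(seed)
    X = [arms[rng.integers(len(arms))]]          # random initial pull
    y = [f(X[0]) + noise * rng.standard_normal()]
    for t in range(1, T):
        mu, sd = gp_posterior(np.array(X), np.array(y), arms, noise)
        beta = 2.0 * np.log((t + 1) ** 2)        # simplified width schedule
        x = arms[np.argmax(mu + np.sqrt(beta) * sd)]
        X.append(x)
        y.append(f(x) + noise * rng.standard_normal())
    return np.array(X)

arms = np.linspace(0, 1, 101).reshape(-1, 1)     # discretized continuous arm set
f = lambda x: float(np.sin(3 * x[0]) * np.exp(-x[0]))  # unknown reward function
picks = gp_ucb(f, arms, T=60)
```

In the paper's setting the width term is chosen so that, when the reward function lies in the kernel's RKHS, the confidence band holds uniformly over rounds; the logarithmic schedule above is only a stand-in for that choice.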
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Understanding | MMLU | Accuracy | 63.9 | 756 |
| Question Answering | CommonsenseQA | Accuracy | 81.1 | 143 |
| Question Answering | TruthfulQA | Accuracy | 76.1 | 73 |
| Question Answering | ARC (test) | Accuracy | 63.4 | 67 |
| Question Answering | TriviaQA Gen (test) | Accuracy | 74.7 | 31 |
| Bayesian Optimization | 50 optimization problems: COCO, BoTorch, Bayesmark (aggregated) | Mean RP | 2.07 | 26 |
| Bayesian Optimization | Rosenbrock-NS synthetic (test) | Computation Time (s) | 2.69e+3 | 5 |
| Bayesian Optimization | HPOBench | Computation Time (s) | 3.34e+3 | 5 |
| Bayesian Optimization | Ackley-HT synthetic (test) | Computation Time (s) | 2.77e+3 | 5 |
| Bayesian Optimization | Ackley-NS synthetic (test) | Computation Time (s) | 2.80e+3 | 5 |