Practical and Optimal Algorithm for Linear Contextual Bandits with Rare Parameter Updates
About
We study linear contextual bandits under rare parameter updates: the learner may incorporate reward feedback into its parameter estimate only at a small number of update times, while still observing contexts online and selecting actions sequentially. This viewpoint clarifies a practical distinction that is often blurred in the literature: many "strictly batched" methods additionally restrict within-interval context adaptivity, meaning that the action rule inside an interval cannot depend on the sequence of realized contexts/actions in that interval (beyond the current round's context). For linear contextual bandits, we propose two practical algorithms with only $O(\log\log T)$ parameter updates. Our first algorithm BLCE-G attains minimax-optimal regret (up to polylogarithmic factors in $T$) simultaneously in both the small-$K$ and large-$K$ regimes under a static schedule. Our second algorithm BLCE removes the near G-optimal design step -- a dominant computational bottleneck in prior strictly batched static-grid methods -- yet preserves minimax-optimal regret and achieves the lowest known runtime complexity among optimal algorithms. We further extend these rare-update and computational principles to generalized linear contextual bandits. Overall, our results yield statistically optimal algorithms under $O(\log\log T)$ parameter updates that are also computationally efficient in practice.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Linear Contextual Bandits | Synthetic Linear Bandit K=5000, d=10 | Average Runtime (s)6.19 | 12 | |
| Linear Contextual Bandits | Synthetic Linear Bandit K=50, d=20 | Average Runtime (s)1.06 | 6 | |
| Linear Contextual Bandits | Synthetic Linear Bandit (K=100, d=30) | Average Runtime (s)1.62 | 6 | |
| Linear Contextual Bandits | Normal Contexts K=50, d=20 synthetic (test) | Average Runtime (s)1.05 | 6 | |
| Linear Contextual Bandits | Normal Contexts K=100, d=30 synthetic (test) | Average Runtime (s)1.41 | 6 | |
| Linear Contextual Bandits | Synthetic Linear Bandit K=1000, d=5 | Average Runtime (s)5.91 | 6 | |
| Linear Contextual Bandits | Normal Contexts K=1000, d=5 synthetic (test) | Average Runtime (s)2.17 | 6 | |
| Linear Contextual Bandit | Uniform distribution | Average Runtime (s)3.62 | 4 | |
| Linear Contextual Bandit | Normal distribution | Average Runtime (seconds)3.88 | 4 | |
| Regret Minimization | Generalized Linear Contextual Bandits | Worst-Case Regret2 | 2 |