Optimal Regret for Single Index Bandits

About

We study the $\textit{single-index bandit}$ problem, where rewards depend on an unknown one-dimensional projection of high-dimensional contexts through an unknown reward function. This model extends linear and generalized linear bandits to a nonparametric setting, and is particularly relevant when the reward function is not known in advance. While optimal regret guarantees are known for monotone reward functions, the general non-monotone case remains poorly understood, with the best known bound being $\tilde{\mathcal{O}}(T^{3/4})$ (under standard boundedness and Lipschitz assumptions on the reward function [Kang et al., 2025]). We close this gap by establishing the optimal regret for general single-index bandits. We propose a simple two-phase algorithm, namely, Zoomed Single Index Bandit with Upper Confidence Bound ($\texttt{ZoomSIB-UCB}$), that first estimates the projection direction via a normalized Stein estimator, and then reduces the problem to a one-dimensional bandit using discretization and finally use UCB. This approach achieves a regret of $\tilde{\mathcal{O}}(T^{2/3})$, and improves significantly upon prior work without any additional assumptions. We also prove a matching minimax lower bound of $\tilde{\Omega}(T^{2/3})$, showing that the upper bound is essentially tight. Our upper and lower bounds together provide a sharp characterization of the regret in single-index bandits. Moreover, the empirical results further demonstrate the effectiveness and robustness of our approach.

Devdan Dey, Sujoy Bhore, Avishek Ghosh• 2026

Related benchmarks

Task	Dataset	Result
Cumulative regret minimization	Quadratic Link Function d=10 (30 trials)	Cumulative Regret2.09e+3	15
Cumulative regret minimization	Asymmetric Link Function d=10 (30 trials)	Cumulative Regret547	15
Cumulative regret minimization	Zigzag Link Function d=10 (30 trials)	Cumulative Regret1.21e+3	15
Single Index Bandit	Quadratic Function T=40,000	Cumulative Regret4.07e+3	15
Single Index Bandit	Asymmetric Function T=40,000	Cumulative Regret3.54e+3	15
Single Index Bandit	Zigzag Function T=40,000	Cumulative Regret1.53e+3	15
Cumulative regret minimization	f(z) = sin(2z)	Final Cumulative Regret1.01e+4	5
Cumulative regret minimization	f(z) = sin(2z) - 0.5z^2	Final Cumulative Regret1.64e+4	5

Showing 8 of 8 rows

Other info

Follow for update

@wizwand_team Discord