Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

Optimal Regret for Single Index Bandits

About

We study the $\textit{single-index bandit}$ problem, where rewards depend on an unknown one-dimensional projection of high-dimensional contexts through an unknown reward function. This model extends linear and generalized linear bandits to a nonparametric setting, and is particularly relevant when the reward function is not known in advance. While optimal regret guarantees are known for monotone reward functions, the general non-monotone case remains poorly understood, with the best known bound being $\tilde{\mathcal{O}}(T^{3/4})$ (under standard boundedness and Lipschitz assumptions on the reward function [Kang et al., 2025]). We close this gap by establishing the optimal regret for general single-index bandits. We propose a simple two-phase algorithm, namely, Zoomed Single Index Bandit with Upper Confidence Bound ($\texttt{ZoomSIB-UCB}$), that first estimates the projection direction via a normalized Stein estimator, and then reduces the problem to a one-dimensional bandit using discretization and finally use UCB. This approach achieves a regret of $\tilde{\mathcal{O}}(T^{2/3})$, and improves significantly upon prior work without any additional assumptions. We also prove a matching minimax lower bound of $\tilde{\Omega}(T^{2/3})$, showing that the upper bound is essentially tight. Our upper and lower bounds together provide a sharp characterization of the regret in single-index bandits. Moreover, the empirical results further demonstrate the effectiveness and robustness of our approach.

Devdan Dey, Sujoy Bhore, Avishek Ghosh• 2026

Related benchmarks

TaskDatasetResultRank
Cumulative regret minimizationQuadratic Link Function d=10 (30 trials)
Cumulative Regret2.09e+3
15
Cumulative regret minimizationAsymmetric Link Function d=10 (30 trials)
Cumulative Regret547
15
Cumulative regret minimizationZigzag Link Function d=10 (30 trials)
Cumulative Regret1.21e+3
15
Single Index BanditQuadratic Function T=40,000
Cumulative Regret4.07e+3
15
Single Index BanditAsymmetric Function T=40,000
Cumulative Regret3.54e+3
15
Single Index BanditZigzag Function T=40,000
Cumulative Regret1.53e+3
15
Cumulative regret minimizationf(z) = sin(2z)
Final Cumulative Regret1.01e+4
5
Cumulative regret minimizationf(z) = sin(2z) - 0.5z^2
Final Cumulative Regret1.64e+4
5
Showing 8 of 8 rows

Other info

Follow for update