ICML-98 Submission #29

Title: Finite-time Regret Bounds for the Multiarmed Bandit Problem

Nicolo Cesa-Bianchi
DSI, University of Milan
via Comelico 39, 20315 Milano, Italy
cesabian@dsi.unimi.it

Paul Fischer
Lehrstuhl Informatik II
Universitaet Dortmund
D-44221 Dortmund, Germany
paulf@ls2.informatik.uni-dortmund.de

extended abstract Abstract:

    We show finite-time regret bounds for the multiarmed bandit problem under the assumption that all rewards come from a bounded and fixed range. Our regret bounds after any number $T$ of pulls are of the form $a + b\log T + c\log^2 T$ where $a$, $b$, and $c$ are positive constants not depending on $T$. In particular, these bounds are shown to hold for variants of the popular $\ve$-greedy and softmax allocation rules, and for a new simple deterministic allocation rule. Moreover, our results also apply to a variant of the bandit problem where reward distributions can depend, to some extent, from previous pulls and observed rewards. Finally, we discuss the empirical performance of our algorithms with respect to specific choices of the reward distribution.

Keywords: Bandit problems, Adaptive allocation rules, Finite horizon regret.