ICML-98 Submission #29
Title: Finite-time Regret Bounds for the Multiarmed Bandit Problem
Nicolo Cesa-Bianchi
DSI, University of Milan
via Comelico 39, 20315 Milano, Italy
cesabian@dsi.unimi.it
Paul Fischer
Lehrstuhl Informatik II
Universitaet Dortmund
D-44221 Dortmund, Germany
paulf@ls2.informatik.uni-dortmund.de
extended abstract
Abstract:
We show finite-time regret bounds for the multiarmed bandit problem under
the assumption that all rewards come from a bounded and fixed range. Our
regret bounds after any number $T$ of pulls are of the form
$a + b\log T + c\log^2 T$ where $a$, $b$, and $c$ are positive constants
not depending on $T$. In particular, these bounds are shown to hold
for variants of the popular $\ve$-greedy and softmax allocation rules,
and for a new simple deterministic allocation rule. Moreover, our results
also apply to a variant of the bandit problem where reward distributions
can depend, to some extent, from previous pulls and observed rewards.
Finally, we discuss the empirical performance of our algorithms with respect
to specific choices of the reward distribution.
Keywords: Bandit problems, Adaptive allocation rules, Finite horizon regret.