Abstract:
In many decision making scenarios the decision maker has access
only to limited information on the success/effect of his previous decisions.
For example, an advertiser can observe the success in sales of a certain
product, but can hardly infer on results if other strategies
were to be used. Another example is network routing, in which the trip time
along the chosen path can be measured, but little or no information is
available on other links in the network.
We study a more general framework of online linear optimization with
``bandit'' feedback. The existence of an efficient low-regret algorithm has
been an open question addressed in several recent papers. We present the first
known efficient algorithm with optimal regret bounds for bandit Online Linear
Optimization over arbitrary convex decision sets. We show how the
difficulties encountered by previous approaches are overcome by
exploiting the surprising exploration-exploitation properties of a self
concordant barrier function for the underlying decision set.
Joint work with Jacob Abernethy and Alexander Rakhlin @ UC Berkeley.