Snir Cohen, M.Sc. seminar lecture
Wednesday, 25.10.2017, 14:00
Optimistic methods for solving Reinforcement Learning problems are very popular in the literature. In practice, however, these methods show inferior performance compared to other methods, such as Posterior Sampling.
We propose a novel concept, Restricted Optimism, to balance the well-known exploration vs. exploitation trade-off in finite-horizon MDPs. We harness Posterior Sampling to construct two algorithms in the spirit of this principle. We provide theoretical guarantees for both and demonstrate experimentally that there is a trade-off between the average cumulative regret suffered by the agent and its variance. The agent can influence this trade-off by tuning the level of optimism of the proposed algorithms through a regularization parameter.
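For readers unfamiliar with the Posterior Sampling baseline mentioned above, the following is a minimal sketch of plain posterior sampling on a tabular finite-horizon MDP. It is not the Restricted Optimism algorithms of the talk, whose details are not given in this abstract. The sketch assumes known mean rewards and a Dirichlet posterior over transitions; all function and variable names (`sample_dirichlet`, `plan_finite_horizon`, `psrl_episode`, `counts`) are hypothetical and chosen for illustration only.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one probability vector from Dirichlet(alphas) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def plan_finite_horizon(P, R, horizon):
    """Backward induction on a given model.

    P[s][a] is a next-state distribution, R[s][a] a mean reward (assumed known).
    Returns (policy, value): policy[h][s] is the greedy action at step h,
    value[s] is the optimal expected return from state s at step 0.
    """
    n_states, n_actions = len(R), len(R[0])
    value = [0.0] * n_states  # value after the final step
    policy = []
    for _ in range(horizon):  # iterates from the last step back to the first
        q = [[R[s][a] + sum(p * v for p, v in zip(P[s][a], value))
              for a in range(n_actions)]
             for s in range(n_states)]
        policy.append([max(range(n_actions), key=q[s].__getitem__)
                       for s in range(n_states)])
        value = [max(row) for row in q]
    policy.reverse()  # so that policy[0] is the first step of the episode
    return policy, value


def psrl_episode(counts, R, true_P, horizon, rng, start_state=0):
    """One episode: sample a model from the posterior, plan on it, act, update.

    counts[s][a][s2] holds Dirichlet parameters (prior pseudo-counts plus
    observed transitions) and is updated in place.
    """
    n_states, n_actions = len(R), len(R[0])
    sampled_P = [[sample_dirichlet(counts[s][a], rng) for a in range(n_actions)]
                 for s in range(n_states)]
    policy, _ = plan_finite_horizon(sampled_P, R, horizon)
    state, total_reward = start_state, 0.0
    for h in range(horizon):
        action = policy[h][state]
        total_reward += R[state][action]
        next_state = rng.choices(range(n_states),
                                 weights=true_P[state][action])[0]
        counts[state][action][next_state] += 1.0  # Dirichlet posterior update
        state = next_state
    return total_reward


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy 2-state, 2-action MDP (illustrative values only).
    R = [[0.0, 1.0], [0.0, 0.0]]
    true_P = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
    counts = [[[1.0, 1.0] for _ in range(2)] for _ in range(2)]  # uniform prior
    returns = [psrl_episode(counts, R, true_P, 3, rng) for _ in range(50)]
    print(sum(returns) / len(returns))
```

Because the policy is planned on a single posterior sample per episode, the amount of exploration is driven by posterior uncertainty alone. A regularization parameter of the kind described in the abstract would presumably act on this sampling or planning step, but how it does so is not specified here.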