Online Markov decision processes under bandit feedback

Neu, Gergely and György, András and Szepesvári, Csaba and Antos, András (2010) Online Markov decision processes under bandit feedback. In: NIPS 2010. Neural Information Processing Systems Foundation proceedings. Vancouver, 2010.

We consider online learning in finite stochastic Markovian environments where, in each time step, a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition; however, it does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state-of-the-art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and, assuming that stationary policies mix uniformly fast, we show that after T time steps the expected regret of the new algorithm is O(T^(2/3)(ln T)^(1/3)), giving the first rigorously proved regret bound for the problem.
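
The interaction protocol described in the abstract can be illustrated with a minimal Python sketch. It is not the algorithm proposed in the paper; it only simulates the setting (known transitions, reward functions fixed in advance by an oblivious adversary, bandit feedback) and computes the benchmark the regret is measured against. All names (`simulate`, `expected_reward`, `best_stationary`, `P`, `rewards`) are hypothetical, and for simplicity the benchmark is restricted to deterministic stationary policies found by brute-force enumeration, which is an assumption of this sketch.

```python
import random
from itertools import product


def simulate(P, rewards, policy, x0=0, seed=0):
    """Run one trajectory and return the total reward collected.

    P[x][a][y]       : known probability of moving from state x to y under action a.
    rewards[t][x][a] : reward function for step t, fixed before the interaction
                       starts (oblivious adversary).
    policy(t, x, feedback) -> action, where feedback lists the
                       (state, action, reward) triples observed so far.
    """
    rng = random.Random(seed)
    x, total, feedback = x0, 0.0, []
    for t in range(len(rewards)):
        a = policy(t, x, feedback)
        r = rewards[t][x][a]  # bandit feedback: only this single entry is revealed
        feedback.append((x, a, r))
        total += r
        # Sample the next state from the known transition distribution.
        x = rng.choices(range(len(P)), weights=P[x][a])[0]
    return total


def expected_reward(P, rewards, pi, x0=0):
    """Exact expected total reward of a deterministic stationary policy.

    pi is a tuple mapping each state index to an action index.
    """
    n = len(P)
    dist = [0.0] * n
    dist[x0] = 1.0
    total = 0.0
    for r_t in rewards:
        total += sum(dist[x] * r_t[x][pi[x]] for x in range(n))
        # Propagate the state distribution one step under pi.
        nxt = [0.0] * n
        for x in range(n):
            for y in range(n):
                nxt[y] += dist[x] * P[x][pi[x]][y]
        dist = nxt
    return total


def best_stationary(P, rewards, x0=0):
    """Brute-force search over deterministic stationary policies (sketch only)."""
    n_states, n_actions = len(P), len(P[0])
    return max(
        product(range(n_actions), repeat=n_states),
        key=lambda pi: expected_reward(P, rewards, pi, x0),
    )
```

Given a learner's `policy`, an estimate of its regret on one run is `expected_reward(P, rewards, best_stationary(P, rewards)) - simulate(P, rewards, policy)`; the bound in the abstract concerns the expectation of this quantity over the randomness of the environment and the learner.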

Item Type: Conference or Workshop Item (Paper)
Uncontrolled Keywords: online learning, Markov decision process, bandit feedback
Subjects: Q Science > QA Mathematics and Computer Science > QA75 Electronic computers. Computer science
Depositing User: Eszter Nagy
Date Deposited: 12 Dec 2012 08:38
Last Modified: 12 Dec 2012 08:38
