Value-iteration based fitted policy iteration: learning with a single trajectory

Antos, András and Szepesvári, Csaba and Munos, Rémi (2007) Value-iteration based fitted policy iteration: learning with a single trajectory. In: ADPRL 07: 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, Honolulu, 2007.

Abstract

ADPRL 2007. Honolulu, Hawaii, Apr 1-5, 2007. We consider batch reinforcement learning in continuous-space, expected total discounted-reward Markovian Decision Problems, where the training data consist of a single trajectory of some fixed behaviour policy. The algorithm studied is policy iteration in which, in successive iterations, the action-value functions of the intermediate policies are obtained by means of approximate value iteration. PAC-style polynomial bounds are derived on the number of samples needed to guarantee near-optimal performance. The bounds depend on the mixing rate of the trajectory, the smoothness properties of the underlying Markovian Decision Problem, and the approximation power and capacity of the function set used. One of the main novelties of the paper is that new smoothness constraints are introduced, thereby significantly extending the scope of previous results.
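
The loop structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy one-dimensional MDP, the uniformly random behaviour policy, the one-hot state-aggregation function set, and all names (`features`, `simulate`, `fitted_policy_iteration`, `greedy_actions`) are assumptions made here for illustration. The outer loop is policy iteration; the inner loop evaluates the current greedy policy by fitted (least-squares) value iteration on the transitions of one trajectory.

```python
import numpy as np

N_ACTIONS = 2  # hypothetical toy problem: action 0 moves left, action 1 moves right


def features(x, n_bins=10):
    """One-hot state-aggregation features on [0, 1] (an illustrative function set)."""
    idx = np.minimum((np.atleast_1d(x) * n_bins).astype(int), n_bins - 1)
    return np.eye(n_bins)[idx]


def simulate(n_steps, rng):
    """Collect a single trajectory under a uniformly random behaviour policy."""
    x = rng.uniform()
    X, A, R, X_next = [], [], [], []
    for _ in range(n_steps):
        a = int(rng.integers(N_ACTIONS))
        drift = 0.1 if a == 1 else -0.1
        x_next = float(np.clip(x + drift + 0.02 * rng.standard_normal(), 0.0, 1.0))
        # Assumed reward: the next state itself, so moving right is optimal.
        X.append(x); A.append(a); R.append(x_next); X_next.append(x_next)
        x = x_next
    return np.array(X), np.array(A), np.array(R), np.array(X_next)


def fitted_policy_iteration(X, A, R, X_next, n_bins=10, gamma=0.9,
                            n_policy_iters=5, n_value_iters=30, ridge=1e-6):
    """Policy iteration whose evaluation step is fitted value iteration."""
    Phi, Phi_next = features(X, n_bins), features(X_next, n_bins)
    W = np.zeros((n_bins, N_ACTIONS))  # Q(x, a) is approximated by features(x) @ W[:, a]
    rows = np.arange(len(R))
    for _ in range(n_policy_iters):
        # Policy improvement: act greedily w.r.t. the current Q estimate.
        next_actions = np.argmax(Phi_next @ W, axis=1)
        # Policy evaluation: repeat "build regression targets, then refit".
        V = np.zeros_like(W)
        for _ in range(n_value_iters):
            targets = R + gamma * (Phi_next @ V)[rows, next_actions]
            for a in range(N_ACTIONS):
                P = Phi[A == a]
                V[:, a] = np.linalg.solve(P.T @ P + ridge * np.eye(n_bins),
                                          P.T @ targets[A == a])
        W = V
    return W


def greedy_actions(W, x):
    """Greedy action of the fitted Q-function at each state in x."""
    return np.argmax(features(x, W.shape[0]) @ W, axis=1)


rng = np.random.default_rng(0)
W = fitted_policy_iteration(*simulate(5000, rng))
print(greedy_actions(W, np.linspace(0.05, 0.95, 10)))
```

State aggregation is chosen here only because a per-bin average keeps the inner iteration stable; the sample-complexity bounds in the paper concern general function sets and are not reproduced by this sketch.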

Item Type: Conference or Workshop Item (Paper)
Subjects: Q Science > QA Mathematics and Computer Science > QA75 Electronic computers. Computer science
Depositing User: Eszter Nagy
Date Deposited: 11 Dec 2012 15:26
Last Modified: 11 Dec 2012 15:26
URI: http://eprints.sztaki.hu/id/eprint/4404
