ICML-98 Submission #192

An Analysis of Direct Reinforcement Learning in Non-Markovian Domains

Authors: Mark D. Pendrith
         Daimler-Benz Research and Technology
         1510 Page Mill Rd
         Palo Alto, CA 94304
   
               and
  
         Michael J. McGarity
         School of Electrical Engineering
         The University of New South Wales
         Sydney 2052 Australia

Abstract

It is well known that for Markov Decision Processes, the policies stable
under policy-iteration and the standard reinforcement learning methods
are exactly the optimal policies. In this paper, we investigate the
conditions for policy stability in the more general situation when the
Markov property cannot be assumed. We show that for a general class of
non-Markov decision processes, if actual return (Monte Carlo) credit
assignment is used with undiscounted returns, we are still guaranteed
that the optimal observation-based policies will be equilibrium points in
the policy space when using the standard ``direct'' reinforcement
learning approaches. However, if either discounted rewards or a
temporal-difference style of credit assignment is used, this
guarantee no longer holds.
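
As an informal illustration of the two credit assignment styles contrasted
above, the following minimal sketch computes actual (Monte Carlo) returns and
one-step temporal-difference targets for a single episode. The function names,
the episode data, and the value estimates are hypothetical and chosen only for
illustration; they are not taken from the paper.

```python
def monte_carlo_returns(rewards, gamma=1.0):
    """Actual return from each step: G_t = r_t + gamma * G_{t+1}.

    With gamma = 1.0 this is the undiscounted actual return.
    """
    returns = []
    g = 0.0
    # Accumulate backwards from the end of the episode.
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def td0_targets(rewards, next_values, gamma=1.0):
    """One-step TD targets: r_t + gamma * V(o_{t+1}).

    next_values holds the learner's current value estimates for the
    successor observations (hypothetical numbers here), so the target
    bootstraps from estimates rather than from the actual return.
    """
    return [r + gamma * v for r, v in zip(rewards, next_values)]


# Hypothetical three-step episode with a single terminal reward.
rewards = [0.0, 0.0, 1.0]
undiscounted = monte_carlo_returns(rewards)            # gamma = 1.0
discounted = monte_carlo_returns(rewards, gamma=0.5)
bootstrapped = td0_targets(rewards, [0.2, 0.4, 0.0])
```

The Monte Carlo targets depend only on the rewards actually received, whereas
the TD targets also depend on value estimates attached to observations, which
is where the two methods can diverge when observations are not Markov states.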

Keywords: Reinforcement learning, non-Markov Decision Processes,
          theoretical analysis
          
Contact: pendrith@rtna.daimlerbenz.com, (650) 845-2534