Policy Optimization in Robust Markov Decision Processes with Transition Gradient Theorem
 

Abstract: 

Reinforcement Learning (RL) is a powerful framework for sequential decision making. However, standard RL methods often struggle when the environment dynamics are uncertain, leading to poor performance in real-world applications such as autonomous navigation and robotic control. This limitation is a significant factor contributing to the lack of widespread adoption of RL-based control systems in industry.

To address this challenge, researchers introduced the robust Markov Decision Process (MDP), a sequential decision-making framework that explicitly models uncertainty in the transition function. Solving a robust MDP means finding a policy that performs well across an entire set of possible transition functions, which makes the framework appealing in domains where the environment dynamics are uncertain or changing.
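For concreteness, the robust objective can be written as a max-min problem (the notation below is ours and may differ from the thesis's exact formulation):

\[
\max_{\pi} \; \min_{p \in \mathcal{P}} \; \mathbb{E}_{\pi, p}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right],
\]

where \(\mathcal{P}\) denotes the uncertainty set of transition functions and \(\gamma \in [0, 1)\) is the discount factor.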

In this thesis, we model a robust MDP as a two-player game. The first player is the policy, trained via standard policy optimization methods. The second player is an adversary that selects transition functions to degrade the policy's performance. A key contribution of this work is the transition gradient theorem, which enables effective training of the adversary by providing a principled way to compute gradients of the return with respect to the transition function. The two players are updated in an alternating fashion, as sketched below.
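The alternating scheme can be illustrated on a small tabular problem. The sketch below is ours, not the thesis's algorithm: it uses softmax-parameterized policy and transition tables, finite-difference gradients as a stand-in for the analytic policy and transition gradients, and it omits the projection of the adversary onto an uncertainty set.

```python
# Toy alternating policy/adversary optimization on a tabular MDP.
# All sizes, parameterizations, and the use of finite differences are
# illustrative assumptions.
import numpy as np

S, A, H, GAMMA = 3, 2, 20, 0.95        # states, actions, horizon, discount
rng = np.random.default_rng(0)
R = rng.uniform(size=(S, A))           # fixed reward table r(s, a)

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def expected_return(policy_logits, trans_logits):
    """Finite-horizon discounted return under pi(a|s) and p(s'|s,a)."""
    pi = softmax(policy_logits)                 # shape (S, A)
    P = softmax(trans_logits)                   # shape (S, A, S)
    d = np.full(S, 1.0 / S)                     # uniform initial state distribution
    total = 0.0
    for t in range(H):
        total += GAMMA ** t * np.einsum("s,sa,sa->", d, pi, R)
        d = np.einsum("s,sa,sat->t", d, pi, P)  # propagate the state distribution
    return total

def num_grad(f, x, eps=1e-5):
    """Central finite differences; a stand-in for the analytic gradients."""
    g = np.zeros_like(x)
    for i in np.ndindex(x.shape):
        x[i] += eps; hi = f(x)
        x[i] -= 2 * eps; lo = f(x)
        x[i] += eps
        g[i] = (hi - lo) / (2 * eps)
    return g

policy_logits = np.zeros((S, A))
trans_logits = np.zeros((S, A, S))
lr = 0.5

for step in range(200):
    # Policy player: gradient ascent on the expected return.
    g_pi = num_grad(lambda th: expected_return(th, trans_logits), policy_logits)
    policy_logits += lr * g_pi
    # Adversarial player: gradient descent on the same objective with respect
    # to the transition parameters (no uncertainty-set projection in this toy).
    g_p = num_grad(lambda ph: expected_return(policy_logits, ph), trans_logits)
    trans_logits -= lr * g_p

print("return under learned policy vs. adversarially trained transitions:",
      expected_return(policy_logits, trans_logits))
```

In the thesis's setting, the adversary's gradient would come from the transition gradient theorem rather than finite differences, and its update would be constrained to the uncertainty set of transition functions.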

We validate the proposed approach on simple environments to demonstrate its robustness, and then scale it up to complex robotic manipulation tasks. Our results show that robust MDP methods can scale to these settings and remain effective under real-world uncertainty, highlighting their potential for practical applications.

Committee: 

  • Chen-Yu Wei, Committee Chair (CS/SEAS/UVA)
  • Shangtong Zhang, Advisor (CS/SEAS/UVA)
  • Yen-Ling Kuo (CS/SEAS/UVA)