Files in this item
File: MAOTHESIS2021.pdf (1 MB)
Description: (none provided)
Format: application/pdf
Description
Title:  Model-free reinforcement learning in non-stationary Markov Decision Processes 
Author(s):  Mao, Weichao 
Advisor(s):  Başar, Tamer 
Department / Program:  Electrical & Computer Eng 
Discipline:  Electrical & Computer Engr 
Degree Granting Institution:  University of Illinois at Urbana-Champaign 
Degree:  M.S. 
Genre:  Thesis 
Subject(s):  reinforcement learning
Markov decision process
non-stationarity
Abstract:  Reinforcement learning (RL) studies the problem where an agent maximizes its cumulative reward through sequential interactions with an initially unknown environment, usually modeled by a Markov Decision Process (MDP). The classical RL literature typically assumes that the state transition functions and the reward functions of the MDP are time-invariant. Such a stationary model, however, cannot capture the dynamic nature of many sequential decision-making problems in practice. In this thesis, we consider the problem of reinforcement learning in \emph{non-stationary} MDPs. In our setting, both the reward functions and the state transition distributions are allowed to vary over time, either gradually or abruptly, as long as their cumulative variation magnitude does not exceed certain budgets. We propose an algorithm, named Restarted Q-Learning with Upper Confidence Bounds (RestartQ-UCB), for this setting, which adopts a simple restarting strategy and an extra optimism term. We theoretically show that RestartQ-UCB outperforms existing solutions in terms of dynamic regret, a notion commonly utilized to measure the performance of an online learning algorithm in a non-stationary environment. Specifically, RestartQ-UCB with Freedman-type bonus terms achieves a dynamic regret bound of $\widetilde{O}(S^{\frac{1}{3}} A^{\frac{1}{3}} \Delta^{\frac{1}{3}} H T^{\frac{2}{3}})$, where $S$ and $A$ are the numbers of states and actions, respectively, $\Delta>0$ is the variation budget, $H$ is the number of time steps per episode, and $T$ is the total number of time steps. We further show that our algorithm is nearly optimal by establishing an information-theoretic lower bound of $\Omega(S^{\frac{1}{3}} A^{\frac{1}{3}} \Delta^{\frac{1}{3}} H^{\frac{2}{3}} T^{\frac{2}{3}})$, which to the best of our knowledge is the first impossibility result that characterizes the fundamental limits of non-stationary RL in general. 
To the best of our knowledge, RestartQ-UCB is the first model-free algorithm for non-stationary RL. Compared with model-based solutions, our algorithm is more time- and space-efficient, flexible, and compatible with modern deep RL architectures. We empirically evaluate RestartQ-UCB on RL tasks with both abrupt and gradual types of non-stationarity. Simulation results validate the advantages of RestartQ-UCB in terms of cumulative rewards and computational efficiency. We further demonstrate the power of our results through a ``learning to collaborate'' example in the context of multi-agent RL, where non-stationarity is a key challenge. 
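The restarting strategy described in the abstract can be sketched as follows. This is not the thesis's exact RestartQ-UCB algorithm (which uses Freedman-type bonus terms); it is a minimal illustration of episodic optimistic Q-learning that periodically re-initializes its estimates, using a simpler Hoeffding-style bonus. The `env_step`/`env_reset` interface, the epoch length `restart_every`, and the bonus constant `c` are assumptions made for this sketch, not details taken from the thesis.

```python
import numpy as np

def restart_q_ucb(env_step, env_reset, S, A, H, num_episodes, restart_every, c=1.0):
    """Sketch of restarted optimistic Q-learning (not the thesis's exact algorithm).

    env_step(s, a, h) -> (reward, next_state); env_reset() -> initial state.
    Q-values and visit counts are re-initialized every `restart_every` episodes,
    so estimates collected under an earlier environment are discarded and the
    learner can track non-stationary rewards and transitions.
    """
    total_reward = 0.0
    for ep in range(num_episodes):
        if ep % restart_every == 0:  # restart: forget stale estimates
            Q = np.full((H, S, A), float(H))  # optimistic initialization
            N = np.zeros((H, S, A))           # visit counts per (step, state, action)
        s = env_reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))       # greedy w.r.t. optimistic Q
            r, s2 = env_step(s, a, h)
            total_reward += r
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)         # step size common in episodic Q-learning
            bonus = c * np.sqrt(H**3 / t)     # Hoeffding-style optimism term
            v_next = Q[h + 1, s2].max() if h + 1 < H else 0.0
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + v_next + bonus)
            Q[h, s, a] = min(Q[h, s, a], float(H))  # clip at the maximum return
            s = s2
    return total_reward
```

The restart is the key mechanism for handling non-stationarity: without it, Q-values fitted to an old reward or transition function would keep biasing action selection after the environment changes.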
Issue Date:  2021-04-09 
Type:  Thesis 
URI:  http://hdl.handle.net/2142/110451 
Rights Information:  Copyright 2021 Weichao Mao 
Date Available in IDEALS:  2021-09-17 
Date Deposited:  2021-05 
This item appears in the following Collection(s)

Graduate Dissertations and Theses at Illinois