Interactive Online Learning with Incomplete Knowledge
The past decades have witnessed a prominent trend of adopting intelligent systems, such as recommendation systems and smart homes, into ordinary people's daily lives. One key characteristic of such systems is the need for online sequential decision making: decisions have to be made while the learning agent has only incomplete knowledge about the world/environment. The consequences of such decisions, in turn, contribute to the data the agent can collect, forming an interactive feedback loop between the agent and the environment. This makes conventional machine learning methods based on offline training inadequate and urges a move from the passive learning paradigm to a more interactive and proactive one. It motivates research into interactive online learning solutions, such as contextual bandits and, more generally, reinforcement learning, for real-world systems.
Interactive online learning studies how an agent can interact with an environment to learn a policy that maximizes the expected cumulative reward for a task. In a real-world intelligent system, the learning agent faces environments that consist of human users. This brings at least two significant challenges to developing interactive online learning solutions. First, to capture user heterogeneity, personalized learning solutions are needed; however, the sparsity of each individual user's observations, especially for new users, makes the learning process very slow. Second, many real-world systems are highly dynamic: users' preferences change over time due to various internal or external factors, and item popularity varies with fast-emerging events and content. Failing to model such dynamics may lead to sub-optimal decisions.
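The interaction loop described above can be sketched in a few lines. The following is an illustrative toy example, not any algorithm from this dissertation: an epsilon-greedy linear contextual bandit interacting with a synthetic environment, where the agent repeatedly observes context features, picks an arm, receives a noisy reward, and updates its estimate online. All names, dimensions, and the reward model are assumptions made for the sketch.

```python
import numpy as np

# Hypothetical synthetic setup: a hidden linear "user preference" vector
# generates rewards; the agent must learn it purely from interaction.
rng = np.random.default_rng(0)
d, n_arms, horizon = 5, 10, 2000
theta_true = rng.normal(size=d)          # hidden preference vector (unknown to agent)

A = np.eye(d)                            # ridge-regression statistics: A = I + sum x x^T
b = np.zeros(d)                          # b = sum reward * x

total_reward = 0.0
for t in range(horizon):
    contexts = rng.normal(size=(n_arms, d))   # one feature vector per candidate arm
    theta_hat = np.linalg.solve(A, b)         # current preference estimate
    # epsilon-greedy: mostly exploit the estimate, occasionally explore at random
    if rng.random() < 0.05:
        arm = int(rng.integers(n_arms))
    else:
        arm = int(np.argmax(contexts @ theta_hat))
    x = contexts[arm]
    reward = x @ theta_true + 0.1 * rng.normal()  # noisy linear reward feedback
    A += np.outer(x, x)                           # online update from observed feedback
    b += reward * x
    total_reward += reward
```

The feedback loop is visible in the update: the arm chosen by the current estimate determines which data point the agent gets to learn from next.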
These two fundamental challenges motivate the research in this dissertation. We address the first challenge by leveraging the dependencies that exist among users. Specifically, we develop a series of collaborative contextual bandit learning solutions in which information propagates through explicit or implicit dependencies among users. This information propagation mitigates the data sparsity issue and accelerates the personalized learning process. Rigorous theoretical guarantees are developed, which reveal the benefit of collaboration in the learning process when user dependencies do exist. We address the second challenge by moving beyond the commonly used but restrictive stationary environment assumption to a more realistic non-stationary one. We develop a suite of novel and theoretically sound contextual bandit solutions that automatically detect potential changes in the environment and adapt their decision making strategies accordingly. The solutions developed in this dissertation have been applied to a broad spectrum of recommendation systems, demonstrating their practical potential.
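To illustrate the non-stationarity issue, here is a minimal sketch (again an illustrative assumption, not the dissertation's actual detection methods): the agent monitors its recent prediction errors, and a sustained spike in error, which signals that the learned model no longer explains the observed rewards, triggers a model reset so learning restarts on the new environment. The abrupt preference shift, thresholds, and window size are all invented for this toy example.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_arms = 4, 8
theta = rng.normal(size=d)               # environment parameter; shifts mid-run

A, b = np.eye(d), np.zeros(d)            # ridge-regression statistics
window = []                              # recent absolute prediction errors
resets = 0

for t in range(4000):
    if t == 2000:                        # abrupt, unannounced change in preferences
        theta = rng.normal(size=d)
    contexts = rng.normal(size=(n_arms, d))
    theta_hat = np.linalg.solve(A, b)
    # epsilon-greedy arm selection, as in a standard linear bandit sketch
    if rng.random() < 0.1:
        arm = int(rng.integers(n_arms))
    else:
        arm = int(np.argmax(contexts @ theta_hat))
    x = contexts[arm]
    reward = x @ theta + 0.1 * rng.normal()
    window.append(abs(reward - x @ theta_hat))   # how surprised was the model?
    if len(window) > 50:
        window.pop(0)
    # crude change test: once the window is full, a large mean error
    # indicates the environment has changed, so discard the stale model
    if len(window) == 50 and np.mean(window) > 0.8:
        A, b, window = np.eye(d), np.zeros(d), []
        resets += 1
    A += np.outer(x, x)
    b += reward * x
```

A model that ignored the shift would keep recommending by the stale estimate; the reset lets the agent re-converge to the post-change preferences.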
- David Evans (Chair)
- Hongning Wang (Advisor)
- Quanquan Gu
- Lihong Li (Google Brain)
- Denis Nekipelov (Minor Representative)