Monitored Markov Decision Processes


In RL, an agent learns by interacting with an environment and receiving rewards for its actions. However, the assumption that rewards are always observable often does not hold in real-world problems. Consider an agent tasked with household chores, where feedback on the quality of its behavior comes as rewards from the homeowner and from smart sensors. What if rewards are not always observable, e.g., if the owner is not present or the sensors are malfunctioning? In such situations, the agent should not interpret the lack of reward as meaning that all behavior is equally desirable. Neither should it conclude that avoiding monitoring or intentionally damaging sensors is an effective way to avoid negative feedback.

In other words, there are cases where the environment generates rewards in response to the agent's actions but the agent cannot observe them. How should the agent behave in these situations? Can it learn to seek rewards even if their observability is not fully under its control? Can it learn optimal policies even when some rewards are never observable?


In a preliminary work presented at AAMAS, we formalized a novel but general RL framework — Monitored MDPs (Mon-MDPs) — where the agent cannot always observe rewards. Instead, the agent observes proxy rewards provided by a separate MDP — the monitor. We discussed the theoretical and practical consequences of this setting, showed the challenges that arise even in toy environments, and proposed algorithms to begin tackling this novel setting. This paper introduced a powerful new formalism that encompasses both new and existing problems, and laid the foundation for future research. In a follow-up NeurIPS paper, we presented a novel exploration paradigm to overcome the limitations of classic optimistic strategies, which are known to fail under partial observability of rewards. Although very general and applicable to plain MDPs as well, our directed exploration proved especially effective in Mon-MDPs, setting the stage for future exploration methods.
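To make the setting concrete, below is a minimal sketch of the interaction loop, assuming a Gym-style environment and agent interface. The names (Monitor, run_episode, p_off) are illustrative, not the papers' API, and the monitor is reduced to a stateless on/off process for brevity — in the full formalism it is an MDP with its own observations, actions, and rewards.

```python
# Minimal sketch of a Mon-MDP interaction loop (illustrative only).
# The environment always generates a reward, but the agent only sees a
# proxy reward emitted by the monitor, which may be unobservable (None).

import random


class Monitor:
    """Toy monitor: reveals the environment reward only while switched on."""

    def __init__(self, p_off=0.3):
        self.p_off = p_off  # probability the monitor is off at a given step

    def step(self, env_reward):
        is_on = random.random() > self.p_off
        # Proxy reward: the true reward if observable, otherwise unobservable.
        return env_reward if is_on else None


def run_episode(env, agent, monitor, horizon=100):
    obs = env.reset()
    for _ in range(horizon):
        action = agent.act(obs)
        next_obs, env_reward, done, _ = env.step(action)
        proxy_reward = monitor.step(env_reward)  # what the agent actually sees
        # The agent must not treat None as zero reward: it should learn from
        # observable rewards and still seek reward where it cannot observe it.
        agent.update(obs, action, proxy_reward, next_obs)
        obs = next_obs
        if done:
            break
```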

References

Beyond Optimism: Exploration With Partially Observable Rewards
Simone Parisi, Alireza Kazemipour, Michael Bowling
Neural Information Processing Systems (NeurIPS), 2024

Monitored Markov Decision Processes
Simone Parisi, Montaser Mohammedalamen, Alireza Kazemipour, Matthew E. Taylor, Michael Bowling
International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2024


Learning and Transfer of Exploration Policies

Classic RL exploration is task-driven (the extrinsic reward is the main drive of exploration) or relies on myopic intrinsic rewards (they favor the most promising short-term actions). Furthermore, learning follows a tabula-rasa approach — the agent learns from scratch, assuming isolated environments and no prior knowledge or experience.

How can we formulate deep, long-term exploration policies? Can we learn to explore environments before training the agent to solve tasks? How can we mimic human behavior — exploring multiple environments in a lifelong process, driven only by the inherent interestingness of the world?

In a preliminary work published in Algorithms, we presented a novel approach that (1) plans exploration actions far into the future by using a long-term visitation count, and (2) decouples exploration and exploitation by learning a separate function that assesses the exploration value of actions.
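As a rough tabular illustration of this decoupling — using a simple count-based intrinsic reward and made-up names (DecoupledExplorer, W), not the paper's implementation — an exploration value W can be learned by TD alongside the usual Q-function, so that visitation information propagates far into the future:

```python
# Sketch of decoupled exploration values on a tabular problem (assumed details).
# Q estimates the extrinsic return; W estimates a long-term "visitation value"
# built from counts and is used to pick exploratory actions.

import numpy as np


class DecoupledExplorer:
    def __init__(self, n_states, n_actions, gamma=0.99, gamma_w=0.99, lr=0.1):
        self.Q = np.zeros((n_states, n_actions))       # exploitation values
        self.W = np.zeros((n_states, n_actions))       # exploration values
        self.counts = np.zeros((n_states, n_actions))  # visitation counts
        self.gamma, self.gamma_w, self.lr = gamma, gamma_w, lr

    def act(self, s, explore=True):
        # Exploratory actions come from W, greedy actions from Q.
        return int(np.argmax(self.W[s] if explore else self.Q[s]))

    def update(self, s, a, r, s_next):
        self.counts[s, a] += 1
        # Intrinsic reward: higher for rarely visited state-action pairs.
        r_int = 1.0 / np.sqrt(self.counts[s, a])
        # Standard TD update for the extrinsic Q-function.
        td_q = r + self.gamma * self.Q[s_next].max() - self.Q[s, a]
        self.Q[s, a] += self.lr * td_q
        # TD update for W: bootstrapping propagates visitation information
        # into the future, giving long-term (non-myopic) exploration.
        td_w = r_int + self.gamma_w * self.W[s_next].max() - self.W[s, a]
        self.W[s, a] += self.lr * td_w
```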

In a NeurIPS paper, we proposed a novel framework to pre-train and transfer exploration in a task-agnostic manner. The agent first learns to explore across many environments, driven only by a novel intrinsic reward and without any extrinsic goal. Later on, the agent transfers the learned exploration policy to better explore new environments when solving tasks. The key idea of our framework is that exploration has two components: (1) an agent-centric component encouraging exploration of unseen parts of the environment based on the agent's belief; (2) an environment-centric component encouraging exploration of inherently interesting objects.
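A hypothetical sketch of the two-phase recipe is below; the function and attribute names (explore_policy, novelty, interestingness, mixed_policy) are placeholders for the components described above, not the released code.

```python
# Sketch of task-agnostic pre-training and transfer of exploration
# (names and signatures are illustrative assumptions).

def pretrain_exploration(envs, agent, steps):
    """Phase 1: learn a task-agnostic exploration policy across many envs."""
    for env in envs:
        obs = env.reset()
        for _ in range(steps):
            action = agent.explore_policy(obs)
            next_obs, _, done, _ = env.step(action)  # extrinsic reward ignored
            # Intrinsic reward = agent-centric novelty + environment-centric
            # interestingness of the objects currently in view.
            r_int = agent.novelty(next_obs) + agent.interestingness(next_obs)
            agent.update_exploration(obs, action, r_int, next_obs)
            obs = next_obs if not done else env.reset()


def transfer_and_solve(env, agent, steps):
    """Phase 2: reuse the exploration policy while learning the task."""
    obs = env.reset()
    for _ in range(steps):
        # E.g., act exploratorily with some probability, greedily otherwise.
        action = agent.mixed_policy(obs)
        next_obs, r_ext, done, _ = env.step(action)
        agent.update_task(obs, action, r_ext, next_obs)
        obs = next_obs if not done else env.reset()
```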

The results of our approach were extremely promising and opened several avenues of research. Is there a universal formulation of intrinsic curiosity and interestingness? How important are state representations for learning and transferring exploration policies? Can we use out-of-domain data — e.g., collected from the internet — to train exploration policies?

References

Long-Term Visitation Value for Deep Exploration in Sparse Reward Reinforcement Learning
Simone Parisi, Davide Tateo, Maximilian Hensel, Carlo D'Eramo, Jan Peters, Joni Pajarinen
Algorithms, 2022

Interesting Object, Curious Agent: Learning Task-Agnostic Exploration
Simone Parisi, Victoria Dean, Deepak Pathak, Abhinav Gupta
Neural Information Processing Systems (NeurIPS), 2021


Learning and Transfer of State Representations


In computer vision and natural language processing, recent advances have made it possible to exploit massive amounts of data to pre-train perception models. These models can be successfully used "off-the-shelf" to solve many different downstream applications without any further training. By contrast, many RL algorithms still follow a "tabula-rasa" paradigm where the agent performs millions or even billions of in-domain interactions with the environment to learn task-specific visuo-motor policies from scratch.

Can we instead train a single near-universal vision model — a model pre-trained entirely on out-of-domain data that works for nearly any RL task?

In a preliminary work presented at ICML, we studied well-known pre-trained vision models in the context of control. Are supervised models better than self-supervised models? What kind of invariances are relevant for the perception module of the control policy? Is the feature hierarchy of the vision layers important for control?
By investigating these fundamental questions, we succeeded in making a single off-the-shelf vision model — trained on out-of-domain datasets — competitive with, or even better than, ground-truth features on all four control domains. Because efficient, compact state features are hard to estimate in unstructured real-world environments, where the agent must rely on raw visual input, such a model can be extremely beneficial, dramatically reducing data requirements and improving policy performance.
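As an illustration of the resulting setup — with an ImageNet-pre-trained ResNet-50 used here merely as a stand-in for the vision models studied in the paper — the encoder is kept frozen and only a small policy head is trained by the RL algorithm:

```python
# Minimal sketch of the "off-the-shelf perception" setup (assumed details):
# a frozen, pre-trained vision encoder turns raw pixels into state features,
# and only a small policy head is trained with RL on top of them.

import torch
import torch.nn as nn
import torchvision.models as models

# Frozen encoder pre-trained on out-of-domain data (ImageNet, as an example).
encoder = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
encoder.fc = nn.Identity()  # keep the 2048-d features, drop the classifier
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False

# Small trainable policy head mapping features to actions (8 is an assumed
# action dimension).
policy = nn.Sequential(nn.Linear(2048, 256), nn.ReLU(), nn.Linear(256, 8))


def act(pixels):
    # pixels: (batch, 3, H, W) tensor, resized/normalized for the encoder.
    with torch.no_grad():
        features = encoder(pixels)  # frozen "state representation"
    return policy(features)         # only this part is updated by RL
```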

References

The (Un)Surprising Effectiveness of Pre-Trained Vision Models for Control
Simone Parisi*, Aravind Rajeswaran*, Senthil Purushwalkam, Abhinav Gupta
International Conference on Machine Learning (ICML), 2022


Improving Actor-Critic Stability

Actor-critic methods can achieve impressive performance, but they are also prone to instability. This is partly due to the interaction between the actor and the critic during learning: an inaccurate step taken by one might adversely affect the other and destabilize learning.

How can we make any actor-critic method more stable, especially when only a few training samples are available?

Our novel method, TD-regularized actor-critic (TD-REG), regularizes the actor loss by penalizing the TD-error of the critic. This improves stability by avoiding large steps in the actor update whenever the critic is highly inaccurate. TD-REG is a simple plug-and-play approach to improve stability and overall performance of any actor-critic method.
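A generic rendering of the idea (not the paper's exact implementation) looks as follows; in practice the TD-error term must depend on the actor's action — e.g., via reparameterized sampling — so that its gradient reaches the policy:

```python
# Sketch of a TD-regularized actor loss: the standard policy-gradient
# surrogate plus a penalty on the critic's squared TD-error, scaled by eta.

def td_regularized_actor_loss(log_probs, advantages, td_errors, eta=0.1):
    """log_probs, advantages, td_errors: PyTorch tensors over a batch."""
    # Standard policy-gradient surrogate (to be minimized, hence the sign).
    pg_loss = -(log_probs * advantages.detach()).mean()
    # TD-regularization: discourage large actor steps where the critic is
    # inaccurate (i.e., where its TD-error is large).
    td_penalty = eta * (td_errors ** 2).mean()
    return pg_loss + td_penalty
```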

References

TD-Regularized Actor-Critic Methods
Simone Parisi, Voot Tangkaratt, Jan Peters, Mohammad Emtiyaz Khan
Machine Learning, 2019


RL Applied to Robotics: The Tetherball Platform

Motor skill learning is an important challenge on the way to endowing robots with the ability to acquire a wide range of skills and solve complex tasks. In the last decade, RL has demonstrated the ability to acquire a variety of skills, ranging from the ball-in-a-cup game to walking and jumping.

However, comparing RL against human programming is not straightforward. RL policies try to optimize a given reward function, but using this reward function as a comparison measure would introduce a bias, as there is no guarantee that the manually programmed player maximizes the same reward function.

To address the problem of finding a fair evaluation measure, we proposed a robotic task based on the game of tetherball. A game, in fact, comes with a pre-defined success measure: the game score. The robotic platform consisted of two cable-driven lightweight robots capable of highly dynamic behavior thanks to springs between the motors and the cables that drive the joints. We manually programmed one robot player using the complete model of the tetherball game, while we trained the other player using RL. Evaluated on real games, the RL player outperformed the manually programmed one by winning more often. By learning through trial and error, the RL player could compensate for the prediction errors of the highly nonlinear forward dynamics model and for errors in ball tracking.

References

Reinforcement Learning vs Human Programming in Tetherball Robot Games
Simone Parisi, Hany Abdulsamad, Alexandros Paraschos, Christian Daniel, Jan Peters
International Conference on Intelligent Robots and Systems (IROS), 2015


Multi-Objective RL


Many real-world control applications are characterized by the presence of multiple conflicting objectives. In these problems, the goal is to find the Pareto frontier, a set of policies representing different compromises among the objectives.

The Pareto frontier encapsulates all the trade-offs among the objectives and gives better insight into the problem, thus helping the a posteriori selection of the most favorable solution. How can we efficiently build an approximation of the frontier that contains solutions that are accurate, diverse, and evenly distributed?

My research on this topic has focused on manifold-based methods that return a continuous approximation of the Pareto frontier: the idea is to optimize the parameters of a function defining a manifold in the policy parameter space, so that its image in the objective space gets as close as possible to the true Pareto frontier. This allows learning an approximation in a single pass instead of running multiple optimizations. In a preliminary work, we presented a gradient descent method and investigated its sample complexity and the effects of its hyperparameters. Later, we improved it with episodic exploration strategies, importance sampling, and novel losses to assess the quality of a Pareto frontier approximation.
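A toy sketch of the idea follows; phi (the manifold parameterization), J (the vector of expected returns), and the surrogate frontier loss are placeholders — the papers use principled quality metrics and estimators.

```python
# Sketch of manifold-based Pareto frontier approximation (illustrative only).

import numpy as np


def sample_manifold_points(rho, n_points, phi):
    """Map manifold coordinates t in [0, 1] to policy parameters theta = phi(rho, t)."""
    ts = np.linspace(0.0, 1.0, n_points)
    return [phi(rho, t) for t in ts]


def frontier_loss(rho, phi, J, n_points=20):
    """Quality of the frontier approximation induced by rho.

    J(theta) returns the vector of expected returns (one entry per objective).
    Here we use a crude surrogate rewarding accuracy and spread; the papers
    use principled losses (e.g., hypervolume-based indicators).
    """
    points = np.array([J(theta) for theta in sample_manifold_points(rho, n_points, phi)])
    accuracy = points.sum()                                            # push toward the frontier
    diversity = np.linalg.norm(points - points.mean(0), axis=1).sum()  # spread the points out
    return -(accuracy + 0.1 * diversity)

# A single optimization over rho (e.g., gradient descent on frontier_loss)
# then yields a continuous approximation of the whole Pareto frontier.
```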

References

Manifold-based Multi-objective Policy Search with Sample Reuse
Simone Parisi, Matteo Pirotta, Jan Peters
Neurocomputing, 2017

Multi-objective Reinforcement Learning through Continuous Pareto Manifold Approximation
Simone Parisi, Matteo Pirotta, Marcello Restelli
Journal of Artificial Intelligence Research (JAIR), 2016

Policy Gradient Approaches for Multi-Objective Sequential Decision Making
Simone Parisi, Matteo Pirotta, Nicola Smacchia, Luca Bascetta, Marcello Restelli
International Joint Conference on Neural Networks (IJCNN), 2014