On the expressivity of Markov reward

Estimated read time: 4 min

Reward is the driving force for reinforcement learning (RL) agents. Because of its central role in RL, reward is often assumed to be suitably general in what it can express, as summed up in the reward hypothesis of Sutton and Littman:

“…all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward).”

– Sutton (2004), Littman (2017)
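
Written out in the usual notation (a discounted, infinite-horizon formulation is assumed here purely for illustration; the hypothesis itself is stated more generally), the quantity being maximized is the expected return:

```latex
% The return G_t is the cumulative (discounted) sum of the scalar reward signal;
% the reward hypothesis says that maximizing its expected value is expressive
% enough to capture all of what we mean by goals and purposes.
G_t \;=\; \sum_{k=0}^{\infty} \gamma^{k}\, R_{t+k+1},
\qquad
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\, G_0 \,\right],
\qquad \gamma \in [0, 1).
```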

In our work, we are taking the first steps towards a systematic study of this hypothesis. To do this, we consider the following thought experiment involving Alice, the designer, and Bob, the learning agent:

Suppose Alice thinks of a task she would like Bob to learn to solve – the task could take the form of a natural-language description (“balance this pole”), an imagined state of affairs (“reach any of the winning configurations of a chessboard”), or something more traditional like a reward or value function. We then imagine Alice translating her choice of task into some generator of a learning signal (such as reward) that Bob, the learning agent, will use to learn from throughout his lifetime. We ground our study of the reward hypothesis in the following question: given Alice’s choice of task, is there always a reward function that can convey that task to Bob?

What is a task?

To make our study of this question concrete, we first restrict our focus to three kinds of task. In particular, we introduce three task types that we believe capture sensible kinds of tasks: 1) a set of acceptable policies (SOAP), 2) a policy order (PO), and 3) a trajectory order (TO). These three task types are concrete instances of the kinds of task we might want an agent to learn to solve.
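
As a rough, hedged illustration of these three objects (the type names and representations below are my own sketch, not the paper's formal definitions, and the two orders are simplified to ordered tiers of equally preferred items), one could represent them along the following lines:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = Tuple[int, int]        # e.g. an (x, y) cell in a grid world
Action = str                   # e.g. "up", "down", "left", "right"
Policy = Dict[State, Action]   # a deterministic Markov policy
Trajectory = Tuple[Tuple[State, Action], ...]   # a finite path through the world


@dataclass
class SOAP:
    """Set Of Acceptable Policies: the task is solved by any policy in the set."""
    acceptable: List[Policy]


@dataclass
class PolicyOrder:
    """Policy Order: a preference ranking over policies (most-preferred tier first)."""
    tiers: List[List[Policy]]


@dataclass
class TrajectoryOrder:
    """Trajectory Order: a preference ranking over trajectories rather than policies."""
    tiers: List[List[Trajectory]]
```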

We next study whether reward is able to capture each of these task types in finite environments. Crucially, we restrict attention to Markov reward functions: for example, given a state space that is sufficient to form a task, such as (x, y) pairs in a grid world, is there a reward function defined only on this same state space that can capture the task?
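
To make the Markov restriction concrete, here is a minimal sketch (the function names and grid-world types are illustrative, not from the paper): a Markov reward may look only at the current transition, so it has no way to consult the history that produced it.

```python
from typing import Dict, Tuple

State = Tuple[int, int]   # an (x, y) cell in the grid world
Action = str

# A Markov reward: a fixed table over the current state and action only.
def markov_reward(table: Dict[Tuple[State, Action], float],
                  state: State, action: Action) -> float:
    return table.get((state, action), 0.0)

# A history-based signal like this one is *not* Markov: it conditions on the
# whole path so far, not just the current step.
def history_based_signal(history: Tuple[State, ...], action: Action) -> float:
    # Placeholder property of the *path*, e.g. "the agent has kept moving in one
    # direction so far"; a Markov reward cannot express a condition like this,
    # because it never sees the history.
    kept_one_direction = len(set(history)) == len(history)   # no cell revisited yet
    return 1.0 if kept_one_direction else 0.0
```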

First major finding

Our first main result shows that, for each of the three task types, there are environment–task pairs for which no Markov reward function can capture the task. One example of such a pair is the “navigate around the grid clockwise or counterclockwise” task in a typical grid world.

This task is naturally captured by a SOAP consisting of two acceptable policies: the “clockwise” policy and the “counterclockwise” policy. For a Markov reward function to express this task, it would need to make exactly these two policies strictly higher in value than all other deterministic policies. However, no such Markov reward function exists: the optimality of a single “move clockwise” action depends on whether the agent was already moving in that direction in the past. Since the reward function must be Markov, it cannot convey this kind of information. Similar examples show that Markov reward cannot capture every policy order and trajectory order either.

Second major finding

Since some tasks can be captured and others cannot, we next explore whether there is an efficient procedure for determining whether a given task can be captured by reward in a given environment. Moreover, if there is a reward function that captures the given task, we would ideally like to be able to output that reward function. Our second main result is a positive one: for any task–environment pair, with the task drawn from the three types above and the environment finite, there is a procedure that can 1) decide whether the task can be captured by Markov reward in the given environment, and 2) output the desired reward function that exactly conveys the task, when such a function exists.
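
To give a feel for what such a procedure might involve, below is a brute-force sketch for the SOAP case in a tiny, hand-built deterministic ring world (four states, "cw"/"ccw" actions, a single start state, rewards bounded to [-1, 1], and a fixed strictness margin). All of these modelling choices, and the use of a linear program via scipy.optimize.linprog, are my own illustration; this is not the procedure from the paper. The sketch relies on the fact that a policy's value is linear in the reward function, so searching for a realizing reward becomes a linear feasibility problem.

```python
import itertools
import numpy as np
from scipy.optimize import linprog   # assumption: SciPy is available

# A tiny deterministic "ring" world: states 0..3, actions move clockwise or
# counterclockwise. The SOAP contains exactly the two policies that always go
# one way around the ring.
N_STATES, ACTIONS = 4, ("cw", "ccw")
GAMMA, START, MARGIN = 0.9, 0, 1e-3

def step(s, a):
    return (s + 1) % N_STATES if a == "cw" else (s - 1) % N_STATES

def value_coeffs(policy):
    """Coefficients c such that V^pi(START) = c . R, where R is the flattened
    reward table R(s, a); computed from discounted state-visitation counts."""
    P = np.zeros((N_STATES, N_STATES))
    for s in range(N_STATES):
        P[s, step(s, policy[s])] = 1.0
    start = np.zeros(N_STATES)
    start[START] = 1.0
    visits = np.linalg.solve(np.eye(N_STATES) - GAMMA * P.T, start)
    coeffs = np.zeros(N_STATES * len(ACTIONS))
    for s in range(N_STATES):
        coeffs[s * len(ACTIONS) + ACTIONS.index(policy[s])] = visits[s]
    return coeffs

all_policies = list(itertools.product(ACTIONS, repeat=N_STATES))  # policy[s] = action at s
soap = [("cw",) * N_STATES, ("ccw",) * N_STATES]
others = [p for p in all_policies if p not in soap]

# Feasibility LP over reward variables R(s, a) in [-1, 1]: every acceptable
# policy must beat every other deterministic policy by at least MARGIN.
A_ub = [value_coeffs(bad) - value_coeffs(good) for good in soap for bad in others]
b_ub = [-MARGIN] * len(A_ub)
result = linprog(c=np.zeros(N_STATES * len(ACTIONS)),
                 A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                 bounds=[(-1.0, 1.0)] * (N_STATES * len(ACTIONS)))

if result.success:
    print("Found a Markov reward realizing the SOAP:", np.round(result.x, 3))
else:
    print("No Markov reward (within these bounds and margin) realizes the SOAP.")
```

Note that this sketch enumerates every deterministic policy, which scales exponentially with the number of states; it is only meant to convey the flavor of "decide whether a reward exists, and if so output one," not to stand in for the paper's actual construction.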

This work identifies preliminary pathways toward understanding the scope of the reward hypothesis, but much remains to be done to generalize these findings beyond finite environments, Markov rewards, and the simple notions of ‘task’ and ‘expressive’. We hope that this work will provide new conceptual perspectives on reward and its place in reinforcement learning.
