On the expressivity of Markov reward

Estimated read time: 4 min

Reward is the driving force for reinforcement learning (RL) agents. Because of its central role in RL, reward is often assumed to be suitably general in what it can express, as summed up in the reward hypothesis of Sutton and Littman:

“…all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward).”

– Sutton (2004), Littman (2017)
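
Written out in the usual notation (a discounted, infinite-horizon formulation is assumed here purely for illustration; the hypothesis itself is stated more generally), the quantity being maximized is the expected return:

```latex
% The return G_t is the cumulative (discounted) sum of the scalar reward signal;
% the reward hypothesis says that maximizing its expected value is expressive
% enough to capture all of what we mean by goals and purposes.
G_t \;=\; \sum_{k=0}^{\infty} \gamma^{k}\, R_{t+k+1},
\qquad
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\, G_0 \,\right],
\qquad \gamma \in [0, 1).
```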

In our work, we are taking the first steps towards a systematic study of this hypothesis. To do this, we consider the following thought experiment involving Alice, the designer, and Bob, the learning agent:

Suppose Alice thinks of a task she would like Bob to learn to solve – the task could take the form of a natural-language description (“balance this pole”), an imagined state of affairs (“reach any of the winning configurations of a chessboard”), or something more traditional like a reward or value function. We then imagine Alice translating her choice of task into some generator of a learning signal (such as reward) that Bob, the learning agent, will use to learn from throughout his lifetime. We ground our study of the reward hypothesis in the following question: given Alice’s choice of task, is there always a reward function that can convey that task to Bob?

What is a task?

To make our study of this question concrete, we first restrict our focus to three kinds of task. In particular, we introduce three task types that we believe capture sensible kinds of tasks: 1) a set of acceptable policies (SOAP), 2) a policy order (PO), and 3) a trajectory order (TO). These three task types are concrete instances of the kinds of task we might want an agent to learn to solve.
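
As a rough, hedged illustration of these three objects (the type names and representations below are my own sketch, not the paper's formal definitions, and the two orders are simplified to ordered tiers of equally preferred items), one could represent them along the following lines:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = Tuple[int, int]        # e.g. an (x, y) cell in a grid world
Action = str                   # e.g. "up", "down", "left", "right"
Policy = Dict[State, Action]   # a deterministic Markov policy
Trajectory = Tuple[Tuple[State, Action], ...]   # a finite path through the world


@dataclass
class SOAP:
    """Set Of Acceptable Policies: the task is solved by any policy in the set."""
    acceptable: List[Policy]


@dataclass
class PolicyOrder:
    """Policy Order: a preference ranking over policies (most-preferred tier first)."""
    tiers: List[List[Policy]]


@dataclass
class TrajectoryOrder:
    """Trajectory Order: a preference ranking over trajectories rather than policies."""
    tiers: List[List[Trajectory]]
```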

We next study whether reward is able to capture each of these task types in finite environments. Crucially, we restrict attention to Markov reward functions: for example, given a state space that is sufficient to form a task, such as (x, y) pairs in a grid world, is there a reward function defined only on this same state space that can capture the task?
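
To make the Markov restriction concrete, here is a minimal sketch (the function names and grid-world types are illustrative, not from the paper): a Markov reward may look only at the current transition, so it has no way to consult the history that produced it.

```python
from typing import Dict, Tuple

State = Tuple[int, int]   # an (x, y) cell in the grid world
Action = str

# A Markov reward: a fixed table over the current state and action only.
def markov_reward(table: Dict[Tuple[State, Action], float],
                  state: State, action: Action) -> float:
    return table.get((state, action), 0.0)

# A history-based signal like this one is *not* Markov: it conditions on the
# whole path so far, not just the current step.
def history_based_signal(history: Tuple[State, ...], action: Action) -> float:
    # Placeholder property of the *path*, e.g. "the agent has kept moving in one
    # direction so far"; a Markov reward cannot express a condition like this,
    # because it never sees the history.
    kept_one_direction = len(set(history)) == len(history)   # no cell revisited yet
    return 1.0 if kept_one_direction else 0.0
```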

First major finding

Our first main result shows that, for each of the three task types, there are environment–task pairs for which no Markov reward function can capture the task. One example of such a pair is the “navigate around the grid clockwise or counterclockwise” task in a typical grid world.

This task is naturally captured by a SOAP consisting of two acceptable policies: the “clockwise” policy and the “counterclockwise” policy. For a Markov reward function to express this task, it would need to make exactly these two policies strictly higher in value than all other deterministic policies. However, no such Markov reward function exists: the optimality of a single “move clockwise” action depends on whether the agent was already moving in that direction in the past. Since the reward function must be Markov, it cannot convey this kind of information. Similar examples show that Markov reward cannot capture every policy order and trajectory order either.

Second major finding

Since some tasks can be captured and others cannot, we next explore whether there is an efficient procedure for determining whether a given task can be captured by reward in a given environment. Moreover, if there is a reward function that captures the given task, we would ideally like to be able to output that reward function. Our second main result is a positive one: for any task–environment pair, with the task drawn from the three types above and the environment finite, there is a procedure that can 1) decide whether the task can be captured by Markov reward in the given environment, and 2) output the desired reward function that exactly conveys the task, when such a function exists.
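
To give a feel for what such a procedure might involve, below is a brute-force sketch for the SOAP case in a tiny, hand-built deterministic ring world (four states, "cw"/"ccw" actions, a single start state, rewards bounded to [-1, 1], and a fixed strictness margin). All of these modelling choices, and the use of a linear program via scipy.optimize.linprog, are my own illustration; this is not the procedure from the paper. The sketch relies on the fact that a policy's value is linear in the reward function, so searching for a realizing reward becomes a linear feasibility problem.

```python
import itertools
import numpy as np
from scipy.optimize import linprog   # assumption: SciPy is available

# A tiny deterministic "ring" world: states 0..3, actions move clockwise or
# counterclockwise. The SOAP contains exactly the two policies that always go
# one way around the ring.
N_STATES, ACTIONS = 4, ("cw", "ccw")
GAMMA, START, MARGIN = 0.9, 0, 1e-3

def step(s, a):
    return (s + 1) % N_STATES if a == "cw" else (s - 1) % N_STATES

def value_coeffs(policy):
    """Coefficients c such that V^pi(START) = c . R, where R is the flattened
    reward table R(s, a); computed from discounted state-visitation counts."""
    P = np.zeros((N_STATES, N_STATES))
    for s in range(N_STATES):
        P[s, step(s, policy[s])] = 1.0
    start = np.zeros(N_STATES)
    start[START] = 1.0
    visits = np.linalg.solve(np.eye(N_STATES) - GAMMA * P.T, start)
    coeffs = np.zeros(N_STATES * len(ACTIONS))
    for s in range(N_STATES):
        coeffs[s * len(ACTIONS) + ACTIONS.index(policy[s])] = visits[s]
    return coeffs

all_policies = list(itertools.product(ACTIONS, repeat=N_STATES))  # policy[s] = action at s
soap = [("cw",) * N_STATES, ("ccw",) * N_STATES]
others = [p for p in all_policies if p not in soap]

# Feasibility LP over reward variables R(s, a) in [-1, 1]: every acceptable
# policy must beat every other deterministic policy by at least MARGIN.
A_ub = [value_coeffs(bad) - value_coeffs(good) for good in soap for bad in others]
b_ub = [-MARGIN] * len(A_ub)
result = linprog(c=np.zeros(N_STATES * len(ACTIONS)),
                 A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                 bounds=[(-1.0, 1.0)] * (N_STATES * len(ACTIONS)))

if result.success:
    print("Found a Markov reward realizing the SOAP:", np.round(result.x, 3))
else:
    print("No Markov reward (within these bounds and margin) realizes the SOAP.")
```

Note that this sketch enumerates every deterministic policy, which scales exponentially with the number of states; it is only meant to convey the flavor of "decide whether a reward exists, and if so output one," not to stand in for the paper's actual construction.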

This work identifies preliminary pathways toward understanding the scope of the reward hypothesis, but much remains to be done to generalize these findings beyond finite environments, Markov rewards, and the simple notions of ‘task’ and ‘expressive’. We hope that this work will provide new conceptual perspectives on reward and its place in reinforcement learning.
