Is curiosity all you need? On the usefulness of behaviors arising from curious exploration

Estimated read time: 5 min


During curious exploration, the JACO arm discovers how to pick up and move cubes around the workspace and even explores whether they can be balanced on their edges.

Curious exploration enables the OP3 to walk upright, balance on one foot, sit up, and even catch itself safely when falling backwards, all without a specific target task to optimize for.

Intrinsic motivation (1, 2) can be a powerful concept for equipping an agent with a mechanism to continuously explore its environment in the absence of task information. One common way to implement intrinsic motivation is curiosity learning (3, 4). With this method, a predictive model of the environment's response to the agent's actions, often called a world model, is trained alongside the agent's policy. When an action is taken, the world model predicts the agent's next observation. This prediction is then compared with the observation the agent actually makes. Crucially, the reward the agent receives for taking this action is scaled by the error it made when anticipating the next observation. In this way, the agent is rewarded for taking actions whose outcomes are not yet well predictable. At the same time, the world model is updated to better predict the outcomes of those actions.
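As a rough sketch of this reward signal, the intrinsic reward can be computed as the world model's prediction error. The `predict` interface and the squared-error measure below are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def curiosity_reward(world_model, obs, action, next_obs):
    """Intrinsic reward proportional to how badly the world model
    anticipated the next observation (hypothetical `predict` interface)."""
    predicted_next_obs = world_model.predict(obs, action)
    # The prediction error (here, squared L2 distance) is the curiosity reward:
    # poorly predicted transitions are the most rewarding to revisit.
    return float(np.sum((np.asarray(predicted_next_obs) - np.asarray(next_obs)) ** 2))
```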

This mechanism has been applied successfully in on-policy settings, for example to beat 2D computer games in an unsupervised manner (4) or to train a general policy that can easily be adapted to concrete downstream tasks (5). However, we believe that the real strength of curiosity learning lies in the diverse behavior that emerges during the curious exploration process: as the curiosity objective changes, so does the resulting behavior of the agent, and it thereby discovers many complex policies that could be used later, if they were retained and not overwritten.

In this paper, we make two contributions to the study of curiosity learning and the harnessing of its emergent behavior. First, we introduce SelMo, an off-policy realization of a self-motivated, curiosity-based method for exploration. We show that, using SelMo, meaningful and diverse behavior emerges solely from optimizing the curiosity objective in simulated manipulation and locomotion domains. Second, we propose to extend the focus in the application of curiosity learning towards the identification and retention of emergent intermediate behaviors. We support this conjecture with an experiment that reloads self-discovered behaviors as pretrained auxiliary skills in a hierarchical reinforcement learning setup.

SelMo control flow: the agent (actor) collects trajectories in the environment with its current policy and stores them in the model replay buffer on the left. The world model regularly samples from that buffer and updates its parameters for forward prediction using stochastic gradient descent (SGD). The sampled trajectories are assigned curiosity rewards based on their prediction error under the current world model and are then passed to the policy replay buffer on the right. Maximum a posteriori policy optimization (MPO) (6) is used to fit the Q-function and the policy based on samples from the policy replay buffer. The resulting, updated policy is synced back to the actor.
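The control flow described above can be summarized in a short training-loop sketch. All names below (the actor, world model, MPO learner, and buffer methods, plus `curiosity_reward` from the earlier snippet) are illustrative placeholders for this schematic, not the authors' actual API.

```python
def selmo_training_loop(env, actor, world_model, mpo_learner,
                        model_buffer, policy_buffer, num_iterations):
    """Schematic SelMo loop: collect -> update world model -> relabel with
    curiosity rewards -> update policy with MPO -> sync policy to actor."""
    for _ in range(num_iterations):
        # 1. The actor collects a trajectory with its current policy and
        #    stores it in the model replay buffer.
        model_buffer.add(actor.collect_trajectory(env))

        # 2. The world model samples that buffer and takes an SGD step on its
        #    forward-prediction objective.
        world_model.sgd_update(model_buffer.sample())

        # 3. Sampled trajectories are labeled with curiosity rewards (their
        #    prediction error under the current world model) and pushed into
        #    the policy replay buffer.
        for obs, action, next_obs in model_buffer.sample_transitions():
            reward = curiosity_reward(world_model, obs, action, next_obs)
            policy_buffer.add((obs, action, reward, next_obs))

        # 4. MPO fits the Q-function and policy from the policy replay buffer;
        #    the updated policy is synced back to the actor.
        actor.sync_policy(mpo_learner.update(policy_buffer.sample()))
```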

We run SelMo in two simulated continuous-control robotic domains: a 6-DoF JACO arm with a three-fingered gripper, and a 20-DoF humanoid robot, the OP3. The two platforms present challenging learning environments for object manipulation and locomotion, respectively. While optimizing solely for curiosity, we observe the emergence of complex, human-interpretable behavior over the course of training. For example, JACO learns to pick up and move cubes without any supervision, and the OP3 learns to balance on one foot or to sit down safely without falling over.

Examples from JACO and OP3 training runs. While optimizing the curiosity objective, the agents exhibit complex and purposeful behavior in both the manipulation and the locomotion setting. The full videos can be found at the top of this page.

However, the interesting behaviors observed during curious exploration have one critical drawback: they are not persistent, because they keep changing along with the curiosity reward function. As the agent keeps repeating a certain behavior, such as JACO lifting the red cube, the curiosity rewards accrued by this policy diminish. Consequently, learning drives the policy towards behaviors that again yield higher curiosity rewards, such as moving the cube out of the workspace or even attending to the other cube. But this new behavior overwrites the old one. We believe, however, that retaining the behaviors which emerge from curious exploration equips the agent with a valuable skill set for learning new tasks more quickly. To investigate this conjecture, we set up an experiment to probe the usefulness of the self-discovered skills.
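The diminishing reward for repeated behavior can be illustrated with a toy example: fitting a simple linear world model to the same transition over and over drives the prediction error, and therefore the curiosity reward, towards zero. The model, data, and learning rate below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))   # toy linear world-model parameters
x = rng.normal(size=4)        # a fixed (observation, action) encoding
y = rng.normal(size=4)        # the true next observation for that action

for step in range(5):
    error = W @ x - y
    print(f"step {step}: curiosity reward = {float(np.sum(error ** 2)):.4f}")
    # One SGD step on the squared prediction error: as the model learns the
    # repeated transition, the reward for revisiting it shrinks.
    W -= 0.05 * np.outer(2 * error, x)
```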

We treat randomly sampled policy snapshots from different phases of curious exploration as auxiliary skills in a hierarchical learning framework (7) and measure how quickly a new target skill is learned with the help of those auxiliaries. For the JACO arm, we set the target task to "lift the red cube" and use five randomly sampled self-discovered behaviors as auxiliaries. We compare the learning of this downstream task against an SAC-X baseline (8), which uses hand-designed auxiliary reward functions for reaching and moving the red cube that ultimately facilitate learning to lift it as well. We find that even this simple setup for skill reuse already accelerates learning progress on the target task to a degree comparable with the hand-designed reward approach. The results indicate that the automatic identification and retention of useful emergent behavior from curious exploration is a fruitful avenue for future investigation in unsupervised reinforcement learning.
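A minimal sketch of this kind of skill reuse is shown below, assuming the self-discovered behaviors are available as frozen policy callables. The class, the random high-level choice, and the `explore_prob` parameter are illustrative simplifications; the actual framework learns when to schedule which skill.

```python
import random

class SkillReuseAgent:
    """Toy hierarchical agent: a learnable target policy plus frozen
    auxiliary skills sampled from curious exploration (illustrative sketch)."""

    def __init__(self, auxiliary_skills, target_policy):
        self.auxiliary_skills = auxiliary_skills  # e.g. five frozen snapshots
        self.target_policy = target_policy        # learns "lift the red cube"

    def act(self, observation, explore_prob=0.5):
        # High-level choice: execute a pretrained auxiliary skill to generate
        # useful exploration data, or act with the target policy being learned.
        if random.random() < explore_prob:
            return random.choice(self.auxiliary_skills)(observation)
        return self.target_policy(observation)
```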
