Generally capable agents emerge from open-ended play

Estimated read time: 10 min

In recent years, AI agents have succeeded in a range of complex game environments. For example, AlphaZero beat world-champion programs in chess, shogi, and Go after starting out knowing no more than the basic rules of how to play. Through reinforcement learning (RL), this single system learned by playing round after round of games in an iterative process of trial and error. But AlphaZero still trained separately on each game, unable to simply learn another game or task without repeating the RL process from scratch. The same is true of RL’s other successes, such as Atari, Capture the Flag, StarCraft II, Dota 2, and Hide-and-Seek. DeepMind’s mission of solving intelligence to advance science and humanity led us to explore how we could overcome this limitation and create AI agents with more general, adaptive behavior. Instead of learning one game at a time, these agents would be able to react to completely new conditions and play a whole universe of games and tasks, including ones never seen before.

Today, we published Open-Ended Learning Leads to Generally Capable Agents, a preprint detailing our first steps toward training an agent capable of playing many different games without needing human interaction data. We created a vast game environment we call XLand, which includes many multiplayer games within consistent, human-relatable 3D worlds. This environment makes it possible to formulate new learning algorithms that dynamically control how an agent trains and the games on which it trains. The agent’s capabilities improve iteratively in response to the challenges that arise in training, with the learning process continually refining the training tasks so the agent never stops learning. The result is an agent with the ability to succeed at a wide spectrum of tasks – from simple object-finding problems to complex games like hide-and-seek and capture-the-flag – that it never encountered during training. We find the agent exhibits general, heuristic behaviors such as experimentation, behaviors that are widely applicable to many tasks rather than specialized to an individual task. This new approach marks an important step toward creating more general agents with the flexibility to adapt rapidly within constantly changing environments.

The agent playing a variety of held-out test tasks. The agent was trained across a vast variety of games and as a result is able to generalize to test games it has never seen before in training.

A universe of training tasks

A lack of training data – where the ‘data’ points are different tasks – has been one of the major factors limiting the behavior of RL-trained agents from being general enough to apply across games. Without being able to train agents on a sufficiently wide range of tasks, agents trained with RL have been unable to adapt their learned behaviors to new tasks. But by designing a simulated space capable of procedurally generating tasks, our team created a way to train on, and generate experience from, tasks that are created programmatically. This enables us to include billions of tasks in XLand, across varied games, worlds, and players.

Our agents inhabit 3D first-person avatars in a multiplayer environment meant to simulate the physical world. The players sense their surroundings by observing RGB images, receive a text description of their goal, and train on a range of games. These games can be as simple as cooperative games of finding objects and navigating worlds, where a player’s goal could be to “be near the purple cube.” More complex games can be based on choosing from multiple rewarding options, such as “be near the purple cube or put the yellow ball on the red floor,” while more competitive games involve playing against co-players, such as symmetric hide-and-seek, where each player has the goal “see the opponent and make the opponent not see me.” Each game defines the rewards for the players, and each player’s ultimate objective is to maximize the rewards.
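
To make the structure of these goals concrete, here is a minimal sketch, assuming a goal can be written as a choice between options, each a conjunction of atomic predicates. The class, field, and function names are hypothetical illustrations, not XLand’s actual representation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical sketch: a goal as a disjunction of options, each option a
# conjunction of atomic predicates over objects in the world.

@dataclass(frozen=True)
class Predicate:
    relation: str                 # e.g. "near", "on"
    subject: str                  # e.g. "me", "yellow ball"
    target: Optional[str] = None  # e.g. "purple cube", "red floor"

# A goal is a list of options; the player is rewarded while any one
# option's predicates all hold at once.
Goal = List[List[Predicate]]

example_goal: Goal = [
    [Predicate("near", "me", "purple cube")],
    [Predicate("on", "yellow ball", "red floor")],
]

def reward(goal: Goal, holds: Callable[[Predicate], bool]) -> float:
    """Return 1.0 if any option's predicates all currently hold, else 0.0.

    `holds` stands in for the simulator's check of a single predicate.
    """
    return 1.0 if any(all(holds(p) for p in option) for option in goal) else 0.0
```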

Because XLand can be programmatically specified, the game space allows data to be generated in an automated and algorithmic fashion. And because the tasks in XLand involve multiple players, the behavior of co-players greatly influences the challenges the AI agent faces. These complex, non-linear interactions create an ideal source of data to train on, since sometimes even small changes in the components of the environment can result in large changes in the challenges for the agents.
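
As a loose illustration of what “programmatically specified” means here, the snippet below sketches sampling a task as a combination of a game, a world, and co-players. The pools, field names, and function are placeholders assumed for illustration, not XLand’s real interfaces.

```python
import random

# Hypothetical sketch: a task combines a game (one goal per player), a
# procedurally generated world, and co-player policies.

def sample_task(game_pool, world_pool, policy_pool, num_coplayers=1, rng=random):
    return {
        "game": rng.choice(game_pool),                        # goals for every player
        "world": rng.choice(world_pool),                      # 3D topology and objects
        "coplayers": rng.sample(policy_pool, num_coplayers),  # behavior of other players
    }
```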

XLand consists of a galaxy of games (seen here as points embedded in 2D, colored and sized based on their properties), with each game playable in many different simulated worlds whose topology and characteristics vary smoothly. An example of an XLand task that combines a game with a world and co-players.

Training methods

Central to our research is the role of deep RL in training the neural networks of our agents. The neural network architecture we use provides an attention mechanism over the agent’s internal recurrent state – helping guide the agent’s attention with estimates of subgoals unique to the game the agent is playing. We found that this goal-attentive agent (GOAT) learns more generally capable policies.
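
A minimal sketch of goal-conditioned attention over a recurrent state is shown below, in the spirit of the mechanism described above. The shapes, projections, and the way subgoal estimates are formed are assumptions for illustration only; this is not the paper’s exact GOAT architecture.

```python
import numpy as np

def goal_attention(hidden, goal_embeddings, d_k=32, rng=np.random.default_rng(0)):
    """hidden: (T, D) recurrent states; goal_embeddings: (G, D), one per subgoal/option."""
    T, D = hidden.shape
    W_q = rng.normal(size=(D, d_k)) / np.sqrt(D)   # query projection (from goals)
    W_k = rng.normal(size=(D, d_k)) / np.sqrt(D)   # key projection (from hidden state)
    W_v = rng.normal(size=(D, d_k)) / np.sqrt(D)   # value projection (from hidden state)

    q = goal_embeddings @ W_q                      # (G, d_k)
    k = hidden @ W_k                               # (T, d_k)
    v = hidden @ W_v                               # (T, d_k)

    scores = q @ k.T / np.sqrt(d_k)                # (G, T)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over time steps

    # One goal-conditioned readout per subgoal; a policy head could consume
    # these alongside the current hidden state.
    return weights @ v                             # (G, d_k)

# Example: 8 timesteps of a 64-dim recurrent state, 3 candidate subgoals.
readouts = goal_attention(np.random.randn(8, 64), np.random.randn(3, 64))
```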

We also explored the question: what distribution of training tasks will produce the best possible agent, especially in such a vast environment? The dynamic task generation we use allows for continual changes to the distribution of the agent’s training tasks: every task is generated to be neither too hard nor too easy, but just right for training. We then use population-based training (PBT) to adjust the parameters of the dynamic task generation based on a fitness that aims to improve the agents’ general capability. And finally, we chain together multiple training runs so each generation of agents can bootstrap off the previous generation.

This leads to a final training process with deep RL at the core, refining the neural networks of agents with every step of experience (a rough sketch of this loop structure follows the list below):

  • Experience is generated on training tasks that are dynamically created in response to the agents’ behavior,
  • Agents’ task-generating functions mutate in response to the agents’ relative performance and robustness,
  • And at the outermost loop, generations of agents bootstrap from each other, seed ever-richer co-players into the multiplayer environment, and redefine the measurement of progression itself.
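
The sketch below lays out these nested loops as hypothetical pseudocode. The agent, task-generator, rollout, and fitness objects are opaque stand-ins; none of the names here come from XLand’s actual implementation.

```python
import random
from typing import Callable, List

def train_generation(
    population: List,        # agents with .task_generator and .update_with_rl
    prev_generation: List,   # frozen agents from the previous generation, used as co-players
    rollout: Callable,       # (task, agent, coplayers) -> trajectory of experience
    fitness: Callable,       # agent -> general-capability score
    num_iterations: int,
) -> List:
    for _ in range(num_iterations):
        # Inner loop: experience comes from dynamically generated tasks
        # matched to each agent's current abilities.
        for agent in population:
            task = agent.task_generator.sample()   # game + world + co-player slots
            coplayers = random.sample(prev_generation + population, k=task.num_coplayers)
            agent.update_with_rl(rollout(task, agent, coplayers))  # deep RL update

        # Middle loop (PBT): mutate the task-generation parameters of less fit
        # agents toward those of more generally capable, robust agents.
        ranked = sorted(population, key=fitness)
        half = len(ranked) // 2
        for weak, strong in zip(ranked[:half], ranked[half:]):
            weak.task_generator = strong.task_generator.mutate()

    # Outer loop (outside this function): the returned generation becomes
    # prev_generation for the next one, seeding ever-richer co-players.
    return population
```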

The training process starts from scratch and iteratively builds complexity, constantly changing the learning problem to keep the agent learning. The iterative nature of the combined learning system, which optimizes not a bounded performance metric but an iteratively defined spectrum of general capability, leads to a potentially open-ended learning process for agents, limited only by the expressivity of the environment space and the agent’s neural network.

The agent learning process consists of dynamics on multiple time scales.

Measuring progress

To measure agents’ performance in this vast universe, we create a set of evaluation tasks using games and worlds that are kept separate from the data used for training. These “held-out” tasks include specifically human-designed tasks such as hide-and-seek and capture-the-flag.

Because of the size of XLand, understanding and characterizing the performance of our agents can be a challenge. Each task involves a different level of complexity, a different scale of achievable rewards, and different capabilities of the agent, so merely averaging the reward over held-out tasks would hide the actual differences in complexity and rewards – and would effectively treat all tasks as equally interesting, which isn’t necessarily true of procedurally generated environments.

To get around these limitations, we take a different approach. First, we normalize scores per task using the Nash equilibrium value computed with our current set of trained players. Second, we take into account the entire distribution of normalized scores – rather than looking at average normalized scores, we look at the different percentiles of normalized scores – as well as the percentage of tasks in which the agent scores at least one step of reward: participation. This means an agent is considered better than another agent only if it exceeds performance on all percentiles. This approach to measurement gives us a meaningful way to assess our agents’ performance and robustness.
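
As an illustration of this evaluation idea, the sketch below normalizes per-task scores against a precomputed Nash-equilibrium baseline and summarizes them by percentiles plus participation. The array names, chosen percentiles, and the comparison rule are simplifying assumptions, not the paper’s exact procedure.

```python
import numpy as np

def normalised_scores(raw_scores, nash_values, eps=1e-8):
    """Per-task scores divided by the task's Nash-equilibrium baseline value."""
    return np.asarray(raw_scores, dtype=float) / (np.asarray(nash_values, dtype=float) + eps)

def summarise(norm_scores, percentiles=(10, 20, 50)):
    """Summarize an agent by percentiles of normalized score and by participation."""
    norm_scores = np.asarray(norm_scores)
    summary = {f"p{p}": float(np.percentile(norm_scores, p)) for p in percentiles}
    summary["participation"] = float(np.mean(norm_scores > 0.0))  # tasks with any reward
    return summary

def strictly_better(agent_a, agent_b):
    """Agent A counts as better than B only if it matches or exceeds B on every statistic."""
    return all(agent_a[k] >= agent_b[k] for k in agent_b)
```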

Generally capable agents

After training our agents for five generations, we saw consistent improvements in learning and performance across our held-out evaluation space. Playing nearly 700,000 unique games in 4,000 unique worlds within XLand, each agent in the final generation experienced 200 billion training steps as the result of 3.4 million unique tasks. At this point, our agents have been able to participate in every procedurally generated evaluation task except for a handful that were impossible even for a human. And the results we see clearly exhibit general, zero-shot behavior across the task space – with the frontier of normalized score percentiles continually improving.

Learning progress of the latest generation of our agents, showing how our test metrics progress through time, which also translates to zero-shot performance on hand-authored held-out test tasks.

Looking qualitatively at our agents, we often see general, heuristic behaviors emerge – rather than highly optimized, specific behaviors for individual tasks. Instead of agents knowing exactly the “best thing” to do in a new situation, we see evidence of agents experimenting and changing the state of the world until they reach a rewarding state. We also see agents rely on the use of other tools, including objects to occlude visibility, to create ramps, and to retrieve other objects. Because the environment is multiplayer, we can examine the progression of agent behaviors during training on classic social dilemmas, such as a game of “chicken.” As training progresses, our agents appear to exhibit more cooperative behavior when playing with copies of themselves. Given the nature of the environment, it is difficult to pinpoint intentionality – the behaviors we see often appear to be accidental, but still we see them occur consistently.

Above: What types of behavior emerge? (1) Agents exhibit the ability to switch which option they go for as the tactical situation unfolds. (2) Agents show glimpses of tool use, such as creating ramps. (3) Agents learn a generic trial-and-error experimentation behavior, stopping when they recognize the correct state has been found. Below: Multiple ways the same agents can use objects to reach the goal purple pyramid in this hand-authored probe task.

By analyzing the agent’s internal representations, we can say that by taking this approach to reinforcement learning in a vast task space, our agents are aware of the basics of their bodies and the passage of time, and that they understand the high-level structure of the games they encounter. Perhaps even more interestingly, they clearly recognize the reward states of their environment. This generality and diversity of behavior in new tasks hints at the potential to fine-tune these agents on downstream tasks. For instance, we show in the technical paper that with just 30 minutes of focused training on a newly presented complex task, the agents can quickly adapt, whereas agents trained with RL from scratch cannot learn these tasks at all.

By developing an environment like XLand and new training algorithms that support the open-ended creation of complexity, we’ve seen clear signs of zero-shot generalization from RL agents. While these agents are starting to be generally capable within this task space, we look forward to continuing our research and development to further improve their performance and create ever more adaptive agents.

For more details, see the preprint of our technical paper – and videos of the results we’ve seen. We hope this could help other researchers likewise see a new path toward creating more adaptive, generally capable AI agents. If you’re excited by these advances, consider joining our team.
