Building architectures that can handle the world's data

Estimated read time: 5 min


Perceiver and Perceiver IO serve as multipurpose AI tools

Most architectures used by AI systems today are specialists. A 2D residual network may be a good choice for processing images, but it is at best a loose fit for other kinds of data, such as the Lidar signals used in self-driving cars or the torques used in robotics. What's more, standard architectures are often designed with only one task in mind, often leading engineers to bend over backwards to reshape, distort, or otherwise modify their inputs and outputs in the hope that a standard architecture will learn to handle their problem correctly. Dealing with more than one kind of data, like the sounds and images that make up videos, is even harder and usually involves complex, hand-tuned systems built from many different parts, even for simple tasks. As part of DeepMind's mission of solving intelligence to advance science and humanity, we want to build systems that can solve problems involving many types of inputs and outputs, so we've begun to explore a general-purpose, versatile architecture that can handle all kinds of data.

Figure 1. The Perceiver IO architecture maps input arrays to output arrays via a small latent array, allowing it to scale gracefully even for very large inputs and outputs. Perceiver IO uses a global attention mechanism that generalizes across many different kinds of data.

In a paper presented at ICML 2021 (the International Conference on Machine Learning) and published as a preprint on arXiv, we introduced the Perceiver, a general-purpose architecture that can process many kinds of data, including images, point clouds, audio, video, and their combinations. While the Perceiver could handle many types of input data, it was limited to tasks with simple outputs, such as classification. In a new preprint on arXiv, we describe Perceiver IO, a more general version of the Perceiver architecture. Perceiver IO can produce a wide variety of outputs from many different inputs, making it applicable to real-world domains such as language, vision, and multimodal understanding, as well as to challenging games such as StarCraft II. To help researchers and the machine learning community at large, we've now open-sourced the code.

Figure 2. Perceiver IO processes language by deciding which of its inputs to attend to first. The model learns to use several different strategies: some parts of the latent array attend to specific places in the input, while others attend to specific characters such as punctuation marks.

Perceivers build on the Transformer, an architecture that uses an operation called "attention" to map inputs into outputs. By comparing all elements of the input, Transformers process inputs based on their relationships with each other and with the task. Attention is simple and broadly applicable, but Transformers use attention in a way that quickly becomes expensive as the number of inputs grows. This means Transformers work well for inputs with at most a few thousand elements, while common forms of data such as images, videos, and books can easily contain millions of elements. With the original Perceiver, we solved a major problem for a general-purpose architecture: scaling the Transformer's attention operation to very large inputs without introducing domain-specific assumptions. The Perceiver does this by first using attention to encode the inputs into a small latent array. This latent array can then be processed further at a cost independent of the input's size, so the Perceiver's memory and computational needs grow gracefully as the input grows larger, even for especially deep models.
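To make that scaling argument concrete, here is a minimal sketch of the encode step in JAX. This is not DeepMind's released implementation (which uses multi-head attention with learned projections and a stack of latent self-attention layers); the single-head, projection-free `cross_attend` function and the shapes below are illustrative simplifications, but they show why the cost of reading the input is linear, rather than quadratic, in the input length.

```python
# A minimal sketch (not the released Perceiver code) of the encode step:
# a small latent array cross-attends to a large, flattened input array.
import jax
import jax.numpy as jnp

def cross_attend(latents, inputs):
    """Single-head cross-attention: the latents query the inputs.

    latents: [num_latents, d]  -- small, e.g. a few hundred rows
    inputs:  [num_inputs, d]   -- large, potentially millions of rows
    Cost is O(num_latents * num_inputs): linear in the input length,
    versus O(num_inputs**2) for self-attention over the raw input.
    """
    d = latents.shape[-1]
    scores = latents @ inputs.T / jnp.sqrt(d)   # [num_latents, num_inputs]
    weights = jax.nn.softmax(scores, axis=-1)   # attention over the input
    return weights @ inputs                     # [num_latents, d]

k1, k2 = jax.random.split(jax.random.PRNGKey(0))
latents = jax.random.normal(k1, (256, 64))      # small learned latent array
inputs = jax.random.normal(k2, (50_000, 64))    # large flattened input
encoded = cross_attend(latents, inputs)         # [256, 64]
```

Because the latent array stays small, every subsequent self-attention layer operates on 256 rows here, regardless of whether the input had fifty thousand elements or fifty million.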

Figure 3. Perceiver IO produces state-of-the-art results on the challenging task of optical flow estimation, which involves tracking the motion of all the pixels in an image. The color of each pixel shows the direction and speed of motion estimated by Perceiver IO, as indicated in the legend above.

This graceful scaling allows the Perceiver to achieve an unprecedented level of generality: it is competitive with domain-specific models on benchmarks based on images, 3D point clouds, and audio and images together. But because the original Perceiver produced only one output per input, it wasn't as versatile as researchers needed. Perceiver IO fixes this problem by using attention not only to encode inputs into a latent array, but also to decode outputs from it, giving the network a great deal of flexibility. Perceiver IO now scales to a wide variety of inputs and outputs, and can even handle many tasks or types of data at once. This opens the door to all sorts of applications, such as understanding the meaning of a text from each of its characters, tracking the movement of all points in an image, processing the sound, images, and labels that make up a video, and even playing games, all while using a single architecture that's simpler than the alternatives.
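The decode step can be sketched in the same simplified style as the encode sketch above. An output query array, with one query per desired output element, cross-attends to the processed latent array; the names and shapes below are illustrative assumptions rather than the released API. The point is that the output size, like the input size, enters the cost only linearly, which is what lets Perceiver IO produce large, structured outputs such as per-pixel flow maps.

```python
# A minimal sketch (same simplified single-head attention as above) of the
# Perceiver IO decode step: output queries read from the processed latents.
import jax
import jax.numpy as jnp

def decode(output_queries, latents):
    """output_queries: [num_outputs, d] -- one row per desired output element
    latents:        [num_latents, d] -- the processed latent array
    Cost is O(num_outputs * num_latents): linear in the output size.
    """
    d = latents.shape[-1]
    scores = output_queries @ latents.T / jnp.sqrt(d)  # [num_outputs, num_latents]
    weights = jax.nn.softmax(scores, axis=-1)
    return weights @ latents                           # [num_outputs, d]

k1, k2 = jax.random.split(jax.random.PRNGKey(1))
latents = jax.random.normal(k1, (256, 64))     # output of the latent stack
# Hypothetical example: one query per pixel of a 100x100 flow map, flattened.
queries = jax.random.normal(k2, (10_000, 64))
outputs = decode(queries, latents)             # [10_000, 64]
```

Changing the task then amounts to changing the query array: per-pixel queries for optical flow, per-token queries for language, and so on, without altering the encoder or the latent stack.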

In our experiments, we've seen Perceiver IO work across a wide range of challenging domains, including language, vision, multimodal data, and games, providing an off-the-shelf way to handle many kinds of data. We hope our latest preprint and the code available on GitHub help researchers and practitioners tackle problems without having to invest the time and effort of building custom solutions from specialized systems. As we continue to learn from exploring new kinds of data, we look forward to further improving this general-purpose architecture, making it faster and easier to solve problems throughout science and machine learning.

