Our brain has a remarkable ability to process visual information. We can take one glance at a complex scene, and within milliseconds parse it into objects and their attributes, such as color or size, and use that information to describe the scene in simple language. Underlying this seemingly effortless ability is a complex computation performed by our visual cortex, which takes millions of nerve impulses transmitted from the retina and converts them into a more meaningful form that can be mapped to simple language descriptions. To fully understand how this process works in the brain, we need to figure out both how semantically meaningful information is represented in the firing of neurons at the end of the visual processing hierarchy, and how such a representation may be learned from largely untaught experience.

To answer these questions in the context of face perception, we teamed up with our collaborators at Caltech (Doris Tsao) and the Chinese Academy of Sciences (Le Chang). We chose faces because they are well studied in the neuroscience community and are often seen as a “microcosm of object recognition.” In particular, we wanted to compare the responses of single cortical neurons in face patches at the end of the visual processing hierarchy, recorded by our collaborators, to the units of a recently emerged class of “disentangling” deep neural networks that, unlike the usual “black-box” systems, explicitly aim to be interpretable to humans. A disentangling neural network learns to map complex images onto a small number of internal neurons (called latent units), each one representing a single semantically meaningful attribute of the scene, such as the color or size of an object (see Figure 1). Unlike “black-box” deep classifiers trained to recognize visual objects through a biologically unrealistic amount of external supervision, these disentangling models are trained without an external teaching signal, using a self-supervised objective of reconstructing input images (generation in Figure 1) from their learned latent representation (obtained through inference in Figure 1).
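To make the inference/generation structure in Figure 1 concrete, here is a minimal sketch of such a disentangling autoencoder in PyTorch. The layer sizes, number of latent units, and fully connected architecture are illustrative assumptions, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class DisentanglingAutoencoder(nn.Module):
    """Minimal sketch of the inference/generation structure in Figure 1 (details hypothetical)."""

    def __init__(self, n_pixels=64 * 64, n_latents=10):
        super().__init__()
        # Inference: compress an image into a handful of latent units
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(n_pixels, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, n_latents)       # mean of each latent unit
        self.to_log_var = nn.Linear(256, n_latents)  # uncertainty of each latent unit
        # Generation: reconstruct the image from the latent units alone
        self.decoder = nn.Sequential(nn.Linear(n_latents, 256), nn.ReLU(),
                                     nn.Linear(256, n_pixels), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = self.to_mu(h), self.to_log_var(h)
        # Reparameterization: sample the latents, then try to reconstruct the input
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)
        return self.decoder(z), mu, log_var
```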
Disentangling was first proposed in the machine learning community nearly ten years ago as a key ingredient for building more data-efficient, transferable, fair, and imaginative AI systems. However, for years, building a model that could disentangle in practice eluded the field. The first model capable of doing this successfully and robustly, called β-VAE, was developed by taking inspiration from neuroscience: β-VAE learns by predicting its own input; it requires a visual experience similar to that of young children for successful learning; and its learned latent representation mirrors known properties of the visual brain.
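For readers curious how β-VAE differs from a plain autoencoder, the sketch below shows the standard β-VAE objective: a reconstruction term plus a KL term weighted by β > 1, which is what pressures the latent units toward a disentangled representation. The Bernoulli (binary cross-entropy) reconstruction likelihood and the particular value of β are assumptions for illustration, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    """beta-VAE objective: reconstruction + beta-weighted KL to an isotropic Gaussian prior.

    x, x_recon: input and reconstructed images (same shape, values in [0, 1])
    mu, log_var: parameters of the approximate posterior q(z|x) from the encoder
    beta > 1 encourages the latent units to disentangle.
    """
    # Pixel-wise reconstruction term (Bernoulli likelihood -> binary cross-entropy)
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum") / x.size(0)
    # KL( q(z|x) || N(0, I) ), closed form for diagonal Gaussians
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp()) / x.size(0)
    return recon + beta * kl
```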
In our new paper, we measured how similar the disentangled units discovered by a β-VAE trained on a dataset of face images were to the responses of single neurons at the end of the visual processing hierarchy, recorded in primates looking at the same faces. The neural data were collected by our collaborators under strict oversight from the Caltech Institutional Animal Care and Use Committee. When we made the comparison, we found something surprising: the handful of disentangled units discovered by β-VAE behaved as if they were equivalent to a similarly sized subset of the real neurons. When we looked more closely, we found a strong one-to-one mapping between the real neurons and the artificial units (see Figure 2). This mapping was much stronger than that for alternative models, including the deep classifiers previously considered state-of-the-art computational models of visual processing, or a handcrafted model of face perception regarded as the “gold standard” in the neuroscience community. Not only that, the β-VAE units were encoding semantically meaningful information such as age, gender, eye size, or the presence of a smile, enabling us to understand what attributes single neurons in the brain use to represent faces.
[Figure 2]
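The paper uses its own quantitative analysis for matching model units to neurons; the following is only an illustrative sketch of how such a one-to-one mapping could be computed, here using Pearson correlation and the Hungarian algorithm from SciPy.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def one_to_one_match(latents, neurons):
    """Illustrative matching of model latent units to recorded neurons.

    latents: (n_images, n_latents) responses of beta-VAE latent units to face images
    neurons: (n_images, n_neurons) firing rates of recorded neurons to the same images
    Returns index pairs (latent_idx, neuron_idx) maximizing the summed |correlation|.
    """
    # Absolute Pearson correlation between every latent unit and every neuron
    z = (latents - latents.mean(0)) / latents.std(0)
    r = (neurons - neurons.mean(0)) / neurons.std(0)
    corr = np.abs(z.T @ r) / len(latents)          # (n_latents, n_neurons)
    # Hungarian algorithm finds the best one-to-one assignment
    lat_idx, neu_idx = linear_sum_assignment(-corr)
    return list(zip(lat_idx, neu_idx)), corr[lat_idx, neu_idx]
```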
If β-VAE was indeed able to automatically discover artificial latent units that are equivalent to real neurons in terms of how they respond to face images, then it should be possible to translate the activity of real neurons into their matched artificial counterparts, and use the generator of the trained β-VAE (see Figure 1) to visualize what faces the real neurons are representing. To test this, we presented the monkeys with new face images that the model had never seen before, and checked whether we could render them using the β-VAE generator (see Figure 3). We found that this was indeed possible. Using the activity of as few as 12 neurons, we were able to generate face images that were more accurate reconstructions of the originals, and of better visual quality, than those produced by the alternative deep models. This is despite the fact that the alternative models are known to be better image generators than β-VAE in general.
[Figure 3]
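As a rough illustration of this decoding step, the sketch below fits a simple linear readout from recorded firing rates to the matched β-VAE latents and then renders faces with the trained generator. The linear readout and the `decoder` callable are assumptions for this sketch; the actual decoding procedure in the paper may differ.

```python
import numpy as np

def reconstruct_faces_from_neurons(train_rates, train_latents, test_rates, decoder):
    """Sketch: map recorded firing rates to beta-VAE latents, then decode them to images.

    train_rates:   (n_train, n_neurons) responses to faces also shown to the model
    train_latents: (n_train, n_latents) beta-VAE latents for those same faces
    test_rates:    (n_test, n_neurons)  responses to novel faces to be visualized
    decoder:       trained beta-VAE generator mapping latents to images (assumed given)
    """
    # Least-squares linear readout from neurons to latent units (with a bias column)
    X = np.hstack([train_rates, np.ones((len(train_rates), 1))])
    W, *_ = np.linalg.lstsq(X, train_latents, rcond=None)
    # Predict latents for the held-out faces and render them with the generator
    X_test = np.hstack([test_rates, np.ones((len(test_rates), 1))])
    z_pred = X_test @ W
    return decoder(z_pred)
```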
Our results, summarized in the new paper, suggest that the visual brain can be understood at the level of single neurons, even at the end of its processing hierarchy. This is contrary to the common belief that semantic information is multiplexed across a large number of such neurons, each of them remaining largely uninterpretable individually, not unlike how information is encoded across entire layers of artificial neurons in deep classifiers. Not only that, our findings suggest that the brain may learn to support our effortless ability of visual perception by optimizing for disentanglement. While β-VAE was originally developed with inspiration from high-level neuroscience principles, the utility of disentangled representations for intelligent behavior has so far been demonstrated mainly in the machine learning community. In line with the rich history of mutually beneficial interactions between neuroscience and machine learning, we hope that the latest insights from machine learning will now feed back to the neuroscience community, to investigate the merit of disentangled representations for supporting intelligence in biological systems, in particular as the basis of abstract reasoning, or of efficient task learning that generalizes well.