Based on Transformers, our new Enformer architecture advances genomic research by improving the ability to predict how DNA sequence influences gene expression.
When the Human Genome Project succeeded in mapping the DNA sequences of the human genome, the international research community was excited by the opportunity to better understand the genetic instructions that influence human health and development. DNA carries the genetic information that determines everything from eye color to susceptibility to certain diseases and disorders. Roughly 20,000 sections of DNA in the human body, known as genes, contain instructions for the amino acid sequences of proteins, which perform numerous essential functions in our cells. Yet these genes make up less than 2% of the genome. The remaining base pairs, which account for 98% of the 3 billion "letters" in the genome, are called "non-coding" and contain less well understood instructions about when and where genes should be expressed in the human body. At DeepMind, we believe that AI can unlock a deeper understanding of such complex domains, accelerating scientific progress and offering potential benefits to human health.
Nature Methods today published "Effective gene expression prediction from sequence by integrating long-range interactions" (first shared as a preprint on bioRxiv), in which we, in collaboration with our colleagues at Alphabet's Calico, introduce a neural network architecture called Enformer that significantly improves the accuracy of predicting gene expression from DNA sequence. To advance the study of gene regulation and causal factors in disease, we have also made our model and its initial predictions for common genetic variants openly available here.
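For readers who want to try the released model, below is a minimal loading sketch. The TensorFlow Hub handle, the 393,216-bp one-hot input window, and the output shapes follow the released usage notes at the time of writing; treat them as assumptions and check the model page for the current interface.

```python
# Minimal sketch: loading the released Enformer model from TensorFlow Hub.
# The hub handle, input length, and output shapes follow the release notes
# and may change; verify against the published model page.
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

enformer = hub.load("https://tfhub.dev/deepmind/enformer/1").model

SEQ_LEN = 393_216  # input window in base pairs, one-hot encoded A/C/G/T

# Random one-hot sequence as a stand-in for a real genomic window.
idx = np.random.randint(0, 4, size=SEQ_LEN)
one_hot = np.eye(4, dtype=np.float32)[idx][np.newaxis]  # (1, SEQ_LEN, 4)

predictions = enformer.predict_on_batch(tf.constant(one_hot))
print(predictions["human"].shape)  # (1, 896, 5313): 896 bins x 5,313 human tracks
```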
Previous work on gene expression has typically used convolutional neural networks as the basic building block, but their limited ability to model the influence of distal enhancers on gene expression has hampered their accuracy and application. Our initial explorations relied on Basenji2, which can predict regulatory activity from a relatively long DNA sequence of 40,000 base pairs. Motivated by this work, and by the knowledge that regulatory DNA elements can influence expression at far greater distances, we saw the need for a fundamental architectural change to capture long sequences.
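To make the receptive-field limitation concrete, here is a back-of-the-envelope calculation with illustrative layer counts (not Basenji2's exact configuration): each dilated convolution adds a fixed amount of context per layer, so covering hundreds of kilobases with convolutions alone requires considerable depth, whereas a single self-attention layer connects all positions directly.

```python
# Illustrative receptive-field arithmetic for a dilated convolution stack
# (toy configuration, not Basenji2's exact architecture): a layer with
# kernel size k and dilation d adds (k - 1) * d positions of context.
def dilated_stack_receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    field = 1
    for layer in range(num_layers):
        dilation = 2 ** layer  # doubling dilation schedule
        field += (kernel_size - 1) * dilation
    return field

for n in (7, 11, 15):
    print(n, "layers ->", dilated_stack_receptive_field(n), "bp of context")
# Even with doubling dilations, approaching a ~200,000 bp context takes many
# layers; self-attention instead lets every position attend to every other.
```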
We developed a new model based on Transformers, common in natural language processing, to take advantage of self-attention mechanisms that can integrate a much larger DNA context. Because Transformers are ideal for reading long passages of text, we adapted them to 'read' vastly extended DNA sequences. By efficiently processing sequences to consider interactions at distances more than 5 times longer (i.e., 200,000 base pairs) than in previous methods, our architecture can model the influence of important regulatory elements called enhancers on gene expression from farther away within the DNA sequence.
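At its core this is standard scaled dot-product self-attention. The NumPy sketch below is purely illustrative, with toy shapes and names, and omits the relative positional encodings and multiple heads used in the full model:

```python
# Minimal scaled dot-product self-attention over a toy one-hot DNA input.
# Shapes and parameter names are illustrative, not Enformer's implementation.
import numpy as np

rng = np.random.default_rng(0)

L, d_model = 16, 8                       # toy sequence length and channel width
dna = rng.integers(0, 4, size=L)         # A/C/G/T encoded as integers
x = np.eye(4)[dna] @ rng.normal(size=(4, d_model))  # embed one-hot bases

Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv

scores = q @ k.T / np.sqrt(d_model)      # every position scores every other,
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
out = weights @ v                        # so distance does not limit interaction

print(out.shape)  # (16, 8)
```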
To better understand how Enformer interprets the DNA sequence to arrive at more accurate predictions, we used contribution scores to highlight which parts of the input sequence were most influential for the prediction. Matching biological intuition, we observed that the model paid attention to enhancers even when they were more than 50,000 base pairs away from the gene. Predicting which enhancers regulate which genes remains a major unsolved problem in genomics, so we were pleased to see that Enformer's contribution scores performed comparably to existing methods developed specifically for this task (which use experimental data as input). Enformer also learned about insulator elements, which separate two independently regulated regions of DNA.
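Contribution scores of this kind are commonly computed from input gradients (for example, gradient times input). The sketch below uses a toy differentiable model in place of Enformer; the attribution recipe shown is the general one, and the exact method used in the paper's analysis may differ in detail:

```python
# Sketch of gradient-times-input contribution scores, with a toy model
# standing in for Enformer. This is the generic attribution recipe, not
# necessarily the paper's exact procedure.
import numpy as np
import tensorflow as tf

L = 32
one_hot = tf.constant(np.eye(4, dtype=np.float32)[np.random.randint(0, 4, L)][None])

toy_model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(8, 5, padding="same", activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1),
])

with tf.GradientTape() as tape:
    tape.watch(one_hot)                   # input is a constant, so watch it
    prediction = toy_model(one_hot)[0, 0]

grads = tape.gradient(prediction, one_hot)          # (1, L, 4)
contribution = tf.reduce_sum(grads * one_hot, -1)   # per-base score, (1, L)
print(contribution.numpy().round(3))
```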
Although it is now possible to study the DNA of an entire organism, complex experiments are still required to understand the genome. Despite enormous experimental effort, the vast majority of DNA's control over gene expression remains a mystery. With AI, we can explore new possibilities for finding patterns in the genome and for generating mechanistic hypotheses about the effects of sequence changes. Similar to a spell checker, Enformer partially understands the vocabulary of DNA sequence and can therefore highlight edits that may alter gene expression.
The main application of this new model is to predict which changes to the letters of DNA, also called genetic variants, will alter gene expression. Compared with previous models, Enformer is significantly more accurate at predicting the effects of variants on gene expression, both for natural genetic variants and for synthetic variants that alter important regulatory sequences. This property is useful for interpreting the growing number of disease-associated variants obtained through genome-wide association studies. Variants associated with complex genetic diseases are predominantly located in the non-coding regions of the genome and likely cause disease by altering gene expression. But due to the inherent correlations among variants, many of these disease-associated variants are only spuriously correlated rather than causal. Computational tools can now help distinguish true associations from false positives.
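Concretely, a variant's predicted effect can be scored by running the model on the reference sequence and on the same sequence with the alternate allele substituted, then comparing the two predictions. In the sketch below, `model`, `variant_effect`, and the toy stand-in are hypothetical names illustrating this general recipe, not the paper's exact pipeline:

```python
# Hedged sketch of variant-effect scoring: predict on reference and alternate
# sequences and compare. `model` stands in for Enformer; names are hypothetical.
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq: str) -> np.ndarray:
    return np.eye(4, dtype=np.float32)[[BASES[b] for b in seq]][None]

def variant_effect(model, ref_seq: str, pos: int, alt_base: str) -> np.ndarray:
    """Predicted expression change from substituting alt_base at pos."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    ref_pred = model(one_hot(ref_seq))   # (1, bins, tracks)
    alt_pred = model(one_hot(alt_seq))
    return (alt_pred - ref_pred).sum(axis=1)  # per-track effect, summed over bins

# Example with a trivial stand-in model that weights each base differently:
toy_model = lambda x: (x * np.arange(4)).sum(axis=-1, keepdims=True)
print(variant_effect(toy_model, "ACGT" * 8, pos=5, alt_base="T"))
```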
We are far from solving the untold mysteries that remain in the human genome, but Enformer is a step forward in understanding the complexity of genomic sequences. If you are interested in using AI to explore how the basic processes of cells work, how they are encoded in DNA sequences, and how to build new systems to advance genomics and our understanding of disease, we are hiring. We also look forward to expanding our collaborations with other researchers and institutions eager to explore computational models to help solve open questions at the heart of genomics.