Enabling high-accuracy protein structure prediction at the proteome scale

AlphaFold method

Several new machine learning innovations contribute to AlphaFold’s current level of accuracy. We provide a high-level overview of the system below; for a technical description of the network architecture, see our AlphaFold methods paper, especially its extensive Supplementary Information.

The AlphaFold network consists of two main stages. Stage 1 takes as input the amino acid sequence and a multiple sequence alignment (MSA). Its goal is to learn an information-rich “pairwise representation” that captures which residue pairs are close together in three-dimensional space.

Stage 2 uses this representation to directly produce atomic coordinates by treating each residue as a separate object, predicting the rotation and translation needed to place each residue, and finally assembling a structured chain. The design of the network draws on our intuitions about protein physics and geometry, for example in the form of the updates applied and in the choice of loss.
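The idea of placing each residue with its own rotation and translation can be illustrated with a minimal sketch. This is not AlphaFold's implementation: the local backbone coordinates, the z-axis rotation helper and the function names are illustrative assumptions, and the real network predicts general 3D rotations for every residue.

```python
import math

def rotation_z(angle):
    """3x3 rotation matrix about the z-axis, as nested lists."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply_frame(rotation, translation, point):
    """Map a point from a residue's local frame to global coordinates: R p + t."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )

# Illustrative (assumed) backbone atom positions in a residue's local frame,
# in angstroms, with the C-alpha atom at the origin.
LOCAL_BACKBONE = {"N": (-0.53, 1.36, 0.0), "CA": (0.0, 0.0, 0.0), "C": (1.53, 0.0, 0.0)}

def place_residues(frames):
    """frames: one (rotation, translation) pair per residue.

    Returns, per residue, a dict of atom name -> global coordinates.
    """
    return [
        {name: apply_frame(rot, trans, xyz) for name, xyz in LOCAL_BACKBONE.items()}
        for rot, trans in frames
    ]

# Two hypothetical residues: one at the origin, one rotated 90 degrees and shifted.
frames = [
    (rotation_z(0.0), (0.0, 0.0, 0.0)),
    (rotation_z(math.pi / 2), (3.8, 0.0, 0.0)),
]
chain = place_residues(frames)
```

Because every residue carries its own rigid transform, the same local atom template is reused for each one; only the predicted rotation and translation differ.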

Interestingly, we can produce a 3D structure from the representation at intermediate layers of the network. The resulting “trajectory” videos show how AlphaFold’s belief about the correct structure develops during inference, layer by layer. A hypothesis typically emerges after the first few layers, followed by a lengthy refinement process, although some targets require the full depth of the network to reach a good prediction.

The predicted structure of CASP14 targets T1044, T1024 and T1064 in successive layers of the network. The structures are colored by the residue number and the counter displays the current layer.

Accuracy and confidence

AlphaFold was rigorously evaluated in the CASP14 experiment, in which participants blindly predict protein structures that have been solved but not yet made public. The method achieved high accuracy in most cases, with a median RMSD-Cα at 95% coverage of less than 1 Å relative to the experimental structure. In our papers, we also evaluated the model on a much larger set of recent PDB entries. Among the findings are strong performance on large proteins and good side-chain accuracy where the backbone is well predicted.

CASP14 accuracy of AlphaFold relative to other methods. RMSD-Cα is computed on the 95% best-predicted residues for each target.
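To make the "95% best-predicted residues" idea concrete, here is a simplified sketch. It assumes the prediction has already been superposed on the experimental structure and that per-residue Cα deviations are available; the function name and the fixed superposition are assumptions for illustration, and the published metric may handle the alignment differently.

```python
import math

def rmsd_at_coverage(deviations, coverage=0.95):
    """RMSD over the best-predicted fraction of residues.

    deviations: per-residue C-alpha distances (in angstroms) between a
    prediction and the experimental structure, after superposition.
    """
    if not deviations:
        raise ValueError("need at least one residue")
    # Number of residues kept; the small epsilon guards against float rounding.
    keep = max(1, int(coverage * len(deviations) + 1e-9))
    best = sorted(deviations)[:keep]
    return math.sqrt(sum(d * d for d in best) / keep)

# Hypothetical target: 19 well-predicted residues and one badly misplaced one.
example = [1.0] * 19 + [10.0]
rmsd_95 = rmsd_at_coverage(example)         # ignores the single outlier
rmsd_all = rmsd_at_coverage(example, 1.0)   # dominated by the outlier
```

Discarding the worst 5% of residues keeps a single badly placed loop or terminus from dominating the score for an otherwise accurate prediction.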

An important factor in the utility of structure predictions is the quality of the associated confidence measures. Can the model identify which parts of the prediction are likely to be reliable? We developed two confidence metrics on top of the AlphaFold network to address this question.

The first is pLDDT (predicted lDDT-Cα), a per-residue measure of local confidence on a scale from 0 to 100. pLDDT can vary significantly along the chain, allowing the model to express, for example, high confidence in structured domains but low confidence in the linkers between them. In our paper, we provide evidence that some regions with low pLDDT may be unstructured in isolation: either intrinsically disordered or structured only in the context of a larger complex. Regions with pLDDT < 50 should not be interpreted, except as a possible prediction of disorder.
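As a small illustration of using the pLDDT < 50 guideline, the sketch below finds contiguous stretches of low-confidence residues in a list of per-residue scores. The function name, the 1-based numbering and the example scores are assumptions made for this example; only the threshold of 50 comes from the text above.

```python
def low_confidence_segments(plddt, threshold=50.0):
    """Return 1-based, inclusive (start, end) ranges where pLDDT < threshold."""
    segments = []
    start = None
    for index, score in enumerate(plddt, start=1):
        if score < threshold:
            if start is None:
                start = index  # a new low-confidence run begins here
        elif start is not None:
            segments.append((start, index - 1))
            start = None
    if start is not None:
        # The chain ends inside a low-confidence run.
        segments.append((start, len(plddt)))
    return segments

# Made-up scores for an 8-residue chain.
scores = [92.1, 88.4, 47.0, 35.2, 41.9, 76.5, 90.3, 49.9]
print(low_confidence_segments(scores))  # [(3, 5), (8, 8)]
```

Segments reported this way are candidates for disorder, not regions whose coordinates should be read literally.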

The second metric is PAE (predicted aligned error), which reports AlphaFold’s predicted position error at residue x when the predicted and true structures are aligned on residue y. This is useful for assessing confidence in global features, especially domain packing. For residues x and y drawn from two different domains, a consistently low PAE at (x, y) indicates that AlphaFold is confident about the relative domain positions. A consistently high PAE at (x, y) indicates that the relative positions of the domains should not be interpreted. The general approach used to produce PAE can be adapted to predict a variety of superposition-based metrics, including TM-score and GDT.
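One way to summarize what PAE says about a pair of domains is to average it over all residue pairs that span the two domains. The sketch below is an assumption-laden illustration: the matrix is a made-up 4-residue example stored as nested lists with 0-based indices, the domain boundaries are supplied by the user, and no cutoff for "low" or "high" is implied.

```python
def mean_interdomain_pae(pae, domain_a, domain_b):
    """Mean PAE over residue pairs with one residue in each domain.

    pae[x][y] is the predicted error at residue x when aligned on residue y.
    PAE is not symmetric, so both (x, y) and (y, x) are included.
    """
    values = []
    for x in domain_a:
        for y in domain_b:
            values.append(pae[x][y])
            values.append(pae[y][x])
    if not values:
        raise ValueError("both domains must contain at least one residue")
    return sum(values) / len(values)

# Made-up PAE matrix (angstroms): residues 0-1 form one domain, 2-3 another.
toy_pae = [
    [0.0, 1.0, 20.0, 22.0],
    [1.0, 0.0, 21.0, 19.0],
    [18.0, 20.0, 0.0, 1.0],
    [22.0, 18.0, 1.0, 0.0],
]
between = mean_interdomain_pae(toy_pae, [0, 1], [2, 3])
within = mean_interdomain_pae(toy_pae, [0], [1])
```

In this toy case the value between the domains is much larger than the value within a domain, which is the pattern that signals an unreliable relative placement.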

Per-residue confidence (pLDDT) and predicted aligned error (PAE) for two example proteins (P54725, Q5VSL9). Both have confident individual domains, but the latter also has confident relative domain positions. Note: Q5VSL9 was solved after this prediction was released.

To be clear, AlphaFold models are ultimately predictions: although they are often very accurate, they are sometimes wrong. The predicted atomic coordinates must be interpreted carefully, and in the context of these confidence measures.

Open source

Along with our methods paper, we have made the AlphaFold source code available on GitHub. This includes access to a trained model and a script for making predictions on new input sequences. We believe this is an important step that will enable the community to use and build on our work. The easiest way to fold a single new protein with AlphaFold is to use our Colab notebook.

The open source code is an updated version of our CASP14 system based on the JAX framework, and it achieves the same high accuracy. It also incorporates some recent performance improvements. AlphaFold’s speed has always depended heavily on the length of the input sequence, with short proteins taking minutes to process and only very long proteins taking hours. Once the MSA has been assembled, the open source version can now predict the structure of a 400-residue protein in just over a minute of GPU time on a V100.

Proteome scale and AlphaFold DB

AlphaFold’s fast inference times allow the method to be applied at whole-proteome scale. In our paper, we discuss AlphaFold’s predictions for the human proteome. We have since generated predictions for the reference proteomes of a number of model organisms, pathogens, and economically important species, and large-scale prediction is now routine. Interestingly, we observe a difference in the distribution of pLDDT between species, with generally higher confidence in bacteria and archaea and lower confidence in eukaryotes, which we hypothesize may be related to the prevalence of disorder in these proteomes.

No single research group can fully explore such a large dataset, so we have partnered with EMBL-EBI to make the predictions freely available via AlphaFold DB. Each prediction can be viewed alongside the confidence measures described above. A bulk download is also provided for each species, and all data is covered by the CC-BY-4.0 license (making it freely available for both academic and commercial use). We are very grateful to EMBL-EBI for working with us to develop this new resource. Over the coming months, we plan to expand the dataset to cover the more than 100 million proteins in UniRef90.

Example: AlphaFold DB predictions from a variety of organisms.

Distribution of per-residue confidence for 14 species. From left to right: bacteria/archaea, animals and protists.

In AlphaFold DB, we chose to share predictions of complete protein chains up to 2700 amino acids in length, rather than cropping to individual domains. The rationale is that this avoids missing structured regions that have yet to be annotated. It also provides context from the complete amino acid sequence, and allows the model to attempt a prediction of domain packing. AlphaFold’s intra-domain accuracy was evaluated more extensively in CASP14 and is expected to be higher than its inter-domain accuracy. However, AlphaFold was still the highest-ranked method in the inter-domain assessment, and we expect it to produce an informative prediction in some cases. We encourage users to check the PAE plot to determine whether the domain placement is likely to be meaningful.

Future work

We are excited about the future of computational structural biology. Several important topics remain to be addressed: predicting the structure of complexes, incorporating non-protein components, and capturing dynamics and the response to point mutations. The development of network architectures such as AlphaFold that excel at the task of understanding protein structure is reason for optimism that we can make progress on related problems.

We see AlphaFold as a complementary technology to experimental structural biology. This is perhaps best illustrated by its role in helping to solve experimental structures, through molecular replacement and docking into cryo-EM volumes. Both applications can accelerate existing research, saving months of effort. From a bioinformatics perspective, AlphaFold’s speed enables the generation of predicted structures on a large scale. This has the potential to open up new avenues of research by supporting structural investigations of the contents of large sequence databases.

Ultimately, we hope AlphaFold will prove to be a useful tool for illuminating the protein space, and we look forward to seeing how it is applied in the months and years to come.

We’d love to hear your feedback and understand how AlphaFold and AlphaFold DB have been helpful in your research. Share your stories at alphafold@deepmind.com.

Ahmed Haroud