Scientists have created an AI system capable of producing synthetic enzymes from scratch. In lab tests, some of these enzymes worked as well as those found in nature, even when their artificially generated amino acid sequences diverged significantly from those of any known natural protein.
The experiment shows that natural language processing, although it was developed to read and write language text, can learn at least some of the basic principles of biology. Salesforce Research developed the AI program, called ProGen, which uses next-token prediction to assemble amino acid sequences into artificial proteins.
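To make the idea of next-token prediction concrete, here is a minimal sketch in which each amino acid is sampled from a probability distribution conditioned on what came before. It is a toy bigram model, not ProGen: the real system is a large neural language model, and the function names and training strings below are hypothetical and chosen only for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard amino acids


def train_bigram(sequences):
    """Count how often each residue follows another (with add-one smoothing)."""
    counts = {a: {b: 1 for b in AMINO_ACIDS} for a in AMINO_ACIDS}
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, start, length, rng):
    """Build a sequence one residue at a time, sampling from P(next | previous)."""
    seq = [start]
    while len(seq) < length:
        options = counts[seq[-1]]
        residues = list(options)
        weights = [options[r] for r in residues]
        seq.append(rng.choices(residues, weights=weights, k=1)[0])
    return "".join(seq)


# Hypothetical toy training data, not real lysozyme sequences.
toy_training = ["MKALIVLGLV", "MKVLAALGLA", "MRALIVAGLV"]
model = train_bigram(toy_training)
print(generate(model, start="M", length=12, rng=random.Random(0)))
```

A bigram model looks back only one residue, whereas a language model of the kind described here conditions on the whole preceding sequence. The generation loop has the same shape in both cases.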
The new technology could become even more powerful than directed evolution, the Nobel Prize-winning protein design technique, scientists said, and will revitalize the 50-year-old field of protein engineering by speeding up the development of new proteins that can be used for almost anything, from therapeutics to degrading plastic.
“Synthetic designs work much better than designs inspired by the evolutionary process,” said James Fraser, PhD, professor of bioengineering and therapeutic sciences at the University of California, San Francisco School of Pharmacy and an author of the work, which was published Jan. 26 in Nature Biotechnology.
“The language model learns aspects of evolution, but it differs from the normal evolutionary process,” said Fraser. “We now have the ability to tune the generation of these properties for specific effects. For example, an enzyme that is incredibly thermostable or likes acidic environments or won’t interact with other proteins.”
To create the model, the scientists simply fed the amino acid sequences of 280 million different proteins of all types into the machine learning model and let it digest the information for a few weeks. Next, they fine-tuned the model by priming it with 56,000 sequences from five lysozyme families, along with some contextual information about these proteins.
The model quickly generated a million sequences, and the research team selected 100 for testing, based on how closely they resembled the sequences of natural proteins, as well as how natural the amino acid “grammar” and “semantics” underlying the AI proteins were.
From this first batch of 100 proteins, which were screened in the lab by Tierra Biosciences, the team made five synthetic proteins to test in cells and compared their activity to that of an enzyme found in chicken egg whites, known as hen egg white lysozyme (HEWL). Similar lysozymes are found in human tears, saliva and milk, where they defend against bacteria and fungi.
Two of the synthetic enzymes were able to break down the cell walls of bacteria with activity comparable to that of HEWL, yet their sequences were only 18% identical to each other. The two sequences were 90% and 70% identical, respectively, to any known protein.
Just one mutation in a natural protein can make it stop working, but in a different round of screening, the team found that the AI-generated enzymes showed activity even when as little as 31.4% of their sequence resembled any known natural protein.
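The percentages quoted above are sequence identities. As a rough sketch of what such a figure measures, the snippet below counts matching positions between two sequences that are assumed to be already aligned and of equal length. Published identity values are normally computed after a proper sequence alignment that allows for gaps; this simplified function and its example strings are hypothetical and are not the method used in the study.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two equal-length, pre-aligned sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


# Hypothetical toy sequences: 6 of the 10 positions match.
print(percent_identity("MKALIVLGLV", "MKVLAALGLA"))  # 60.0
```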
The AI was even able to learn how the enzymes should be shaped, simply by studying the raw sequence data. When measured using X-ray crystallography, the atomic structures of the synthetic proteins looked as they should, although the sequences were unlike anything seen before.
Salesforce Research developed ProGen in 2020, based on a type of natural language programming their researchers originally developed to generate English text.
They knew from their previous work that an AI system could teach itself grammar and the meaning of words, along with other basic rules that make writing well-formed.
“When you train sequence-based models with a lot of data, they are really powerful in learning structure and rules,” said Nikhil Naik, PhD, research director of artificial intelligence at Salesforce Research, and a senior author of the paper. “They learn what words can co-occur, and also compositionality.”
With proteins, the design options were almost unlimited. Lysozymes are small as proteins go, with up to about 300 amino acids. But with 20 possible amino acids at each position, there is an enormous number (20^300) of possible combinations. This is greater than taking all humans who have ever lived, multiplied by the number of grains of sand on Earth, multiplied by the number of atoms in the universe.
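The scale of that number, 20 to the power of 300, is easier to grasp as a power of ten. The short calculation below works it out and compares it with the product mentioned above. The three comparison figures are rough, commonly quoted order-of-magnitude estimates that are assumed here for illustration; they do not come from the study.

```python
import math

ALPHABET_SIZE = 20  # standard amino acids
LENGTH = 300        # upper bound on lysozyme length quoted in the article

# log10(20^300) = 300 * log10(20)
log10_combinations = LENGTH * math.log10(ALPHABET_SIZE)
print(f"20^300 is about 10^{log10_combinations:.1f}")  # 10^390.3

# Python integers are arbitrary precision, so the exact value can be checked too.
print(len(str(ALPHABET_SIZE ** LENGTH)))  # 391 decimal digits

# Assumed rough orders of magnitude (illustrative only):
LOG10_HUMANS_EVER = 11     # about 10^11 people
LOG10_SAND_GRAINS = 19     # about 10^19 grains
LOG10_ATOMS_UNIVERSE = 80  # about 10^80 atoms
log10_comparison = LOG10_HUMANS_EVER + LOG10_SAND_GRAINS + LOG10_ATOMS_UNIVERSE
print(log10_combinations > log10_comparison)  # True: 390.3 versus 110
```

Under these assumptions the product of the three quantities is around 10^110, which falls short of 20^300 by roughly 280 orders of magnitude.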
Given the endless possibilities, it is remarkable that the model can easily generate functioning enzymes.
“The ability to generate functional proteins from scratch out of the box signifies that we are entering a new era of protein design,” said Ali Madani, PhD, founder of Profluent Bio, former research scientist at Salesforce Research, and the paper’s first author. “This is a new, versatile tool available to protein engineers, and we look forward to seeing therapeutic applications.”
More information: https://github.com/salesforce/progen