Currently, there are around 100 million known distinct proteins, with many more found every year. Each one has a unique 3D shape that determines how it works and what it does.
But figuring out the exact structure of a protein remains an expensive and often time-consuming process, meaning we only know the exact 3D structure of a tiny fraction of the proteins known to science.
Finding a way to close this rapidly expanding gap and predict the structure of millions of unknown proteins could not only help us tackle disease and more quickly find new medicines but perhaps also unlock the mysteries of how life itself works.
Proteins are made up of chains of amino acids, and these sequences are assembled according to the genetic instructions of an organism's DNA.
Attraction and repulsion between the 20 different types of amino acids cause the string to fold in a feat of ‘spontaneous origami’, forming the intricate curls, loops, and pleats of a protein’s 3D structure.
For decades, scientists have been trying to find a method to reliably determine a protein’s structure just from its sequence of amino acids.
This grand scientific challenge is known as the protein folding problem.
AlphaFold, our AI system for predicting protein structure, was taught by showing it the sequences and structures of around 100,000 known proteins.
Experimental techniques for determining structures are laborious and time-consuming (sometimes taking years and costing millions of dollars).
The latest version of AlphaFold can now predict the shape of a protein, at scale and in minutes, down to atomic accuracy.
This is a significant breakthrough and highlights the impact AI can have on science.
CASP (Critical Assessment of protein Structure Prediction) is a community forum that allows researchers to share progress on the protein folding problem. The community also organises a biennial challenge in which research groups test the accuracy of their predictions against real experimental data.
Teams are given a selection of amino acid sequences for proteins which have had their exact 3D shape mapped but have not yet been released into the public domain. Groups must submit their best predictions to see how close they are to the subsequently revealed structures.
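To make "how close" concrete, here is a minimal sketch of scoring a prediction against an experimentally determined structure. It is an illustration under stated assumptions, not CASP's actual evaluation code: the function names (`ca_distances`, `rmsd`, `gdt_ts_like`) and the toy coordinates are hypothetical, and the two structures are assumed to be already superimposed. CASP's headline metric, GDT_TS, additionally searches over many superpositions to find the most favourable one. This simplified version skips that search.

```python
import math

def ca_distances(predicted, experimental):
    """Per-residue distances (in angstroms) between matched C-alpha atoms.

    Assumes both structures are already superimposed in the same frame.
    """
    if len(predicted) != len(experimental):
        raise ValueError("structures must have the same number of residues")
    return [math.dist(p, e) for p, e in zip(predicted, experimental)]

def rmsd(distances):
    """Root-mean-square deviation over all residues."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))

def gdt_ts_like(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS-style score on a 0-100 scale.

    For each cutoff, compute the fraction of residues whose predicted
    position lies within that distance of the experimental position,
    then average the fractions. No superposition search is performed.
    """
    n = len(distances)
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)

# Toy, made-up coordinates for a four-residue chain (not real protein data).
experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 9.0, 0.0)]

distances = ca_distances(predicted, experimental)
score = gdt_ts_like(distances)
print(f"RMSD: {rmsd(distances):.2f} A, GDT_TS-like score: {score:.2f}")
```

Averaging over several distance cutoffs makes the score less sensitive than RMSD to a single badly placed residue, which is one reason a threshold-based measure suits a comparison across many targets.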
Among the teams that participated in CASP13 (2018), AlphaFold placed first in the protein structure prediction challenge. At CASP14 (2020), we presented our latest version of AlphaFold, which has now reached a level of accuracy considered to solve the protein structure prediction problem.
Our work builds upon decades of research by CASP’s organisers and the protein folding community, and we’re indebted to the countless people who have contributed protein structures over the years, making such rigorous evaluations possible.
The AlphaFold Protein Structure Database, created in partnership with Europe’s flagship laboratory for life sciences (EMBL’s European Bioinformatics Institute), builds on decades of painstaking work done by scientists using traditional methods to determine the structure of proteins.
Our first release covers over 350,000 structures, including the human proteome (all of the ~20,000 known proteins expressed in the human body), along with the proteomes of 20 additional organisms important for biological research, including yeast, the fruit fly, and the mouse.
These organisms are central to modern biological research, including Nobel Prize winning discoveries and life-saving drug development.
The release of these structures dramatically expands our knowledge of protein structures and more than doubles the number of high-accuracy human protein structures available to scientists around the world.
AlphaFold is already being used by our partners. For instance, the Drugs for Neglected Diseases Initiative (DNDi) has advanced their research into life-saving cures for diseases that disproportionately affect the poorer parts of the world, and the Centre for Enzyme Innovation at the University of Portsmouth (CEI) is using AlphaFold's predictions to help engineer faster enzymes for recycling some of our most polluting single-use plastics.
A team at the University of Colorado Boulder is finding promise in using AlphaFold predictions to study antibiotic resistance, while a group at the University of California San Francisco has used them to increase their understanding of SARS-CoV-2 biology.
In the coming months we plan to vastly expand the AlphaFold Protein Structure Database to cover almost every sequenced protein known to science. Adding predicted structures for the more than 100 million proteins contained in the UniProt reference database, the most comprehensive resource of protein sequences, will create a veritable protein almanac of the world.
And the system and database will periodically be updated as we continue to invest in future improvements to AlphaFold.
We’re excited about this next phase of AlphaFold’s journey, and look forward to continuing our work with the global scientific community to unlock the potential of the building blocks of life.