The scientific community has galvanised in response to the recent COVID-19 outbreak, building on decades of basic research characterising this virus family. Labs at the forefront of the outbreak response shared genomes of the virus in open access databases, which enabled researchers to rapidly develop tests for this novel pathogen. Other labs have shared experimentally-determined and computationally-predicted structures of some of the viral proteins, and still others have shared epidemiological data. We hope to contribute to the scientific effort using the latest version of our AlphaFold system by releasing structure predictions of several under-studied proteins associated with SARS-CoV-2, the virus that causes COVID-19. We emphasise that these structure predictions have not been experimentally verified, but hope they may contribute to the scientific community’s interrogation of how the virus functions, and serve as a hypothesis generation platform for future experimental work in developing therapeutics. We’re indebted to the work of many other labs: this work wouldn’t be possible without the efforts of researchers across the globe who have responded to the COVID-19 outbreak with incredible agility.
Knowing a protein’s structure provides an important resource for understanding how it functions, but experiments to determine the structure can take months or longer, and some prove to be intractable. For this reason, researchers have been developing computational methods to predict protein structure from the amino acid sequence. In cases where the structure of a similar protein has already been experimentally determined, algorithms based on “template modelling” are able to provide accurate predictions of the protein structure. AlphaFold, our recently published deep learning system, focuses on predicting protein structure accurately when no structures of similar proteins are available, called “free modelling”. We’ve continued to improve these methods since that publication and want to provide the most useful predictions, so we’re sharing predicted structures for some of the proteins in SARS-CoV-2 generated using our newly-developed methods.
It’s important to note that our structure prediction system is still in development and we can’t be certain of the accuracy of the structures we are providing, although we are confident that the system is more accurate than our earlier CASP13 system. We confirmed that our system provided an accurate prediction for the experimentally determined SARS-CoV-2 spike protein structure shared in the Protein Data Bank, and this gave us confidence that our model predictions on other proteins may be useful. We recently shared our results with several colleagues at the Francis Crick Institute in the UK, including structural biologists and virologists, who encouraged us to release our structures to the general scientific community now. Our models include per-residue confidence scores to help indicate which parts of the structure are more likely to be correct. We have only provided predictions for proteins which lack suitable templates or are otherwise difficult for template modeling. While these understudied proteins are not the main focus of current therapeutic efforts, they may add to researchers’ understanding of SARS-CoV-2.
Normally we’d wait to publish this work until it had been peer-reviewed for an academic journal. However, given the seriousness and time-sensitivity of the situation, we’re releasing the predicted structures as we have them now, under an open license so that anyone can make use of them.
Interested researchers can read more technical details about these predictions in a document included with the data. The protein structure predictions we're releasing are for SARS-CoV-2 membrane protein, protein 3a, Nsp2, Nsp4, Nsp6, and Papain-like proteinase (C terminal domain). To emphasise, these are predicted structures which have not been experimentally verified. Work on the system continues for us, and we hope to share more about it in due course.