DeepMind's Protein LEGO Project

The Human Genome Project gave us the blueprint; DeepMind gave us the LEGOs

Aug 05, 2021

On July 22, DeepMind announced what will come to be known as one of the greatest milestones of biological understanding. In a feat as significant as sequencing the human genome, DeepMind's AlphaFold has predicted the 3D structures for the entire known proteome of 14 of the best-studied organisms, including, naturally, humans.

If the human genome project gave us the blueprint, DeepMind just gave us the LEGOs.

https://media.giphy.com/media/l0JMrPWRQkTeg3jjO/giphy.gif

Starting from the blueprint

It took until the 1940s to discover that DNA is the molecule of heredity(1). Even armed with that knowledge, it wasn't until 1953 that the basic 3D structure of DNA was discovered. That information allowed us to make incredible advances in understanding human genetics, but we still lacked the complete sequence of the human genome.

Because our tooling was limited, we could only answer certain questions about human genetics. As I said in Research Vaccine Communication, the answers we find through science are limited by the questions we ask and the tools we have to ask those questions.

Kennedy : BioOptimist @BioOptimist

In research you only get answers to the questions you ask. If you ask the wrong question you cannot find the correct answer.

By the end of the 20th century, we began an ambitious project to sequence the entire human genome, and by 2003 not one but two groups successfully accomplished this task(2).

Knowing the human genome created a new frontier of medicine. Genetic diseases can be identified and understood with a fraction of the resources they used to require. But I think one of the mistakes we make is to only think about these discoveries in the context of human health and medicine.

The technologies developed to sequence the human genome enabled entirely new industries. It's easy to point to 23andMe as an example here, but the implications are still developing. Take the latest boom of mental health companies: companies like Prairie Health are using genetic data to help people find more effective psychiatric medications. A desire to learn about one's ancestry is not mere curiosity - for many, it's a connection to identity that is otherwise impossible.

We still needed materials with which to build.

Okay, so learning the blueprint made a huge impact. But it's still just the first step of that central dogma of DNA → RNA → protein(3). Let's take a look at that final piece, the proteins.
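For readers who think in code, here is a minimal sketch of that DNA → RNA → protein flow. It is a toy illustration, not how any real bioinformatics tool works: the codon table is a tiny subset of the standard genetic code, and the function names and the example sequence are made up for this post.

```python
# Toy subset of the standard genetic code (RNA codon -> one-letter amino acid).
# A real table has 64 codons; this one only covers the example below.
CODON_TABLE = {
    "AUG": "M",                          # methionine, also the start codon
    "GCU": "A", "GCC": "A",              # alanine
    "UGG": "W",                          # tryptophan
    "AAA": "K",                          # lysine
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}


def transcribe(dna_coding_strand: str) -> str:
    """DNA -> RNA: the mRNA matches the coding strand with T replaced by U."""
    return dna_coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """RNA -> protein: read three bases at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        # "?" marks a codon that is missing from this toy table.
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "?")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCTGGAAATAA"      # hypothetical example sequence
mrna = transcribe(dna)       # "AUGGCCUGGAAAUAA"
protein = translate(mrna)    # "MAWK"
print(mrna, protein)
```

The output of the last step is only a string of amino acid letters. Nothing in this code says anything about the 3D shape that chain folds into, which is exactly the hard part.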

https://media.giphy.com/media/T9uxGBHZDsR2rxh7zN/giphy.gif

Microtubules (the thing that looks like a sidewalk above) are proteins that form part of the structural scaffolds within cells. Kinesin (the thing that looks like it's strolling down the sidewalk) is another protein that moves items around the cell by 'walking' on the microtubules. This isn't a video - proteins are too small - but this is what it would actually look like if video were possible. Kinesin's got swagger.

Proteins are so fundamental to life, it's impossible to overstate their importance. Proteins are how we manipulate the atomic and macromolecular materials around us and turn them into living machines.

Each protein has its own 3D shape, which can be influenced by the molecules around it. This 3D shape is super important...it's sort of like assembling something from IKEA. The pieces have to be exactly the right shape and be capable of moving in exactly the right way for the whole thing to work.

Protein structure is hard to study empirically

The study of protein shapes is notoriously difficult. Proteins are really small, so we can't just use a microscope to look at them. The tools we have been working with for decades were limited - you have to purify a whole bunch of the same protein, which is not only difficult, it also can mess up the shape of the protein you're trying to understand(4). And the process is so labor-intensive, scientists spend years trying to understand the shape of a single protein.

*This is the X-ray crystallography structure of myoglobin from a sperm whale - the first protein structure to be resolved by X-ray crystallography, in the late 1950s.*

Despite the limitations, the value of understanding the 3D structure is huge. 35% of FDA-approved pharmaceuticals on the market today target a family of proteins called GPCRs (G protein-coupled receptors). This was partly enabled by the work led by Brian Kobilka and Robert Lefkowitz. They were able to resolve the structure of a type of GPCR called β2AR (short for the beta-2 adrenergic receptor), and understanding the structure of β2AR created insights across the entire GPCR family. It makes sense that they shared the 2012 Nobel Prize in Chemistry for this discovery.

Enter, predictive modeling of protein structures

The idea of computer prediction of protein structures is not new. I used a basic version during my own PhD purely for aesthetic purposes - I thought it was a nice addition to my presentations so the non-molecular biologists in the room had something tangible to grasp onto. But good enough for graphics is not good enough for designing and engineering biology.

This isn't the same graphic I used, but this is the protein I studied in my dissertation! Say hello to MC4R - short for the melanocortin-4 receptor 🙂

Some very creative solutions came forward - my personal favorite is the Folding@home project first launched by Vijay Pande in 2000. Folding@home leveraged the distributed computing power of volunteers' personal computers, and simulated protein folding and dynamics. The project continues through today and is a beautiful harnessing of citizen science and distributed computational power to inch us closer to a complete understanding.

In parallel, the field of AI was maturing to require less and less structure in training data but also to report more and more in terms of assumptions and uncertainty. In 2014 Google (now part of Alphabet) acquired an AI startup called DeepMind, which reportedly had also been in talks with Facebook. In December of 2020, DeepMind's AlphaFold made waves in the industry when it was recognized by the biennial Critical Assessment of protein Structure Prediction(5).

This was a landmark recognition, but a mere prelude of what was to come.

14 Species, 350,000 proteins, infinite potential

Coming back to last week's news, DeepMind has predicted the 3D structure of the known proteome for 14 species, complete with estimates about how accurate the predicted structure is. They've released the code base and, for now, access to the information is free.
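To make those accuracy estimates concrete, here is a hedged sketch of how you might read them. The AlphaFold database publishes a per-residue confidence score called pLDDT (0 to 100) and stores it in the B-factor column of its PDB-format files. The helper names and the three sample residues below are invented for illustration; this is not DeepMind's code, and a real workflow would more likely lean on an established structure-parsing library.

```python
from typing import List, Tuple


def make_sample_ca_record(serial: int, res_name: str, res_seq: int, plddt: float) -> str:
    """Build a fake fixed-width PDB ATOM record for an alpha carbon (coordinates zeroed)."""
    return (f"ATOM  {serial:>5}  CA  {res_name:>3} A{res_seq:>4}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}")


def read_confidence(pdb_text: str) -> List[Tuple[int, str, float]]:
    """Return (residue number, residue name, pLDDT) for each alpha carbon.

    PDB is a fixed-column format: atom name in columns 13-16, residue name
    in 18-20, residue number in 23-26, and B-factor in 61-66.
    """
    scores = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append((int(line[22:26]), line[17:20].strip(), float(line[60:66])))
    return scores


def confidence_band(plddt: float) -> str:
    """Bucket a score using the bands shown on the AlphaFold database pages."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"


# Made-up sample data: three residues with different confidence levels.
sample = "\n".join([
    make_sample_ca_record(1, "MET", 1, 92.5),
    make_sample_ca_record(2, "ALA", 2, 71.3),
    make_sample_ca_record(3, "LYS", 3, 38.0),
])

for res_seq, res_name, plddt in read_confidence(sample):
    print(res_seq, res_name, plddt, confidence_band(plddt))
```

In practice you would pass the text of a downloaded structure file to `read_confidence` instead of the made-up sample. Low-scoring stretches are the parts of a prediction to treat with the most caution.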

From MIT Technology Review to the New York Times, there was great coverage of this announcement last week. Almost universally, the coverage focused on how this would help us "to understand diseases better and develop new drugs", with the release "promising a boon for medicine and drug design".

This is true - existing tools make it extremely difficult to know the impact of protein changes. Often, changes in proteins break the system and the results lack nuance. It's like changing the piston in your motor - maybe it works better, maybe worse, but most often the motor just fails to work at all. In these situations, you have no understanding of why or how the failure occurred.

Here's the real kicker: human biology is a niche science compared to what AlphaFold unlocked.

https://media.giphy.com/media/xT0xeJpnrWC4XWblEk/giphy.gif

More than the 350,000 proteins evaluated so far, AlphaFold can predict the structure of an infinite number of altered and newly designed proteins.

Now we can design and build more efficient pathways, we can recombine existing proteins in new ways, and we can design new proteins to perform the functions lacking in nature. Most importantly, this just became fast and cheap to do.

What used to take years and millions of dollars in funding can now be done in months. By using AlphaFold to design and predict proteins, we can design and predict pathways, requiring wet-lab experiments only to validate the most promising possibilities.

What this means for bio-as-tech

Let's start with DeepMind's decision to release the information publicly without monetization - YET. It's helpful to understand that the IP potential in naturally occurring proteomes is virtually non-existent(6). As a result, releasing the known, naturally occurring proteome is the best move to position themselves as the experts in the field, and maximizes the benefits possible from their work.

That they aren't charging for access today does not rule out monetization in the future, which I think we will eventually see. Truthfully, I don't think that today anybody knows how to develop a rational monetization strategy for access to this particular information. The smartest thing to do is likely to see how people use the information for a while, then figure out what pricing structure fits the behavior observed.

We are nearly ready to go full hockey stick

Essential elements for highly innovative industries include low capital barriers and minimal infrastructure requirements to test and validate/fail quickly. Companies like Ginkgo Bioworks, Emerald Cloud Lab, and Science Exchange have already developed the platform infrastructures to enable preliminary wet lab work without dedicated physical infrastructure. This dramatically decreased the cost and time required for early experimental work. But until now, the sheer volume of experiments and the difficulty of getting interpretable results still presented a massive barrier to innovation.

AlphaFold fills this gap. Now it is possible to perform high-fidelity in silico evaluations of proteins, enabling entirely new systems and pathways to be designed and considered. I think there is still room for another product to sit as a layer on top of AlphaFold and use its structural predictions to predict how proteins interact with each other.

When all of these pieces come together, biology will achieve the accessibility of other material technologies, where we can predict, design, and largely validate with minimal capital and infrastructure requirements. This is when we begin to see the full potential of bio-as-tech and realize the potential to replace much of our existing materials technologies with smarter, more sustainable, bio-materials.

Biology is the future of technology.

(1) Prior to this, there was a lot of speculation that proteins were the more likely candidate. The logic underlying this misunderstanding was pretty reasonable: there is great diversity between individuals and species, and there was more diversity in the components of proteins (amino acids, of which there are a couple dozen across biological lifeforms) than in the components of DNA (nucleic acids, of which there are 5 in most species).
(2) This story is actually quite spicy! A disagreement in methodology, a scientist leaving the NIH project to prove that they can do it faster and cheaper, fueling the NIH team to move faster. The whole thing culminated with both teams publishing at the same time, and pulling off the rare miracle of the projects coming in ahead of time and under budget.
(3) Biology students are taught a central dogma that DNA → RNA → Protein. What does DNA → RNA → Protein mean? It means that, while DNA is how we inherit traits, those traits are actually expressed through the proteins our DNA encodes. And in order for those proteins to be made, the information in the DNA has to be passed through RNA first. If this sounds complicated, you're right. The general premise is pretty straightforward, but a lot can happen along the way.
(4) There is an entire category of human diseases that are caused by aggregated clumps of proteins that are mis-formed, including Alzheimer's and Parkinson's.
(5) I really want to ignore how bad the name "Critical Assessment of protein Structure Prediction" is...but seriously?! I swear some people want their work to be forgotten. Nevertheless, while they shouldn't be entrusted with naming things, they ARE the right panel of experts to recognize the work of DeepMind's AlphaFold. This is a rare case where I recommend the company's own press release.
(6) Yes, patents for naturally occurring proteins exist and have been approved in the past, but court rulings for DNA sequences long since set the precedent that naturally occurring DNA sequences cannot be patented. The rationale is that it is an invention of nature, and humans are simply discovering rather than inventing these sequences. Subsequent rulings have upheld this rationale to apply to proteins as well.

BioOptimist