Exploring the AlphaFold Model (Part 1)

William McNamara • February 3, 2024

If you've been following the computational biology space like I have, you've probably heard about AlphaFold and the protein folding problem. As a data scientist who's been fascinated by this intersection of deep learning and molecular biology, I wanted to explore whether I could build a simpler but effective protein structure prediction model on my own. In this post, I'll walk you through my journey of creating an attention-based model that can predict protein structures from amino acid sequences.

The Protein Folding Challenge: Why It Matters

Proteins are the workhorses of life, responsible for virtually every biological process in our cells. Each protein has a unique 3D structure that determines its function, but experimentally determining these structures is painfully slow and expensive—sometimes taking years and millions of dollars.

If we look at a protein closely, it's essentially a string of amino acids (like beads on a necklace) that folds into a complex three-dimensional shape. While we know the sequences of billions of proteins, we've mapped far fewer of their structures. This gap represents one of the grand challenges in computational biology. That's where computational approaches like AlphaFold come in—they can predict a protein's structure from its amino acid sequence alone, in minutes rather than years.

AlphaFold

Getting AlphaFold running on Google Cloud presented some unique challenges. The original system was designed for powerful research clusters, not the more constrained environment of a Colab notebook. I had to make some careful adaptations to work within these limitations. The trickiest part was getting the OpenMM physics engine to work correctly, but once I figured that out, the rest fell into place.

The real magic of AlphaFold comes from how it leverages evolutionary information. When a protein evolves, certain positions remain conserved if they're critical for structure or function, and positions that touch in the folded structure tend to change together. By analyzing many related protein sequences, AlphaFold can infer which positions are likely to be close to each other in 3D space. To gather this evolutionary information, we need to search large sequence databases. In my cloud implementation, I use three key databases: the 'UniProt Reference Clusters' (UniRef) and 'MGnify' databases provided by the European Bioinformatics Institute, and the 'Big Fantastic Database' (BFD) created by Martin Steinegger and Johannes Söding.

For each protein sequence we want to analyze, we use a tool called Jackhmmer to search these databases for homologous sequences. This creates what's called a Multiple Sequence Alignment (MSA), which is essentially a matrix showing how amino acids vary at each position across evolutionarily related proteins. One fascinating thing about this process is watching how the search finds distant evolutionary relatives. For some well-studied proteins, I found thousands of related sequences; for others that are more unusual, I found only a handful. This directly impacts prediction accuracy, as more related sequences generally mean better predictions.
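To make the "matrix" idea concrete, here is a minimal sketch that treats a toy alignment as rows of equal length and scores each column by its Shannon entropy, where low entropy means a conserved position. The sequences and the `column_entropy` helper are made up for illustration; this is not how Jackhmmer or AlphaFold handle alignments internally.

```python
import math
from collections import Counter

def column_entropy(msa):
    """Return the Shannon entropy (in bits) of each alignment column."""
    if not msa:
        return []
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    entropies = []
    for col in range(length):
        # Count how often each symbol (residue or gap) appears in this column.
        counts = Counter(row[col] for row in msa)
        total = sum(counts.values())
        h = sum((n / total) * math.log2(total / n) for n in counts.values())
        entropies.append(h)
    return entropies

# Toy alignment (hypothetical sequences); '-' marks a gap.
toy_msa = [
    "MKHAE",
    "MKHGE",
    "MRHAE",
    "MKH-D",
]

if __name__ == "__main__":
    for pos, h in enumerate(column_entropy(toy_msa), start=1):
        print(f"position {pos}: entropy = {h:.2f} bits")
```

Columns where every sequence agrees score zero, while columns with many different residues score higher. A real MSA with thousands of rows works the same way, just with far more signal to draw on.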

With our MSAs in hand, we're ready to run AlphaFold itself. The model has several key components:

Evoformer blocks that process the MSA and extract evolutionary patterns
A structure module that converts this information into 3D coordinates
A confidence predictor that estimates how accurate each part of the prediction is

What's happening inside is super interesting: the neural network is essentially learning the physical rules that govern protein folding, without being explicitly programmed with those rules. The output includes not just the 3D coordinates, but also confidence scores that tell us which parts of the prediction we can trust. The main confidence metric is pLDDT (predicted Local Distance Difference Test), which ranges from 0 to 100. Coloring the structure by pLDDT gives a visual sense of which regions are likely correct and which should be taken with a grain of salt.
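As a rough guide to reading these scores, here is a small sketch that maps a pLDDT value to the coarse bands used by the AlphaFold Protein Structure Database (above 90 very high, 70 to 90 confident, 50 to 70 low, below 50 very low). The `plddt_band` function and the example scores are my own illustrative assumptions, not part of AlphaFold's code.

```python
def plddt_band(score):
    """Map a pLDDT score (0-100) to a coarse confidence label."""
    if not 0 <= score <= 100:
        raise ValueError("pLDDT must be between 0 and 100")
    if score > 90:
        return "very high"
    if score > 70:
        return "confident"
    if score > 50:
        return "low"
    return "very low"

# Hypothetical per-residue scores, made up for illustration.
example_scores = [95.2, 88.0, 71.5, 64.3, 42.0]

if __name__ == "__main__":
    for i, s in enumerate(example_scores, start=1):
        print(f"residue {i}: pLDDT {s:5.1f} -> {plddt_band(s)}")
```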

Predicting My First Structure

For my first prediction, I decided to start with a fairly simple protein: a small zinc finger protein with just 73 amino acids:

MAAHKGAEHHHKAAEHHEQAAKHHHAAAEHHEKGEHEQAAHHADTAYAHHKHAEEHAAQAAKHDAEHHAPKPH
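Before submitting a sequence, it's worth a quick sanity check that it only contains the 20 standard amino acid letters. The sketch below does that and formats the result as a FASTA record. The helper names and the record name `query` are assumptions of mine for illustration, not something the AlphaFold notebook requires.

```python
from collections import Counter

STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def check_sequence(seq):
    """Return the cleaned sequence, or raise if it has non-standard letters."""
    cleaned = "".join(seq.split()).upper()
    bad = sorted(set(cleaned) - STANDARD_AMINO_ACIDS)
    if bad:
        raise ValueError(f"non-standard residues: {bad}")
    return cleaned

def to_fasta(name, seq, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"

# The sequence from this post, copied as-is.
query = check_sequence(
    "MAAHKGAEHHHKAAEHHEQAAKHHHAAAEHHEKGEHEQAAHHADTAYAHHKHAEEHAAQAAKHDAEHHAPKPH"
)

if __name__ == "__main__":
    print(f"length: {len(query)}")
    print("most common residues:", Counter(query).most_common(3))
    print(to_fasta("query", query), end="")
```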

After entering this sequence and hitting "Run," I watched AlphaFold go through its paces:

First, it searched the sequence databases, finding about 200 related sequences
Then it processed these sequences through the neural network
Finally, it produced a relaxed 3D model of the protein

The moment of truth came when I visualized the result. As I would expect from a zinc finger protein, it folded into a compact structure with a clear alpha-helical pattern. The confidence scores were mostly in the high range (70-90), suggesting this was likely a reliable prediction. What's remarkable is that this entire process took just a few minutes on a standard Colab GPU. A decade ago, this level of accuracy would have been considered impossible, and even five years ago it would have required massive computational resources.

Understanding the Predicted Structure

One of my favorite aspects of working with AlphaFold is the rich set of visualization and analysis tools. After generating a prediction, I can examine it from multiple angles. With a pLDDT plot, I can graph the per-residue confidence scores to help identify flexible regions. For my zinc finger protein, the pLDDT plot showed high confidence in the core helical regions, with slightly lower confidence at the termini. This pattern makes biological sense to me, as protein ends are often more flexible and thus adopt more conformations.

Closing Thoughts

I'm very excited to continue experimenting with AlphaFold, and even more excited to see how these models are continually refined and applied to advanced problems in computational biology. Especially at a time when the morality of many machine learning applications could generously be described as questionable, it's nice to work with a model that will only lead to improved understanding of science and biology! Complete code can be found on my GitHub here.
