Exploring Alphafold Model (Part 2)

William McNamara • October 15, 2024

I've written previously about my enthusiasm for implementing and using DeepMind's AlphaFold model for protein folding. AlphaFold2 achieved what many thought impossible: predicting protein structures with near-experimental accuracy from sequence alone. But as an independent researcher without access to massive computational resources, I needed a more accessible way to harness this technology. Fortunately, I discovered ColabFold, a brilliant adaptation that makes cutting-edge protein structure prediction easy for just about anyone with an internet connection.

ColabFold

My first experience with ColabFold was one of the easiest experiences I've ever had with coding. I navigated to the notebook, pasted in the protein sequence I'd worked with previously, clicked "Run all," and within minutes, I was looking at a detailed 3D model that would have taken years to determine experimentally. While the interface is deceptively simple, under the hood incredibly sophisticated analysis is happening.

Preparing My Experiment

For my first prediction, I decided to use a relatively small protein with 59 amino acids:

PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASK
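Before pasting a sequence into the notebook, I like to run a quick sanity check. The sketch below is my own helper, not part of ColabFold: the name `check_sequence` is made up for illustration, and it assumes only the 20 standard one-letter amino acid codes are allowed. It strips whitespace, rejects unexpected characters, and reports the residue count.

```python
# The 20 standard amino acids in one-letter code (assumption: no ambiguity codes).
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def check_sequence(raw):
    """Return a cleaned, upper-case sequence or raise on unexpected letters."""
    seq = "".join(raw.split()).upper()
    unexpected = sorted(set(seq) - VALID_RESIDUES)
    if unexpected:
        raise ValueError(f"Unexpected characters in sequence: {unexpected}")
    return seq


seq = check_sequence("PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASK")
print(f"{len(seq)} residues, all standard amino acids")
```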

This is a bacterial transcription factor that I've been studying for a while, but whose structure remains experimentally undetermined. One of the aspects I appreciate about ColabFold is how it handles various options. I decided not to use AMBER relaxation or any templates initially—meaning the prediction would be based purely on the sequence and evolutionary information, without reference to any known structures.

Behind the scenes, ColabFold installs all the necessary dependencies, and there are quite a few. Considering I spent the better part of a week setting up those dependencies when I first started with AlphaFold, shielding users from that complexity is a pretty great feature of ColabFold.

Generating Multiple Sequence Alignments

This is where ColabFold truly shines compared to the original AlphaFold implementation. The standard AlphaFold pipeline uses jackhmmer and HHblits to search enormous databases like UniRef90, MGnify, and BFD, which can take hours or days and requires terabytes of storage.

ColabFold uses MMseqs2, a much faster homology search tool. For my protein, I chose "mmseqs2_uniref_env", which searches both UniRef and environmental sequence databases for homologs. Again, this was dramatically faster than my first implementation.

The evolutionary information captured in these Multiple Sequence Alignments (MSAs) is crucial for accurate prediction. As proteins evolve, certain positions change in coordinated ways to maintain structure and function. These coevolutionary patterns provide powerful clues about which amino acids are likely to be spatially close in the folded protein. When I ran this step, ColabFold found hundreds of related sequences for my protein, a good sign that the prediction would be reliable. I could see this visually in the MSA coverage plot, which shows how well the homologous sequences cover each position along the protein's length.
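To make the coevolution idea concrete, here is a toy illustration. AlphaFold learns these patterns implicitly inside its network, so this is not what ColabFold runs; mutual information between alignment columns is just a classic, much simpler stand-in for the same signal. The four-sequence alignment is invented for the example, and the function names are my own.

```python
import math
from collections import Counter

# Toy alignment (hypothetical, not real ColabFold output); row 0 is the query.
msa = [
    "ACDE",
    "ACDE",
    "GCKE",
    "GCK-",
]


def coverage(msa):
    """Fraction of sequences with a residue (not a gap) at each column."""
    n = len(msa)
    return [sum(row[i] != "-" for row in msa) / n for i in range(len(msa[0]))]


def mutual_information(msa, i, j):
    """Mutual information in bits between columns i and j, ignoring gapped rows."""
    pairs = [(row[i], row[j]) for row in msa if row[i] != "-" and row[j] != "-"]
    n = len(pairs)
    count_i = Counter(a for a, _ in pairs)
    count_j = Counter(b for _, b in pairs)
    count_ij = Counter(pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((count_i[a] / n) * (count_j[b] / n)))
        for (a, b), c in count_ij.items()
    )


print("coverage per column:", coverage(msa))
print("MI(col 0, col 2):", mutual_information(msa, 0, 2))  # columns change together
print("MI(col 0, col 1):", mutual_information(msa, 0, 1))  # column 1 never changes
```

Columns 0 and 2 always mutate together in the toy data, so they share information, while the fully conserved column 1 tells you nothing about its neighbors. Real pipelines need many more sequences and corrections for phylogenetic bias before this kind of signal is trustworthy.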

Running the Prediction

Time for the meat of the analysis. ColabFold feeds the MSA and configuration into AlphaFold's neural networks, which have been trained on the entire Protein Data Bank plus additional structures. For my protein, the process took only about 5 minutes on the Google Colab GPU. That is orders of magnitude faster than experimental methods, and several times faster than my previous AlphaFold implementation.

When the prediction finished, I was greeted with a stunning 3D visualization of my protein, colored by confidence. The predicted structure showed a compact globular fold with several alpha helices—typical for a DNA-binding protein.

In addition to the pretty visualization, ColabFold provides rich data to assess the reliability of the prediction. The model produces a per-residue confidence score called pLDDT (predicted Local Distance Difference Test), ranging from 0 to 100, with higher values indicating greater confidence. My protein showed high confidence (70-90) across most of its length, with slightly lower confidence at the termini. That is exactly what you'd expect, since protein ends are usually more flexible.

The visualization uses color coding to make this immediately apparent:

Blue regions (90-100): Very high confidence

Light blue (70-90): Confident

Yellow (50-70): Low confidence

Red (<50): Very low confidence
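If you want to apply the same bands to the raw scores yourself, a minimal sketch looks like this. The function name `plddt_band` is my own, and since the ranges above share their endpoints, I've assumed each band includes its lower bound (so a score of exactly 90 counts as very high).

```python
def plddt_band(score):
    """Map a pLDDT score (0-100) to the confidence band used in the color key."""
    if not 0 <= score <= 100:
        raise ValueError(f"pLDDT must be between 0 and 100, got {score}")
    if score >= 90:
        return "very high"  # blue
    if score >= 70:
        return "confident"  # light blue
    if score >= 50:
        return "low"        # yellow
    return "very low"       # red


# Hypothetical per-residue scores, just to show the mapping.
for score in (95.2, 81.0, 63.5, 42.1):
    print(score, "->", plddt_band(score))
```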

For multimeric proteins, ColabFold also provides Predicted Aligned Error (PAE) plots that show confidence in the relative positioning of residues—crucial for assessing interface quality in protein complexes.
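One common way to boil a PAE matrix down to a single interface number is to average only the entries that pair a residue in one chain with a residue in the other. The sketch below assumes a two-chain complex and a square PAE matrix given as a list of lists. The 4x4 matrix is invented, and `mean_interchain_pae` is my own helper rather than anything ColabFold exposes.

```python
def mean_interchain_pae(pae, len_chain_a):
    """Average PAE over residue pairs that span the two chains.

    pae is a square matrix (list of lists, in angstroms); the first
    len_chain_a rows and columns belong to chain A, the rest to chain B.
    """
    n = len(pae)
    values = [
        pae[i][j]
        for i in range(n)
        for j in range(n)
        if (i < len_chain_a) != (j < len_chain_a)  # exactly one index in chain A
    ]
    return sum(values) / len(values)


# Toy 4-residue complex (hypothetical numbers): 2 residues per chain.
toy_pae = [
    [0, 1, 8, 9],
    [1, 0, 7, 8],
    [9, 8, 0, 1],
    [8, 7, 1, 0],
]
print(mean_interchain_pae(toy_pae, len_chain_a=2))
```

In the toy matrix each chain is confident about itself (low values near the diagonal) but not about where the partner sits (high values in the off-diagonal blocks), which is the pattern that should make you skeptical of a predicted interface.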

Something that surprised me about ColabFold is that it didn't just give me one prediction; it provided five ranked models. I could examine each one by changing the rank number in the visualization cell. For my protein, the top-ranked model had the highest average pLDDT score, but the other models showed similar overall folds with minor variations in loop regions, consistent with what we know about protein dynamics.
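You can reproduce a ranking by average pLDDT from the output files with a few lines of Python. AlphaFold-style PDB files store pLDDT in the B-factor column, and I'm assuming here that the files you download follow that convention. The helper names and the two tiny "models" are invented for the example; real files would be read from disk.

```python
from statistics import mean


def ca_plddt(pdb_text):
    """Per-residue pLDDT, read from the B-factor column of C-alpha atoms."""
    scores = []
    for line in pdb_text.splitlines():
        # PDB fixed columns: atom name in 13-16, B-factor in 61-66 (1-indexed).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores


def rank_models(models):
    """Sort a {name: pdb_text} mapping by mean pLDDT, best first."""
    means = {name: mean(ca_plddt(text)) for name, text in models.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


def toy_ca_line(resseq, plddt):
    """Build a fixed-width C-alpha ATOM record (toy data for this example only)."""
    return (
        f"ATOM  {resseq:5d}  CA  ALA A{resseq:4d}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}"
    )


# Two hypothetical two-residue models.
models = {
    "model_a": "\n".join([toy_ca_line(1, 80.0), toy_ca_line(2, 70.0)]),
    "model_b": "\n".join([toy_ca_line(1, 92.0), toy_ca_line(2, 88.0)]),
}
for name, score in rank_models(models):
    print(f"{name}: mean pLDDT {score:.1f}")
```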

Final Thoughts

Exciting as all of this is, I had to ask myself whether this was categorically better than running AlphaFold itself. The more I learned, the more complicated that answer got. One limitation is that Google Colab assigns different GPUs with varying memory limits, so sometimes a long protein or complex will exceed available memory. Also, while MMseqs2 is faster, it sometimes finds fewer homologous sequences than the full AlphaFold pipeline, which searches additional databases, and that could affect prediction accuracy.

It seems like ColabFold sacrifices a little bit of accuracy for a lot of speed, ease, and meta-analysis. Which is great! But if I'm entering into a competition or writing a research paper, I'd probably still use the full AlphaFold pipeline locally to get the best results.

But if, like me, you're exploring protein folding as a passion project, or if you're working with proteins in any capacity as a researcher, student, or educator, I highly recommend giving ColabFold a try. The barrier to entry is minimal, and the potential insights are enormous. As the authors of ColabFold eloquently put it, their goal is "making protein folding accessible to all." In my experience, they've succeeded brilliantly.

As always my code can be found on my GitHub here.

