Dimensionality Expansion for Genome Sequencing

William McNamara • August 1, 2023

Joining my dimensionality expansion methods to my bioinformatics research

As data scientists, we've all hit that frustrating wall when working with language models or sequence analysis—the dreaded sequence length limitation. I've spent countless hours optimizing attention mechanisms and tweaking model architectures, only to gain marginal improvements in handling longer texts or genomic sequences. But what if we've been tackling this problem from the wrong angle all along?


Breaking Free from One-Dimensional Thinking


Recently, I've been exploring an alternative approach that has transformed how I work with large sequences. Rather than continually optimizing model architectures (which seems to be where most research focuses), I started questioning our fundamental representation of sequential data.


Think about it: why do we insist on representing sequences as one-dimensional arrays? In the physical world, we don't store large amounts of text in a single, unbroken line. Instead, we arrange words into lines on a page, and bind pages into books. This dimensional organization isn't just convenient; it's efficient.


This realization led me to a simple but powerful insight: what if we increased the dimensionality of our sequence representation?


From Lines to Planes to Volumes


Let me walk you through the conceptual progression: A single sentence is essentially a one-dimensional representation—a linear sequence of tokens. A page of text is two-dimensional, organizing that sequence into rows and columns. A book takes this further into three dimensions, stacking those pages into a compact volume.

With each increase in dimensionality, we gain a more efficient way to represent and transport the same information. This isn't just a physical convenience—it suggests a fundamentally different way to encode data for our models.


The Perfect Testing Ground: Biological Sequences

While working on a genomics project, I realized biological sequences offer the perfect testing ground for this approach for two key reasons:


  1. We have access to vast amounts of genomic data
  2. Biological sequences (particularly DNA and RNA) have a remarkably limited alphabet


The four-letter alphabet of DNA (A, T, G, C) is particularly appealing from an encoding perspective. We can easily one-hot encode each nucleotide, but the magic happens in how we arrange these encodings.
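As a minimal sketch with NumPy (the channel ordering A, T, G, C is an arbitrary choice of mine, and the handling of ambiguous bases is an assumption):

```python
import numpy as np

# Map each nucleotide to a channel index; the ordering is an arbitrary choice.
NUCLEOTIDES = {"A": 0, "T": 1, "G": 2, "C": 3}

def one_hot_encode(sequence: str) -> np.ndarray:
    """Return a (len(sequence), 4) array with one channel per nucleotide."""
    encoded = np.zeros((len(sequence), 4), dtype=np.float32)
    for i, base in enumerate(sequence.upper()):
        if base in NUCLEOTIDES:  # ambiguous bases such as 'N' stay all-zero
            encoded[i, NUCLEOTIDES[base]] = 1.0
    return encoded

print(one_hot_encode("ATGC"))  # each row is a one-hot vector
```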


Reshaping Sequences into Images and Videos


Here's where the dimensional transformation comes into play. Instead of feeding a genomic sequence to a model as a 1D array, I reshape it into a 2D array—essentially creating an "image" where:


  • The (x,y) position encodes the sequence order
  • The channel dimension encodes the specific nucleotide at that position


Taking this further, we can reshape into 3D structures—creating something akin to a short video where the (x,y,t) coordinates represent position in the sequence.
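Because NumPy's default row-major ordering preserves sequence order as it fills rows (and then planes), both transformations reduce to plain `reshape` calls. This is a sketch under that assumption; the function names and shapes are illustrative:

```python
import numpy as np

def to_image(encoded: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Reshape a (length, 4) one-hot sequence into a (rows, cols, 4) 'image'.

    Assumes length == rows * cols; row-major order keeps sequence neighbors
    adjacent within each row.
    """
    return encoded.reshape(rows, cols, 4)

def to_volume(encoded: np.ndarray, frames: int, rows: int, cols: int) -> np.ndarray:
    """Reshape a (length, 4) one-hot sequence into a (frames, rows, cols, 4) 'video'."""
    return encoded.reshape(frames, rows, cols, 4)
```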


Putting Theory Into Practice: The HIV Genome

To test this approach, I downloaded HIV genome sequences from the NCBI database. The average HIV genome is around 9,000 bases long, which fits neatly into a 24×24×16 array of 9,216 positions (plus our channel dimension for the nucleotide encoding).

Since the final array has more elements than the typical genome, I applied a small amount of padding. With this transformation complete, I could now leverage architectures designed for image processing!
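The padding-plus-reshape step can be sketched as follows, zero-padding the one-hot sequence out to 9,216 positions before folding it into the volume (the helper name and error handling are mine):

```python
import numpy as np

def pad_and_reshape(encoded: np.ndarray, shape=(24, 24, 16)) -> np.ndarray:
    """Zero-pad a (length, 4) one-hot sequence and reshape it to shape + (4,)."""
    target = int(np.prod(shape))            # 24 * 24 * 16 = 9216 positions
    if encoded.shape[0] > target:
        raise ValueError("sequence longer than target volume")
    pad = target - encoded.shape[0]         # roughly 216 positions for a 9 kb genome
    padded = np.pad(encoded, ((0, pad), (0, 0)))
    return padded.reshape(*shape, 4)

genome = np.zeros((9000, 4), dtype=np.float32)  # stand-in for a real encoded genome
print(pad_and_reshape(genome).shape)            # (24, 24, 16, 4)
```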

I trained a variational convolutional autoencoder on this reshaped data, which not only performed dimensionality reduction but also naturally clustered similar sequences together—all without requiring labeled data.
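The post doesn't spell out the exact architecture, so here is a minimal sketch of what a variational convolutional autoencoder over these (4, 24, 24, 16) volumes might look like in PyTorch; the layer sizes and latent dimension are my own illustrative choices, not the trained model:

```python
import torch
import torch.nn as nn

class ConvVAE(nn.Module):
    """Sketch of a 3D convolutional VAE for genomes reshaped to (4, 24, 24, 16)."""
    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(4, 16, 3, stride=2, padding=1),   # -> (16, 12, 12, 8)
            nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1),  # -> (32, 6, 6, 4)
            nn.ReLU(),
            nn.Flatten(),
        )
        flat = 32 * 6 * 6 * 4
        self.fc_mu = nn.Linear(flat, latent_dim)        # latent mean
        self.fc_logvar = nn.Linear(flat, latent_dim)    # latent log-variance
        self.fc_dec = nn.Linear(latent_dim, flat)
        self.decoder = nn.Sequential(
            nn.Unflatten(1, (32, 6, 6, 4)),
            nn.ConvTranspose3d(32, 16, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose3d(16, 4, 3, stride=2, padding=1, output_padding=1),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.decoder(self.fc_dec(z)), mu, logvar
```

Training would minimize a reconstruction loss plus the usual KL term; the latent means `mu` are what get clustered.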

The most exciting part? The generative component of the model successfully reconstructed sequences, allowing us to compare them against fragments and evaluate reconstruction quality.
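One simple way to score reconstructions (my own sketch, not necessarily the metric used here) is to collapse the reconstructed channels back to bases with an argmax and measure per-position identity:

```python
import numpy as np

BASES = np.array(list("ATGC"))  # must match the one-hot channel ordering

def decode(volume: np.ndarray) -> str:
    """Collapse a (..., 4) reconstruction back to a base string via argmax."""
    flat = volume.reshape(-1, 4)
    return "".join(BASES[flat.argmax(axis=1)])

def identity(original: str, reconstructed: str) -> float:
    """Fraction of aligned positions where the reconstruction matches."""
    n = min(len(original), len(reconstructed))
    return sum(a == b for a, b in zip(original[:n], reconstructed[:n])) / n
```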

Expanding to Dengue and SARS-CoV-2


Encouraged by these results, I applied the same approach to Dengue virus genomes and observed similar clustering by sequence similarity.

The flexibility offered by expanding the number of dimensions also yields a quick, simple way to analyze large sequence collections even when the dataset does not fit into memory: the transformation itself is cheap, so genomic sequences can be loaded and fed to the network in batches. The current pandemic virus provided exactly that scenario.

The real test came when working with SARS-CoV-2 data. Given the volume of sequences, I implemented a batch processing approach where genomes were loaded, transformed, and fed to a convolutional autoencoder on the fly.
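An on-the-fly pipeline along these lines can be sketched with a plain Python generator; the `encode` helper and shapes are illustrative, and in practice the sequences would come from a lazy FASTA parser such as Biopython's SeqIO rather than an in-memory list:

```python
import numpy as np

NUCLEOTIDES = {"A": 0, "T": 1, "G": 2, "C": 3}

def encode(seq: str, target: int) -> np.ndarray:
    """One-hot encode, truncating or zero-padding to `target` positions."""
    enc = np.zeros((target, 4), dtype=np.float32)
    for i, base in enumerate(seq[:target].upper()):
        if base in NUCLEOTIDES:
            enc[i, NUCLEOTIDES[base]] = 1.0
    return enc

def genome_batches(sequences, batch_size=32, shape=(24, 24, 16)):
    """Yield (batch, *shape, 4) arrays from any iterable of sequences.

    Because `sequences` can be a lazy iterable, the full dataset never
    needs to sit in memory at once.
    """
    target = int(np.prod(shape))
    batch = []
    for seq in sequences:
        batch.append(encode(seq, target).reshape(*shape, 4))
        if len(batch) == batch_size:
            yield np.stack(batch)
            batch = []
    if batch:                      # emit the final partial batch
        yield np.stack(batch)
```

Each yielded batch can be transformed and passed straight to the autoencoder, then discarded.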

This not only clustered the sequences effectively but revealed something fascinating—distinct mutation patterns specific to each cluster. When visualizing the changes across the dimension that sorts the sequences, these patterns became clearly visible, suggesting specific evolutionary pathways the virus follows to adapt to environmental pressures.

Why This Matters for Data Scientists


This dimensional transformation approach offers several practical advantages for anyone working with large sequential data. Perhaps most importantly, it demonstrates how rethinking data representation can sometimes be more productive than endlessly tweaking model architectures.


While I've focused on genomic applications, this approach has potential anywhere we deal with long sequences, including protein structures, which I'm excited to explore in the near future. The key insight is that dimensionality can be a feature, not a limitation. By reshaping our data thoughtfully, we open up new architectural possibilities and processing efficiencies.


More broadly, I believe we've only scratched the surface of what's possible with dimensional transformations of sequential data. As we continue to face challenges with increasingly large sequences, rethinking our fundamental data structures may prove more fruitful than continually scaling model architectures.
