Methods for Genetic Classification of Covid (Part 3)

William McNamara • June 2, 2022

In this series of posts, I have used sampling schemes to preprocess large biological sequences, in particular SARS-CoV-2 sequences, mainly because they are readily available from the NCBI SARS-CoV-2 resources site.


I used two representation schemes: a frequency-based scheme and a graph-based scheme. In the frequency-based scheme, the sequences are divided into overlapping fragments, and the frequency of those fragments is used to find a low-dimensional representation of each sequence. Because the overlapping fragments resemble the construction of a de Bruijn graph, I extended the idea by using different graph construction schemes.
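
As a rough illustration of the two encodings, here is a minimal Python sketch. The function names `kmer_frequencies` and `de_bruijn_edges`, the default fragment length, and the choice to ignore fragments with ambiguous bases are my own assumptions for this example, not the exact code used in the earlier posts.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=4, alphabet="ACGT"):
    """Relative frequency of every overlapping fragment of length k.

    Fragments containing characters outside `alphabet` (for example N)
    are ignored, which is an assumption made for this sketch.
    """
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return {m: 0.0 for m in kmers}
    return {m: counts[m] / total for m in kmers}


def de_bruijn_edges(sequence, k=4):
    """Weighted edges of a de Bruijn-style graph.

    Each fragment of length k becomes an edge from its first k-1 bases
    to its last k-1 bases; the weight is how often the fragment occurs.
    """
    sequence = sequence.upper()
    edges = Counter()
    for i in range(len(sequence) - k + 1):
        fragment = sequence[i:i + k]
        edges[(fragment[:-1], fragment[1:])] += 1
    return edges
```

With `k=4` the frequency encoding is a fixed vector of 256 values no matter how long the genome is, which is what makes it cheap to feed into PCA or a VAE.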

Both schemes create a small representation of the sequence, but at the current stage it is not possible to recreate the original sequence from it. However, it is possible to get a general overview of the sequence with scarce computational resources.

Applying PCA or a variational autoencoder (VAE) to those representation schemes results in a series of clusters with a strong temporal component.


(From this point on, and in the following posts, I will refer to either the frequency-based or the graph-based sequence representation as a sequence encoding. The learned representation will refer to the bottleneck of the VAE or other network, and composition will refer to the frequency-based representation of the sequence. I make this distinction because the single-element frequencies match the nucleotide content of the sequence, so those values have a well-defined physical meaning, while the meaning of the remaining values is less clear.)


Thus the SARS-CoV-2 sequences appear to contain some sort of seasonal clock. This clock could be a side effect of sampling bias, since the number of isolates sequenced is about 10 to 20 times higher in the second year of the pandemic. However, removing that bias by subsampling the sequences showed similar results: representations with a strong temporal component.
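
One simple way to reduce that kind of sampling bias is to cap the number of isolates kept per time period. The sketch below assumes each record is a `(period, item)` pair, for example a `"YYYY-MM"` string and a sequence identifier. The function name `subsample_by_period` and the record layout are hypothetical, and this is not necessarily the subsampling procedure I used.

```python
import random
from collections import defaultdict


def subsample_by_period(records, per_period, seed=0):
    """Keep at most `per_period` randomly chosen items from each period.

    `records` is an iterable of (period, item) pairs. Periods with fewer
    items than the cap are kept whole.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    buckets = defaultdict(list)
    for period, item in records:
        buckets[period].append(item)

    sample = []
    for period in sorted(buckets):
        items = buckets[period]
        n = min(per_period, len(items))
        sample.extend((period, item) for item in rng.sample(items, n))
    return sample
```

After subsampling, a month from the second year of the pandemic contributes no more sequences than a month from the first, so any temporal structure that remains is less likely to come from uneven sequencing effort.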

A VAE consists of an encoder network and a decoder network. The encoder yields the learned representation, while the decoder returns an approximation of the original data point. The decoder also works as a generative model and offers a way to approximate changes in the input. Thus the changes or properties that produce the temporal component can be traced back by analyzing selected points in the learned representation rather than the whole dataset. Specific patterns can be obtained by analyzing the characteristics of a VAE latent walk.
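
A latent walk can be as simple as moving in a straight line between two points of the learned representation and decoding each step. The sketch below only builds the path. The function name `latent_walk` and the use of linear interpolation are assumptions for illustration, and the decoder is whatever trained model you already have.

```python
def latent_walk(z_start, z_end, steps=5):
    """Evenly spaced points on the straight line from z_start to z_end.

    Both endpoints are included, so `steps` must be at least 2.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    if len(z_start) != len(z_end):
        raise ValueError("latent points must have the same dimension")

    path = []
    for i in range(steps):
        t = i / (steps - 1)  # goes from 0.0 to 1.0
        path.append([a + t * (b - a) for a, b in zip(z_start, z_end)])
    return path
```

Given a hypothetical `decode` callable wrapping the trained decoder, `[decode(z) for z in latent_walk(z_early, z_late, steps=10)]` would give a sequence of approximate encodings. Comparing consecutive outputs shows which fragment frequencies change most along the walk.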

The clock inside the sequences is encoded by changes in the frequency of different 4-base fragments in the SARS-CoV-2 genome. The temporal information is also encoded mainly in the structural components of the genome. This does not mean that the other parts of the viral genome cannot change. Rather, those “constant” regions might follow another kind of pattern, or the sequence encoding may be unable to provide enough information to characterize them.

Plotting the frequency of those 4-base combinations through time results in a wave-like pattern.

However, when I use day duration (day length) as the measure of time instead of the isolation date, this wave-like behavior disappears.

The use of day duration as a measure of time was the result of several attempts to merge environmental information with the learned representations. Earlier attempts had shown agreement with environmental variables that follow a wave-like pattern.

Using day duration as a temporal scale, rather than the Julian day calendar, started to show some particularly useful characteristics. Most of the cases were confined to the extremes, near the minimum and maximum day duration at a particular location.

It also showed that the rate of change in day duration between consecutive days offers a way to approximate the start and the end of a COVID-19 wave at a particular location. This can be used to establish the relative transmission risk of COVID-19. It links an environmental change to viral transmissibility, much as abrupt changes in temperature are linked to the flu and other winter illnesses.
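
Day duration and its rate of change can be approximated from latitude and day of the year alone. The sketch below uses a common textbook approximation for solar declination and the sunrise equation. It ignores atmospheric refraction and the equation of time, so treat it as a rough estimate and not the exact values used in my analysis. The function names are my own.

```python
import math


def day_length_hours(latitude_deg, day_of_year):
    """Approximate hours of daylight at a latitude on a given day (1-365)."""
    # Approximate solar declination in radians (about +/-23.44 degrees).
    declination = math.radians(-23.44) * math.cos(
        2 * math.pi / 365 * (day_of_year + 10)
    )
    # Sunrise equation: cos(hour_angle) = -tan(latitude) * tan(declination).
    x = -math.tan(math.radians(latitude_deg)) * math.tan(declination)
    # Clamp so polar day and polar night give 24 and 0 hours.
    x = max(-1.0, min(1.0, x))
    return 24 / math.pi * math.acos(x)


def day_length_rate(latitude_deg, day_of_year):
    """Change in day length (hours) from this day to the next."""
    return (day_length_hours(latitude_deg, day_of_year + 1)
            - day_length_hours(latitude_deg, day_of_year))
```

The rate is close to zero around the solstices, where day duration is at its minimum or maximum, and largest in magnitude around the equinoxes, which makes it easy to separate the extremes from the transition periods at a given location.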


Why does the SARS-CoV-2 virus follow such a scale? That is a question to which I have no concrete answer. Nevertheless, the SARS-CoV-2 genome is similar in composition to a series of genes expressed through the action of the vitamin D receptor (VDR), and vitamin D is produced through exposure to solar radiation. It is also similar to a series of other genes with apparently little involvement with solar radiation. Still, temperature is correlated with both the learned representation and solar radiation. Day duration appears to work as a control variable, holding sequence composition constant, and day duration is itself correlated with solar radiation. And some genes similar to SARS-CoV-2 are regulated by solar radiation. Thus I think it is safe to assume that solar radiation has a role in the temporal adaptation of COVID-19. It might not be the complete picture, but it is an important part of it.

