<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:g-custom="http://base.google.com/cns/1.0" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
  <channel>
    <title>William's Blog</title>
    <link>https://www.williammcnamara3.com</link>
    <description>A collection of William's completed and ongoing projects.</description>
    <atom:link href="https://www.williammcnamara3.com/feed/rss2" type="application/rss+xml" rel="self" />
    <image>
      <title>William's Blog</title>
      <url>https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2023-07-31+at+11.02.34+PM.png</url>
      <link>https://www.williammcnamara3.com</link>
    </image>
    <item>
      <title>Building an Artemis II Tracker</title>
      <link>https://www.williammcnamara3.com/building-an-artemis-ii-tracker</link>
      <description />
      <content:encoded>&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           How I tracked humanity's return to the Moon with Python, Streamlit, and a handful of free APIs
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2026-04-15+at+11.43.25-AM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           When NASA announced that Artemis II would carry the first crewed lunar mission since Apollo 17 in 1972, I wanted to do more than just watch the launch. I wanted to build something — a live dashboard that would let me (and anyone else) follow the mission the way Mission Control does, with real numbers updating in real time. No accounts, no API keys, no paywalls. Just open data and a Python script.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The Idea: What Would Mission Control Actually Show?
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The first question I had to answer was: what data is even publicly available for a crewed NASA mission? The answer was more than I thought, but less than I hoped.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           On the "more than I thought" side: NASA's Jet Propulsion Laboratory runs a system called Horizons, which is essentially a live database of the positions and velocities of every tracked object in the solar system, including active spacecraft. So cool! Artemis II's Orion capsule has an official JPL target ID: -1024. That means I could query its exact position, distance from Earth, and velocity at any moment, for free, with no authentication required.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           On the "less than I hoped" side: crew biometrics, cabin pressure, CO₂ levels, and propellant remaining are all monitored internally by Mission Control in Houston, and none of it is released publicly during an active mission. Boo! I made peace with that early and decided to focus on what I could show, and show it beautifully.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The final feature list came together around four pillars: orbital tracking from JPL Horizons, space weather and radiation data from NOAA, live imagery from GOES-16 and NASA's Solar Dynamics Observatory, and a 3D WebGL scene to make it all feel like a real mission control room.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Setting Up the Project
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            The stack is deliberately minimal. The whole application is a single Python file,
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://github.com/wmac131/ArtemisII_Tracker/blob/main/app.py"&gt;&#xD;
      
           app.py
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           , with three dependencies: Streamlit for the web interface, Plotly for charts, and Requests for HTTP calls. That's it. No database, no background workers, no message queues.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            I also wrote a small
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://github.com/wmac131/ArtemisII_Tracker/blob/main/install.py"&gt;&#xD;
      
           installer script
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           to handle environment setup cleanly.
           &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Running
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;code&gt;&#xD;
      
           python3 install.py
          &#xD;
    &lt;/code&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            once creates a self-contained virtual environment and installs everything. After that, launching the dashboard is a single command. I wanted the setup experience to be frictionless for anyone who wanted to run it themselves.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The Heart of It: Querying JPL Horizons
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            The most technically interesting part of the project was getting live orbital data out of JPL Horizons. The system has a REST API, but its response format is designed for astronomers; it returns a structured text block with headers, separator lines, and a data section bracketed by
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;code&gt;&#xD;
      
           $$SOE
          &#xD;
    &lt;/code&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            and
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;code&gt;&#xD;
      
           $$EOE
          &#xD;
    &lt;/code&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            markers.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           I queried two objects simultaneously on every refresh: the Orion capsule (
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;code&gt;&#xD;
      
           -1024
          &#xD;
    &lt;/code&gt;&#xD;
    &lt;span&gt;&#xD;
      
           ) and the Moon (
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;code&gt;&#xD;
      
           301
          &#xD;
    &lt;/code&gt;&#xD;
    &lt;span&gt;&#xD;
      
           ). Getting the Moon's live position was important; the distance between Earth and the Moon varies by about 50,000 km over its orbit, and I wanted the "percentage of the way to the Moon" calculation to be accurate rather than using a fixed average.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
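The request itself is a single GET against the Horizons API endpoint. Here is a minimal sketch of the kind of query involved; the CENTER, STEP_SIZE, and other parameter values are illustrative assumptions, not necessarily what app.py actually sends:

```python
import requests

HORIZONS_URL = "https://ssd.jpl.nasa.gov/api/horizons.api"

def horizons_params(target_id: str) -> dict:
    """Build query parameters for a geocentric observer ephemeris (illustrative values)."""
    return {
        "format": "text",
        "COMMAND": f"'{target_id}'",
        "EPHEM_TYPE": "OBSERVER",
        "CENTER": "'500@399'",    # observe from Earth's center (assumed setting)
        "QUANTITIES": "'1,20'",   # RA/Dec plus range and range-rate
        "CSV_FORMAT": "'YES'",    # comma-separated rows make parsing easier
        "STEP_SIZE": "'1m'",
    }

def fetch_ephemeris(target_id: str) -> str:
    """Return the raw Horizons text response for one target, e.g. '-1024' or '301'."""
    resp = requests.get(HORIZONS_URL, params=horizons_params(target_id), timeout=10)
    resp.raise_for_status()
    return resp.text
```

No authentication, no key: the parameters above are the whole handshake.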
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            The quantities
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;code&gt;&#xD;
      
           1
          &#xD;
    &lt;/code&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            and
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;code&gt;&#xD;
      
           20
          &#xD;
    &lt;/code&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            give me the spacecraft's right ascension and declination (its angular position in the sky) plus its range from Earth in astronomical units and its range-rate — the rate at which that distance is changing, which is the radial velocity. From those four numbers I could derive everything else: distance in km and miles, velocity in km/s and mi/s, altitude above Earth's surface, and light travel time.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Parsing the response required a small trick. Horizons sometimes inserts a variable number of
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;code&gt;&#xD;
      
           ***
          &#xD;
    &lt;/code&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            separator lines between the column header and the data block, which breaks naive parsing. I solved it by walking backwards from the
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;code&gt;&#xD;
      
           $$SOE
          &#xD;
    &lt;/code&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            marker to find the first line with at least four commas — that's always the header
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      
           :
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
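In outline, the backwards walk looks like this (a sketch with illustrative column names, not the exact code in app.py):

```python
def parse_observer_table(text: str) -> dict:
    """Pull the first data row out of a Horizons CSV-format text response.

    Horizons brackets the ephemeris with $$SOE/$$EOE markers but pads the area
    above $$SOE with a variable number of '***' separator lines, so the real
    column header is found by walking backwards from $$SOE.
    """
    lines = text.splitlines()
    soe = lines.index("$$SOE")

    # The header is the nearest line above $$SOE with at least four commas.
    header = None
    for i in range(soe - 1, -1, -1):
        if lines[i].count(",") >= 4:
            header = [col.strip() for col in lines[i].split(",")]
            break
    if header is None:
        raise ValueError("no CSV header found above $$SOE")

    row = [val.strip() for val in lines[soe + 1].split(",")]
    return dict(zip(header, row))
```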
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           With the position and angles in hand, I could convert the spherical coordinates (RA, Dec, distance) into Cartesian XYZ coordinates in kilometers — which I'd need later to place the spacecraft correctly in the 3D scene.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
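That conversion is the standard spherical-to-Cartesian identity. A minimal version, assuming RA and Dec arrive in degrees and distance in kilometers:

```python
import math

def radec_to_xyz(ra_deg: float, dec_deg: float, dist_km: float) -> tuple:
    """Convert geocentric RA/Dec/range into Cartesian coordinates in km."""
    ra = math.radians(ra_deg)
    dec = math.radians(dec_deg)
    x = dist_km * math.cos(dec) * math.cos(ra)
    y = dist_km * math.cos(dec) * math.sin(ra)
    z = dist_km * math.sin(dec)
    return (x, y, z)
```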
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Derived Metrics: Making the Numbers Meaningful
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/h3&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Raw astronomical units and right ascension coordinates aren't what most people want to see on a dashboard. I built a layer of conversions to turn the Horizons output into human-readable metrics:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
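A sketch of that conversion layer, using standard constants; the field names here are mine, not necessarily the app's:

```python
AU_KM = 149_597_870.7      # one astronomical unit in kilometers
EARTH_RADIUS_KM = 6_371.0  # mean Earth radius
C_KM_S = 299_792.458       # speed of light
KM_TO_MI = 0.621371

def derived_metrics(range_au: float, range_rate_km_s: float) -> dict:
    """Turn Horizons range and range-rate into human-readable dashboard numbers."""
    dist_km = range_au * AU_KM
    return {
        "distance_km": dist_km,
        "distance_mi": dist_km * KM_TO_MI,
        "velocity_km_s": range_rate_km_s,
        "velocity_mi_s": range_rate_km_s * KM_TO_MI,
        "altitude_km": dist_km - EARTH_RADIUS_KM,   # above Earth's surface
        "light_time_s": dist_km / C_KM_S,           # one-way signal delay
    }
```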
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           I also built a mission phase detector that uses the spacecraft's distance relative to the Moon's current distance to determine where in the journey it is:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
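A simplified version of the idea, with thresholds that are illustrative rather than the app's actual cutoffs:

```python
def mission_phase(craft_dist_km: float, moon_dist_km: float) -> str:
    """Guess the mission phase from the craft's distance relative to the Moon's."""
    fraction = craft_dist_km / moon_dist_km
    if fraction > 1.15:
        return "Beyond the Moon"
    if fraction > 0.85:
        return "Lunar vicinity"
    if fraction > 0.05:
        return "Trans-lunar coast"
    return "Earth orbit / departure"
```

Because the Moon's live distance comes from the same Horizons query, the phase boundaries track the real geometry instead of a fixed average.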
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Space Weather: Protecting the Crew
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           One thing that makes a crewed lunar mission genuinely different from robotic missions is radiation. Beyond Earth's magnetosphere, the crew is exposed to solar energetic particles and galactic cosmic rays with no natural shielding. Space weather isn't just an interesting sidebar — it's a crew safety issue.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           NOAA's Space Weather Prediction Center publishes real-time data from the DSCOVR satellite (parked at the L1 Lagrange point, about 1.5 million km sunward of Earth) and the GOES geostationary satellites. All of it is free JSON, no API key required, updating every minute or two.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           I pulled five endpoints on every refresh:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;ul&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            Planetary Kp Index
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        &lt;span&gt;&#xD;
          
             — a measure of geomagnetic disturbance, 0–9 scale
            &#xD;
        &lt;/span&gt;&#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            Solar wind speed and density
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        &lt;span&gt;&#xD;
          
             — from DSCOVR's Faraday cup
            &#xD;
        &lt;/span&gt;&#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            Interplanetary Bz
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        &lt;span&gt;&#xD;
          
             — the north-south component of the solar magnetic field; a sustained negative Bz drives geomagnetic storms
            &#xD;
        &lt;/span&gt;&#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            X-ray flux
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        &lt;span&gt;&#xD;
          
             — used to classify solar flares from A through X class
            &#xD;
        &lt;/span&gt;&#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            Proton flux &amp;gt;10 MeV
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        &lt;span&gt;&#xD;
          
             — the NOAA S-scale radiation storm indicator
            &#xD;
        &lt;/span&gt;&#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
  &lt;/ul&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The flare classification logic mirrors the official NOAA scale:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
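The class thresholds are powers of ten of X-ray flux in watts per square meter, so the classifier is a short cascade:

```python
def classify_flare(xray_flux_w_m2: float) -> str:
    """Map GOES X-ray flux (W/m^2) onto the NOAA A/B/C/M/X flare letter scale."""
    if xray_flux_w_m2 >= 1e-4:
        return "X"
    if xray_flux_w_m2 >= 1e-5:
        return "M"
    if xray_flux_w_m2 >= 1e-6:
        return "C"
    if xray_flux_w_m2 >= 1e-7:
        return "B"
    return "A"
```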
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           And the S-scale radiation storm levels map onto proton flux thresholds from S1 (Minor, &amp;gt;10 pfu) through S5 (Extreme, &amp;gt;100,000 pfu), each rendered with an appropriately alarming color.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
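In code, that mapping is just a threshold table; the labels follow the NOAA S-scale, and the color rendering lives elsewhere:

```python
# NOAA S-scale thresholds, in proton flux units (pfu), highest first
S_SCALE = [
    (100_000.0, "S5 (Extreme)"),
    (10_000.0, "S4 (Severe)"),
    (1_000.0, "S3 (Strong)"),
    (100.0, "S2 (Moderate)"),
    (10.0, "S1 (Minor)"),
]

def radiation_storm_level(proton_flux_pfu: float) -> str:
    """Classify the GOES proton flux above 10 MeV onto the NOAA S-scale."""
    for threshold, label in S_SCALE:
        if proton_flux_pfu > threshold:
            return label
    return "None"
```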
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Live Imagery
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/h3&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The dashboard pulls two live satellite images that update automatically — no intervention needed.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           Earth
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            comes from GOES-16, NOAA's geostationary weather satellite parked over the Americas at 35,786 km altitude. NOAA's NESDIS division publishes the GeoColor full-disk composite at a public URL that updates every ten minutes. It's a single
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;code&gt;&#xD;
      
           st.image()
          &#xD;
    &lt;/code&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            call.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           The Sun
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            comes from NASA's Solar Dynamics Observatory, which has been imaging the Sun continuously since 2010. The 193-angstrom AIA channel shows the solar corona at about 1.5 million degrees Celsius — active regions appear bright, and solar flares are visible as sudden brightenings. NASA publishes the latest image at a stable URL that updates every fifteen minutes.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The 3D Scene: WebGL Inside Streamlit
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/h3&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            The most visually striking part of the dashboard is a live 3D rendering of the Earth-Moon system with the Artemis spacecraft shown at its actual position. I built this using Three.js, the JavaScript WebGL library, and embedded it in Streamlit using
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;code&gt;&#xD;
      
           components.html()
          &#xD;
    &lt;/code&gt;&#xD;
    &lt;span&gt;&#xD;
      
           .
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The trick is passing live Python data into the JavaScript scene. I serialized the orbital positions into a JSON payload and injected it as a JavaScript constant:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
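The injection pattern reduces to a serialize-and-splice step. In this sketch the template is only the one line that receives the data; the real page is a full HTML document handed to components.html(), and the placeholder name is mine:

```python
import json

def inject_mission_data(scene_template: str, positions: dict) -> str:
    """Serialize live positions to JSON and splice them into the scene's script."""
    return scene_template.replace("__DATA__", json.dumps(positions))

# This line lives inside the scene's script tag in the full HTML template.
template = "const MISSION_DATA = __DATA__;"
html_snippet = inject_mission_data(
    template,
    {"orion": [250000.0, 12000.0, 8000.0], "moon": [384400.0, 0.0, 0.0]},
)
```

Because JSON is valid JavaScript literal syntax, the browser sees an ordinary constant and the Three.js code never needs to make its own network calls.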
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Inside the Three.js scene, everything is built procedurally from Canvas 2D textures — no image files required. Earth gets a multi-layer texture with ocean gradients, painted continents, polar ice caps, a cloud layer, and a night-side emissive map showing city lights. The Moon gets a hand-painted texture with maria (the dark basaltic plains), highland regions, and impact craters. Both textures are generated fresh in the browser on every load.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The spacecraft itself is rendered as a triple-layer pulsing orange glow — an outer halo, a mid-layer, and a bright inner core — with a white point at the center and a yellow velocity vector pointing in the direction of travel. 7,000 stars surround the scene, distributed using the golden-angle spiral method for uniform coverage of a sphere, with vertex colors spanning blue-white, white, and warm yellow to approximate the real distribution of stellar spectral types.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
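The golden-angle spiral (often called a Fibonacci sphere) is compact to write. The app does this in JavaScript inside the Three.js scene; the same method in Python looks like:

```python
import math

def fibonacci_sphere(n: int) -> list:
    """Distribute n points nearly uniformly on a unit sphere via golden-angle spacing."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - (2.0 * i + 1.0) / n          # even spacing along the polar axis
        r = math.sqrt(max(0.0, 1.0 - z * z))   # radius of that z-slice
        theta = golden_angle * i               # rotate each point by the golden angle
        points.append((r * math.cos(theta), r * math.sin(theta), z))
    return points
```

Each successive point is rotated by the golden angle, which never resonates with the slice spacing, so the stars avoid the visible banding a naive latitude/longitude grid would produce.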
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Camera control is handled with spherical coordinates. The scene auto-orbits slowly when idle; dragging pauses auto-orbit and gives manual control; scrolling zooms. The camera always looks at the origin (Earth's center), and the zoom is clamped so you can't zoom inside Earth or past the Moon.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Caching: Keeping It Fast and Polite
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            The dashboard uses Streamlit's
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;code&gt;&#xD;
      
           @st.cache_data
          &#xD;
    &lt;/code&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            decorator to avoid hammering the APIs on every page interaction. Different data sources have different refresh rates:
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
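The real app leans on st.cache_data(ttl=...) with a different ttl per source. As a rough pure-Python stand-in for what that decorator does (the cadence shown is an assumption, not the app's actual setting):

```python
import time
from functools import wraps

def cache_data(ttl: float):
    """Tiny stand-in for Streamlit's @st.cache_data(ttl=...) decorator."""
    def decorator(fn):
        store = {}
        @wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            if args in store:
                value, stamp = store[args]
                if ttl > now - stamp:  # entry is still fresh: serve the cached copy
                    return value
            value = fn(*args)          # stale or missing: refetch and remember
            store[args] = (value, now)
            return value
        return wrapper
    return decorator

@cache_data(ttl=60.0)  # assumed cadence: orbital data refreshed every minute
def fetch_orbit(target_id: str) -> str:
    # placeholder for the real Horizons call
    return "ephemeris for " + target_id
```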
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This means the dashboard feels instant for users — cached data is served immediately — while API calls happen at a sensible cadence in the background.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Deploying It
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           I wanted the dashboard to be always-on and publicly accessible, so I deployed it on a $6/month DigitalOcean Droplet running Ubuntu. The deployment stack is:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;ul&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            systemd
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        &lt;span&gt;&#xD;
          
             to run Streamlit as a background service that restarts automatically on crashes or reboots
            &#xD;
        &lt;/span&gt;&#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            Nginx
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        &lt;span&gt;&#xD;
          
             as a reverse proxy, forwarding port 80 to Streamlit's port 8501
            &#xD;
        &lt;/span&gt;&#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            Certbot
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        &lt;span&gt;&#xD;
          
             for automatic HTTPS via Let's Encrypt
            &#xD;
        &lt;/span&gt;&#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
  &lt;/ul&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The systemd service file is the critical piece — it's what makes the dashboard "always-on" rather than dependent on a terminal window staying open:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
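A representative unit file; the paths, user, and service name here are placeholders rather than the real server's values:

```ini
# /etc/systemd/system/artemis-tracker.service  (illustrative paths and names)
[Unit]
Description=Artemis II Tracker dashboard
After=network.target

[Service]
User=www-data
WorkingDirectory=/opt/artemis-tracker
ExecStart=/opt/artemis-tracker/venv/bin/streamlit run app.py --server.port 8501
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Restart=always is what turns a crash or reboot into a five-second blip instead of an outage.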
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Total monthly cost: $6. No sleeping, no cold starts, no usage limits.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           What's Not There (And Why)
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The dashboard intentionally includes a section explaining what it can't show. Crew biometrics, cabin telemetry, propellant levels — all of that exists, but it flows through NASA's internal Mission Control systems and has never been made available via a public API. This isn't a gap I could fill with creativity; it would require a direct NASA partnership.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           I think it's worth being honest about this on the dashboard itself. A lot of "live tracking" sites paper over data gaps with stale numbers or estimates presented as live data. I'd rather show a clear "not available" than fake precision.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Closing Thoughts
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The whole project took a weekend to build, costs $6 a month to run, and requires no API keys or paid data sources. Everything it shows is genuinely live — not simulated, not estimated, not cached from hours ago.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The most satisfying moment was watching the spacecraft position update in real time on the 3D globe and realizing that the numbers on screen represent four actual human beings traveling through deep space, farther from Earth than any person has been in over fifty years. That felt worth building.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            If you want to run it yourself, the
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://github.com/wmac131/ArtemisII_Tracker" target="_blank"&gt;&#xD;
      
           full source
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            is available and setup takes about ten minutes. All you need is Python 3.10+ and a terminal.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Tue, 21 Apr 2026 03:53:31 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/building-an-artemis-ii-tracker</guid>
      <g-custom:tags type="string" />
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/images--281-29.png">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/images--281-29.png">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>Hours to Minutes: Scaling User Clustering with Topological Manifold Learning</title>
      <link>https://www.williammcnamara3.com/hours-to-minutes-scaling-user-clustering-with-topological-manifold-learning</link>
      <description />
      <content:encoded>&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           How I made a critical segmentation algorithm 9x faster with a new approach to clustering
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           If there is one topic I enjoy more than any other, it’s dimensionality reduction algorithms (which I have written about extensively). One of the most useful applications of dimensionality reduction is user segmentation, where machine learning can take a multitude of inputs and produce better clusters than any single variable that seems important on its own but loses its deeper context.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           A couple of years ago, I built a user clustering algorithm that leverages t-distributed stochastic neighbor embedding (t-SNE) to predict how likely customers are to churn or expand their business. What I didn’t expect, though, was to run into scaling issues so quickly.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The Problem Statement
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           t-SNE has to consider every possible relationship between customers, and that thoroughness turns into a weakness. Imagine you’re looking at a city, and you ask every person how similar they feel to every other person in the city, repeating this for millions of pairs. That’s t-SNE: every time you add a new customer, the volume of pairwise calculations grows quadratically.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           When I launched the program, it took a couple of hours to run through our customer base, which meant running it once a day, a compromise I had come to live with. But every time we added more customers to the analysis, t-SNE's processing time didn't just increase, it exploded.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           I needed to figure out how to achieve higher-performance clustering, ideally without sacrificing accuracy or run frequency.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           UMAP Approach
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           With all of the buzz around the training of foundation models at companies like OpenAI and Anthropic, I’ve been familiarizing myself with topological manifold learning, which I’ve found to be a more intuitive way to think about optimizing algorithms than the traditional approaches I was taught in school. In this context, topological manifolds simply give greater consideration to the natural structures emerging in data. Returning to the city analogy: instead of comparing every resident to every other resident, you ask how similar each person feels to their neighbor, to their neighborhood in general, and to the broader city community. That saves you and your algorithm a lot of time by preserving the global and local structures that already exist.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Enter UMAP (Uniform Manifold Approximation and Projection) modeling, which represents a fundamental shift in approaching dimensionality reduction for user clustering. Instead of comparing every user to every other user like t-SNE does, UMAP builds a network graph where each user connects primarily to their nearest neighbors, then uses cross-entropy optimization to find a low-dimensional representation that preserves both these local neighborhoods and the global structure of how different user segments relate to each other.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
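To make the scaling difference concrete, here is a minimal sketch (using scikit-learn on a synthetic feature matrix, not my production data) of how many relationships each approach has to consider:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic "customer" feature matrix: 1,000 users, 20 behavioral features
X, _ = make_blobs(n_samples=1000, n_features=20, centers=5, random_state=42)
n = X.shape[0]

# t-SNE considers every user-to-user pair: n * (n - 1) / 2 relationships
tsne_pairs = n * (n - 1) // 2

# UMAP only needs each user's k nearest neighbors: roughly n * k graph edges
k = 15
knn = NearestNeighbors(n_neighbors=k).fit(X)
distances, indices = knn.kneighbors(X)
umap_edges = n * k

print(f"all-pairs relationships: {tsne_pairs:,}")  # 499,500
print(f"kNN graph edges:         {umap_edges:,}")  # 15,000
```

At 1,000 users the gap is already more than 30x, and it widens as the user base grows, since the pair count grows quadratically while the edge count grows linearly.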
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-09-19+at+12.05.48-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This approach delivers dramatic computational advantages. In theory it reduces processing time while producing more stable, more consistent results. So I decided to test it out!
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Measuring Performance
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           Evaluating the comparative performance of UMAP will require a multi-dimensional approach that goes beyond simple speed comparisons to ensure the migration delivers genuine business value. Processing time should certainly improve, but I will also need to validate that faster clustering doesn't come at the cost of insight quality or stability, which would harm business decision making no matter how fast the pipeline runs. 
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           I’ll measure performance along three critical dimensions: 
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;ul&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Processing Time
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Clustering Quality
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Business Impact
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
  &lt;/ul&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           This comprehensive approach involves running both algorithms on identical datasets with optimized parameters, conducting statistical significance testing to ensure observed differences aren't due to randomness, and validating that technical improvements translate to measurable business outcomes like better personalization performance or more actionable user segments.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           Assessing Processing Time
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           I’ve built an evaluation framework that tests both algorithms across progressively larger datasets from 1,000 to 100,000 users using identical hardware configurations and multiple runs to account for statistical variation. I chose to measure not just the core clustering time, but the complete end-to-end pipeline including data loading, preprocessing, and post-processing phases, since the clustering algorithm represents only one component of the total user segmentation workflow. 
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
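The harness itself is simple. Below is a skeleton of the benchmarking loop; PCA stands in for t-SNE and UMAP so this sketch runs in seconds, and the dataset is synthetic rather than real customer data:

```python
import time
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

def benchmark(reducer_factory, sizes, n_features=20, repeats=3):
    """Time end-to-end fit_transform across dataset sizes, averaged over repeats."""
    results = {}
    for n in sizes:
        X, _ = make_blobs(n_samples=n, n_features=n_features, random_state=0)
        runs = []
        for _ in range(repeats):
            reducer = reducer_factory()  # fresh instance each run
            start = time.perf_counter()
            reducer.fit_transform(X)
            runs.append(time.perf_counter() - start)
        results[n] = sum(runs) / len(runs)
    return results

# PCA is a fast stand-in here; swap in TSNE(...) or umap.UMAP(...) for the real test
timings = benchmark(lambda: PCA(n_components=2), sizes=[1000, 5000, 10000])
for n, secs in timings.items():
    print(f"{n:>6} users: {secs:.3f}s")
```

The full pipeline timing (loading, preprocessing, post-processing) would wrap this same pattern around each stage rather than just `fit_transform`.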
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-09-19+at+11.38.10-AM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The testing protocol includes statistical significance analysis using paired t-tests and confidence intervals to ensure observed speedups are genuine rather than measurement noise, while also documenting memory usage patterns and identifying the scalability limits where each algorithm begins to fail. Most importantly, I validated results under realistic production conditions with simulated background system load, since benchmark performance in isolation can differ significantly from real-world deployment. This measurement approach allowed me to quantify not just the dramatic time improvement, which came out to a 75% runtime saving, but also the practical implications for infrastructure costs, pipeline frequency, and data scientist productivity.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
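A sketch of that significance test, with hypothetical per-run timings standing in for the real measurements:

```python
import numpy as np
from scipy import stats

# Hypothetical wall-clock times (seconds) for five paired runs on identical datasets
tsne_runs = np.array([118.2, 121.5, 119.8, 123.1, 120.4])
umap_runs = np.array([29.7, 31.2, 30.1, 30.8, 29.9])

# Paired t-test: each pair ran on the same dataset, so compare run-by-run
t_stat, p_value = stats.ttest_rel(tsne_runs, umap_runs)

# 95% confidence interval on the mean per-run saving
diffs = tsne_runs - umap_runs
ci = stats.t.interval(0.95, len(diffs) - 1,
                      loc=diffs.mean(), scale=stats.sem(diffs))

print(f"t = {t_stat:.1f}, p = {p_value:.2e}")
print(f"mean saving: {diffs.mean():.1f}s, 95% CI: ({ci[0]:.1f}, {ci[1]:.1f})")
```

A tiny p-value with a confidence interval well clear of zero is what separates a real speedup from measurement noise.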
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Assessing Clustering Quality
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           Assessing embedding quality is a more complex challenge in dimensionality reduction evaluation. Unlike performance metrics, where faster execution is an unambiguous improvement, embedding quality involves conflicting objectives that must be balanced across different analytical priorities. 
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           The evaluation framework I have built addresses this complexity with six dimensions of assessment:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;ul&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            Local Structure Preservation
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        
            : This metric ensures that users who exhibit similar patterns, engagement levels, or demographic characteristics stay grouped together during dimensionality reduction.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            Global Structure Preservation
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        
            : Understanding how different user segments relate to each other is crucial for strategic decisions. If high-spend users and budget users are naturally distant in behavior space, this should be reflected in the visualization.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            Intrinsic Clustering Quality
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        
            : This directly measures how actionable your user segments will be. Better clustering quality means clearer segment boundaries and more distinct user personas.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            Ground Truth Validation
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        
            : This provides the most direct assessment of whether the embedding preserves the user behavioral patterns you expect to find, ensuring insights derived from visualization accurately reflect genuine segmentation patterns.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            Density Preservation
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        
            : High-value customers might be sparse and unique, while mainstream users cluster densely. Preserving these patterns is crucial for understanding market structure.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            Topological Structure Preservation
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        
            : User behavior often follows continuous pathways - freemium to premium, casual to power user, young adult to family-oriented. Topology preservation ensures these natural progression routes are visible in the embedding.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
  &lt;/ul&gt;&#xD;
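Several of these dimensions map onto standard metrics. A minimal sketch using scikit-learn, with PCA standing in for the embedding under evaluation and synthetic labeled users in place of real data:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness
from sklearn.metrics import silhouette_score, adjusted_rand_score
from sklearn.cluster import KMeans

# Synthetic users with known segment labels (the "ground truth")
X, labels = make_blobs(n_samples=500, n_features=20, centers=5, random_state=1)

# PCA stands in here for whichever embedding (t-SNE, UMAP) is under test
embedding = PCA(n_components=2).fit_transform(X)

# Local structure preservation: do high-dimensional neighbors stay neighbors?
local = trustworthiness(X, embedding, n_neighbors=15)

# Intrinsic clustering quality: how separable are the segments in the embedding?
intrinsic = silhouette_score(embedding, labels)

# Ground truth validation: do clusters found in the embedding match known segments?
found = KMeans(n_clusters=5, n_init=10, random_state=1).fit_predict(embedding)
truth = adjusted_rand_score(labels, found)

print(f"trustworthiness: {local:.3f}, silhouette: {intrinsic:.3f}, ARI: {truth:.3f}")
```

Density and topology preservation need more specialized tooling, but the pattern is the same: compute each score for both embeddings on identical inputs and compare.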
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           These six metrics collectively balance the different mathematical objectives of dimensionality reduction algorithms. Running the comparison, I would expect t-SNE to excel at cluster separation and UMAP to excel at the more structural metrics, which often provide more actionable business insights.
           &#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-09-21+at+9.16.29-AM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The assessment revealed UMAP as the superior algorithm in 75% of test runs. UMAP consistently excelled at maintaining local neighborhood relationships (2-5x better preservation), density patterns (4-13x better), and topological connectivity, while both algorithms achieved perfect ground truth validation scores. 
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           Although t-SNE maintains an advantage in pure clustering quality, UMAP's dominance in the structural preservation metrics proves more valuable for real-world user clustering, and I would expect that gap to only widen at scale.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Business Impact
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Alright, why are we doing any of this? The benefit of cluster analysis is the ability to segment data points into distinct clusters that can each receive different business treatment. In my case the algorithm identifies five distinct clusters describing different kinds of users. This is gold to a product team, which can then customize the user experience to address the needs of each specific segment. The detailed cluster analysis below shows a random sample of 100 users fitting very effectively into five groups.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-09-21+at+2.13.59-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            While scatter plots effectively show cluster assignments, density overlap analysis reveals the probabilistic boundaries and confidence regions that standard visualizations miss.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           By applying Gaussian kernel density estimation to each cluster and plotting the resulting contours, I can see where the clusters truly begin and end, rather than just showing point locations. The insight here is in the contour intersections. Non-overlapping contours indicate clean separation with high classification confidence, but an overlapping region, such as between Clusters 1 and 5, suggests a potential misclassification zone where data points could reasonably belong to either (or multiple) clusters. This is particularly valuable for real-world applications where understanding uncertainty is crucial. There is also insight in the contour shapes. I'm generally seeing circular patterns, which suggest well-defined, homogeneous groups. If the shapes were more irregular, that might indicate subclusters or the need for parameter adjustment.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
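The density estimation step can be sketched in a few lines with SciPy's gaussian_kde. The two clusters here are synthetic stand-ins for the embedding output, and the 10%-of-peak threshold is an illustrative choice, not the one from my analysis:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(7)

# Two hypothetical 2-D clusters from an embedding; cluster B sits close to A
cluster_a = rng.normal(loc=[0, 0], scale=0.8, size=(100, 2))
cluster_b = rng.normal(loc=[2, 0], scale=0.8, size=(100, 2))

# Fit a Gaussian KDE per cluster (gaussian_kde expects shape (dims, points))
kde_a = gaussian_kde(cluster_a.T)
kde_b = gaussian_kde(cluster_b.T)

# Evaluate both densities on a grid covering the embedding region
xs, ys = np.meshgrid(np.linspace(-3, 5, 80), np.linspace(-3, 3, 60))
grid = np.vstack([xs.ravel(), ys.ravel()])
dens_a, dens_b = kde_a(grid), kde_b(grid)

# Fraction of grid cells where both clusters keep >10% of their peak density:
# a rough stand-in for the contour-intersection "misclassification zone"
overlap = np.mean((dens_a > 0.1 * dens_a.max()) & (dens_b > 0.1 * dens_b.max()))
print(f"overlap fraction: {overlap:.3f}")
```

Plotting `dens_a` and `dens_b` as contour levels over the scatter plot gives exactly the kind of figure shown above; the overlap fraction quantifies what the intersecting contours show visually.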
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-09-21+at+2.14.43-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Conclusion
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Cluster analysis is a useful tool for analyzing user behavior, but at scale it can become computationally expensive. Switching from t-SNE to UMAP introduced topological manifolds that better preserve the existing structures and save the algorithm a lot of work. This allowed my program to run 9x faster while preserving, and in some cases enhancing, the embedding quality of the segment assignments. UMAP may not be superior in every use case, though, so I recommend following an evaluation methodology like the one I've described if you're intending to do this in your business. That said, I'm a UMAP convert!
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Fri, 19 Sep 2025 23:29:25 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/hours-to-minutes-scaling-user-clustering-with-topological-manifold-learning</guid>
      <g-custom:tags type="string">cluster,data analysis,classifier,classification,machine learning,reduction,Python</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-09-21+at+2.14.43-PM.png">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-09-21+at+2.14.43-PM.png">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>Joe Biden's AI Legacy</title>
      <link>https://www.williammcnamara3.com/joe-biden-ai-legacy</link>
      <description>Not a whole lot for Trump to undo</description>
      <content:encoded>&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Not much to undo anyway
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           With the Biden Administration soon coming to an end, I wanted to look back at the progress made over the last couple of years in the tech space, specifically the meaningful regulation of Artificial Intelligence.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           To recap the timeline, the Biden Administration took three major steps: 
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           (1) the Blueprint for an AI Bill of Rights, which created a rubric for the effective governance of Artificial Intelligence. 
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           (2) the Executive Order on Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence, which put into legal practice the administrative policy informed by the Blueprint.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           (3) The National Security Memorandum on Artificial Intelligence, which fulfilled a directive set forth by the Executive Order to craft formalized strategy and policy for Artificial Intelligence in the National Security space.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           Executive Order
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           I’ve written previously about how the Executive Order was and remains to this day the most significant policy initiative in the AI regulation space. It both encouraged the development and use of AI applications across the executive branch, while establishing guardrails to protect against some of the negative outcomes of algorithmic approaches.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           In my opinion, the most consequential and long lasting product to come out of the Executive Order is the Office of Management and Budget (OMB) guidance on the management of AI systems in federal agencies. The guidance extends the EO’s directives on civil rights and critical safety by providing definition to the types of applications that will be considered within the scope of those areas. This gives the Department of Justice something to work with in investigating potential civil rights violations that arise from the use of algorithmic systems.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           National Security Memorandum
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           The Biden Administration has not taken a strong stance on the use of lethal autonomous weapons. In 2023 the Biden State Department issued a Political Declaration stating that the military use of AI should comply with international humanitarian law. To be clear, this was not a binding action in any way so much as a “hey guys, we should do this” message to other nations.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           The Executive Order then later reiterated the Political Declaration while setting no administrative restrictions on the development or deployment of lethal autonomous weapons. I thought at the time that this was a missed opportunity that was then missed again with the National Security Memorandum, which focused instead on what we should be doing rather than what we shouldn’t be doing, while briefly referencing again the non-binding political declaration.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           It is disappointing that the Biden Administration made no effort to back up its loose declaration that lethal autonomous weapons are an area of concern. Sixty countries have signed the Political Declaration, committing to a shared desire to move on this issue, but there was no leadership to create any law, international or otherwise, to curb lethal autonomous weapons.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           What comes next?
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           The Executive Order was a good start, and it created a rubric for future administrations sympathetic to this issue to work with. Unfortunately, that is not what we are about to get. President-Elect Trump has already indicated that he will revoke the Executive Order, and likely every other memorandum and guideline the Biden Administration has created around this issue, in keeping with his broader deregulatory stance. This bodes poorly for any meaningful progress toward AI governance for at least the next four years, but doubly unfortunate is that there isn’t much to undo anyway. The Biden Administration made progress, but fell embarrassingly short on this issue, and we are entering 2025 with no meaningful law to control the development of AI.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Wed, 11 Dec 2024 02:35:18 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/joe-biden-ai-legacy</guid>
      <g-custom:tags type="string">policy</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/F0CEDDA8-CDA3-A365-792FF3B0EB0FCFF8.jpg">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/F0CEDDA8-CDA3-A365-792FF3B0EB0FCFF8.jpg">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>Exploring Alphafold Model (Part 2)</title>
      <link>https://www.williammcnamara3.com/exploring-alphafold-model-part-2</link>
      <description>ColabFold has changed the game for amateur protein folding analysis</description>
      <content:encoded>&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           I've written previously about my enthusiasm for implementing and using Deepmind's Alphafold model for protein folding. AlphaFold2 finally achieved what many thought impossible: predicting protein structures with near-experimental accuracy from sequence alone. But as an independent researcher without access to massive computational resources, I needed a more accessible way to harness this technology. Fortunately, I discovered ColabFold, a brilliant adaptation that makes cutting-edge protein structure prediction accessible to just about anyone with an internet connection.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           ColabFold
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           My first run with ColabFold was one of the easiest coding experiences I've ever had. I navigated to the notebook, pasted in the protein sequence I'd worked with previously, clicked "Run all," and within minutes I was looking at a detailed 3D model that would have taken years to determine experimentally. The interface is deceptively simple, but under the hood incredibly sophisticated analysis is happening.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Preparing My Experiment
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           For my first prediction, I decided to use a relatively small protein with 59 amino acids:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASK
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This is a bacterial transcription factor that I've been studying for a while, but whose structure remains experimentally undetermined. One of the aspects I appreciate about ColabFold is how it handles various options. I decided not to use AMBER relaxation or any
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            templates initially—meaning the prediction would be based purely on the sequence and evolutionary information, without reference to any known structures.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Behind the scenes, ColabFold installs all the necessary dependencies—and there are quite a few. Considering I spent the better part of a week installing all the dependencies when I first started with AlphaFold, shielding users from that complexity is a pretty great feature of ColabFold.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Generating Multiple Sequence Alignments
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This is where ColabFold truly shines compared to the original AlphaFold implementation. The standard AlphaFold pipeline uses HHblits to search against enormous databases like BFD and MGnify, which can take hours or days and requires terabytes of storage.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           ColabFold uses MMseqs2, a much faster homology detection method, searching against databases like UniRef and environmental sequences. For my protein, I chose "mmseqs2_uniref_env" which searches both UniRef and environmental sequence databases for homologs. Again, a million times faster than my first implementation.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The evolutionary information captured in these Multiple Sequence Alignments (MSAs) is crucial for accurate prediction. As proteins evolve, certain positions change in coordinated ways to maintain structure and function. These coevolutionary patterns provide powerful clues about which amino acids are likely to be spatially close in the folded protein. When I ran this step, ColabFold found hundreds of related sequences for my protein—a good sign that the prediction would be reliable.  I could see this visually in the MSA coverage plot that showed evolutionary conservation across the protein's length.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-05-11+at+5.19.10-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Running the Prediction
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Time for the meat of the analysis. ColabFold feeds the MSA and configuration into AlphaFold's neural networks, which have been trained on the entire Protein Data Bank plus additional structures.
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      
           For my protein, the process took only about 5 minutes on the Google Colab GPU—orders of magnitude faster than experimental methods, and even multiple times faster than my previous AlphaFold implementation.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           When the prediction finished, I was greeted with a stunning 3D visualization of my protein, colored by confidence. The predicted structure showed a compact globular fold with several alpha helices—typical for a DNA-binding protein.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-05-11+at+5.27.12-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Beyond the pretty visualization, ColabFold provides rich data to assess the reliability of the prediction. The model produces a per-residue confidence score called pLDDT (predicted Local Distance Difference Test), ranging from 0-100, with higher values indicating greater confidence. My protein showed high confidence (70-90) across most of its length, with slightly lower confidence at the termini—exactly what you'd expect, since protein ends are usually more flexible and therefore harder to predict.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The visualization uses color coding to make this immediately apparent:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Blue regions (90-100): Very high confidence
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Light blue (70-90): Confident
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Yellow (50-70): Low confidence
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Red (&amp;lt;50): Very low confidence
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
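Those bands are easy to reproduce when post-processing the per-residue scores yourself, for example after parsing them out of the output files. A small helper (the function name is my own):

```python
def plddt_band(score):
    """Map a pLDDT score (0-100) to the confidence band used in the plots."""
    if score >= 90:
        return "very high"
    if score >= 70:
        return "confident"
    if score >= 50:
        return "low"
    return "very low"

print(plddt_band(82.5))  # confident
```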
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           For multimeric proteins, ColabFold also provides Predicted Aligned Error (PAE) plots that show confidence in the relative positioning of residues—crucial for assessing interface quality in protein complexes.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Something that surprised me about ColabFold is that it didn't just give me one prediction; it provided five ranked models. I could examine each one by changing the rank number in the visualization cell. For my protein, the top-ranked model had the highest average pLDDT score, but the other models showed similar overall folds with minor variations in loop regions—consistent with what we know about protein dynamics.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
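Re-ranking the models yourself is a one-liner once you have per-residue scores. A sketch with invented model names and pLDDT values:

```python
# Invented per-residue pLDDT scores for three of the ranked models.
models = {
    "model_1": [88.2, 91.5, 90.1],
    "model_2": [75.0, 80.3, 78.8],
    "model_3": [85.1, 84.0, 86.2],
}

def mean(values):
    return sum(values) / len(values)

# Order models by average pLDDT, best first, as ColabFold's ranking does.
ranked = sorted(models, key=lambda name: mean(models[name]), reverse=True)
print(ranked)  # ['model_1', 'model_3', 'model_2']
```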
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Final Thoughts
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Exciting as all of this is, I had to ask myself whether this was categorically better than AlphaFold. The more I learned, the more complicated that answer got. One limitation is that Google Colab assigns different GPUs with varying memory limits, so a long protein or complex will sometimes exceed available memory. And while MMseqs2 is faster, it sometimes finds fewer homologous sequences than the full AlphaFold pipeline, which searches additional databases, and a shallower MSA can reduce prediction accuracy.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           It seems like ColabFold sacrifices a little bit of accuracy for a lot of speed, ease, and meta-analysis. Which is great! But if I'm entering a competition or writing a research paper, I'd probably still use the full AlphaFold pipeline locally to get the best results.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           But if, like me, you're exploring protein folding as a passion project, or if you're working with proteins in any capacity, whether as a researcher, student, or educator, I highly recommend giving ColabFold a try. The barrier to entry is minimal, and the potential insights are enormous. As the authors of ColabFold eloquently put it, their goal is "making protein folding accessible to all." In my experience, they've succeeded brilliantly.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            As always my code can be found on my GitHub
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://github.com/wmac131/genome_sequencing" target="_blank"&gt;&#xD;
      
           here
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           .
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Tue, 15 Oct 2024 21:41:55 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/exploring-alphafold-model-part-2</guid>
      <g-custom:tags type="string">data analysis,bioinformatics,Python</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-05-11+at+5.27.12-PM.png">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-05-11+at+5.27.12-PM.png">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>Exploring Alphafold Model (Part 1)</title>
      <link>https://www.williammcnamara3.com/exploring-alphafold</link>
      <description>Getting started with Deepmind's revolutionary model for protein folding</description>
      <content:encoded>&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           If you've been following the computational biology space like I have, you've probably heard about AlphaFold and the protein folding problem. As a data scientist who's been fascinated by this intersection of deep learning and molecular biology, I wanted to explore whether I could build a simpler but effective protein structure prediction model on my own. In this post, I'll walk you through my journey of creating an attention-based model that can predict protein structures from amino acid sequences.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The Protein Folding Challenge: Why It Matters
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Proteins are the workhorses of life, responsible for virtually every biological process in our cells. Each protein has a unique 3D structure that determines its function, but experimentally determining these structures is painfully slow and expensive—sometimes taking years and millions of dollars.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           If we look at a protein closely, it's essentially a string of amino acids (like beads on a necklace) that folds into a complex three-dimensional shape. While we know the sequences of billions of proteins, we've mapped far fewer of their structures. This gap represents one of the grand challenges in computational biology. That's where computational approaches like AlphaFold come in—they can predict a protein's structure from its amino acid sequence alone, in minutes rather than years.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Alphafold
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Getting AlphaFold running on Google Cloud presented some unique challenges. The original system was designed for powerful research clusters, not the more constrained environment of a Colab notebook. I had to make some careful adaptations to work within these limitations. The trickiest part was getting the OpenMM physics engine to work correctly, but once I figured that out, the rest fell into place.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The real magic of AlphaFold comes from how it leverages evolutionary information. When a protein evolves, certain positions remain conserved if they're critical for structure or function. By analyzing many related protein sequences, AlphaFold can infer which positions are likely to be close to each other in 3D space. To gather this evolutionary information, we need to search large sequence databases. In my cloud implementation, I use three key databases: the 'Universal Reference Cluster' and 'MGnify' databases provided by the European Bioinformatics Institute, and the 'Big Fantastic Database' created by Martin Steinegger and Johannes Söding.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           For each protein sequence we want to analyze, we use a tool called Jackhmmer to search these databases for homologous sequences. This creates what's called a Multiple Sequence Alignment (MSA), which is essentially a matrix showing how amino acids vary at each position across evolutionarily related proteins. One fascinating thing about this process is watching how the search finds distant evolutionary relatives. For some well-studied proteins, I found thousands of related sequences; for others that are more unique, there were only a handful. This directly impacts prediction accuracy, as more related sequences generally mean better predictions.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
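A toy way to see the "conserved vs. variable positions" idea is to score each MSA column by how often its most common residue appears. The miniature alignment and helper below are invented for illustration:

```python
from collections import Counter

# Invented miniature alignment: three homologous sequences.
alignment = [
    "MKVL",
    "MKIL",
    "MRVL",
]

def conservation(msa):
    """Per-column fraction of sequences sharing the most common residue."""
    return [Counter(col).most_common(1)[0][1] / len(msa) for col in zip(*msa)]

print(conservation(alignment))
```

Columns scoring 1.0 are perfectly conserved (likely structurally or functionally critical); lower scores mark positions that are free to vary.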
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           With our MSAs in hand, we're ready to run AlphaFold itself. The model has several key components:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;ol&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            Evoformer blocks
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        &lt;span&gt;&#xD;
          
             that process the MSA and extract evolutionary patterns
            &#xD;
        &lt;/span&gt;&#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            A structure module
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        &lt;span&gt;&#xD;
          
             that converts this information into 3D coordinates
            &#xD;
        &lt;/span&gt;&#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            A confidence predictor
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        &lt;span&gt;&#xD;
          
             that estimates how accurate each part of the prediction is
            &#xD;
        &lt;/span&gt;&#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
  &lt;/ol&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           What's happening inside is super interesting: the neural network is essentially learning the physical rules that govern protein folding, without being explicitly programmed with those rules. The output includes not just the 3D coordinates but also confidence scores that tell us which parts of the prediction we can trust. The main confidence metric is pLDDT (predicted Local Distance Difference Test), which ranges from 0-100. These scores provide a visual sense of which regions are likely correct and which should be taken with a grain of salt.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Predicting My First Structure
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           For my first prediction, I decided to start with a fairly simple target: a small zinc finger protein with just 74 amino acids:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;code&gt;&#xD;
      
           MAAHKGAEHHHKAAEHHEQAAKHHHAAAEHHEKGEHEQAAHHADTAYAHHKHAEEHAAQAAKHDAEHHAPKPH
          &#xD;
    &lt;/code&gt;&#xD;
  &lt;/p&gt;&#xD;
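Before the search can run, the sequence needs to be on disk in FASTA format. A minimal sketch (the filename and header line are my own choices, not part of the AlphaFold pipeline):

```python
# Write the query sequence to a FASTA file for the database search step.
sequence = (
    "MAAHKGAEHHHKAAEHHEQAAKHHHAAAEHHEKGEHEQAA"
    "HHADTAYAHHKHAEEHAAQAAKHDAEHHAPKPH"
)

with open("query.fasta", "w") as handle:
    handle.write(">query\n")
    handle.write(sequence + "\n")
```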
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           After entering this sequence and hitting "Run," I watched AlphaFold go through its paces:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;ol&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        
            First, it searched the sequence databases, finding about 200 related sequences
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Then it processed these sequences through the neural network
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Finally, it produced a relaxed 3D model of the protein
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
  &lt;/ol&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The moment of truth came when I visualized the result. As I would expect from a zinc finger protein, it folded into a compact structure with a clear alpha-helical pattern. The confidence scores were mostly in the high range (70-90), suggesting this was likely a reliable prediction. What's remarkable is that this entire process took just a few minutes on a standard Google GPU. A decade ago, this level of accuracy would have been considered impossible, and even five years ago it would have required massive computational resources.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Understanding the Predicted Structure
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           One of my favorite aspects of working with AlphaFold is the rich visualization and analysis tools. After generating a prediction, I can examine it from multiple angles. Using a pLDDT plot I can graph
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            the confidence scores per residue to help identify flexible regions.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      
           For my zinc finger protein, the pLDDT plot showed high confidence in the core helical regions, with slightly lower confidence at the termini. This pattern makes biological sense to me, as protein ends are often more flexible and are therefore predicted with less certainty.
           &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-05-11+at+1.32.19-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-05-11+at+1.28.02-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Closing Thoughts
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            I'm very excited to continue experimenting with AlphaFold. Even more excited to see how these models are continually refined and applied to advanced problems in computational biology. Especially at a time when the morality of many machine learning applications could generously be described as questionable, it's nice to work with a model that will only lead to improved understanding of science and biology! Complete code can be found on my GitHub
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://github.com/wmac131/genome_sequencing" target="_blank"&gt;&#xD;
      
           here
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           .
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Sat, 03 Feb 2024 16:46:19 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/exploring-alphafold</guid>
      <g-custom:tags type="string">data analysis,bioinformatics,Python</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Q8W3K0.png">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Q8W3K0.png">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>Extracting Patterns from Genomic Data</title>
      <link>https://www.williammcnamara3.com/deterministic-methods-for-sars-cov2-classification</link>
      <description>A continuation of genome sequencing analysis</description>
      <content:encoded>&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Introduction to Time-Dependent Pattern Recognition
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           When working with complex datasets like the SARS-CoV-2 genome, what initially appears as random noise often contains underlying patterns governed by specific mechanisms. As data professionals, our challenge is to identify these hidden rules that drive modifications in the data. This case study explores how multiple time scales can reveal patterns in genomic data that would otherwise remain obscured in traditional analysis approaches.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The complexity of biological data presents unique challenges for data engineers and analysts. Genomic sequences—essentially long strings of A, C, G, and T nucleotides—contain embedded patterns that evolve over time in response to environmental pressures. Traditional statistical approaches often fail to capture these patterns because they operate under assumptions of stationarity or simple linear relationships.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           In this analysis, I will demonstrate how techniques from data engineering, machine learning, and time series analysis can uncover hidden structure in seemingly chaotic genomic modifications. By approaching the problem as a multi-dimensional data challenge rather than purely as a biological one, I've discovered remarkable patterns that follow deterministic rules.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Multi-Scale Time Series Analysis
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            One of the most fundamental patterns I uncovered is time-dependence, but with an important insight:
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           the choice of time scale significantly impacts pattern visibility
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      
           . I analyzed the same genome size dataset across three different time scales:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;ol&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            Standard calendar time
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        &lt;span&gt;&#xD;
          
             - showing dynamic fluctuations
            &#xD;
        &lt;/span&gt;&#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            Sunshine Duration (SD)
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        &lt;span&gt;&#xD;
          
             - approximating daily solar radiation
            &#xD;
        &lt;/span&gt;&#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            Sunspot Number (NS)
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        &lt;span&gt;&#xD;
          
             - tracking solar cycle activity
            &#xD;
        &lt;/span&gt;&#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
  &lt;/ol&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_P-8ujCOgL9EPIAAS.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This multi-scale approach revealed that what appeared as random shrinking in standard time measurements actually followed consistent patterns when mapped to environmental variables. The apparent randomness resulted from the superposition of multiple cyclical patterns operating at different frequencies.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            When plotting genome size against standard calendar time, I observed irregular shrinking patterns with no clear periodicity. However, when I transformed the time axis to represent solar radiation metrics (both daily sunshine duration and cyclical sunspot activity), coherent patterns emerged. This demonstrates a critical lesson for data engineers:
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           your choice of coordinate system and time scale can make the difference between seeing noise and seeing signal
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      
           .
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
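The axis swap is the whole trick, and it is simple to express in code. A toy illustration with invented numbers, re-indexing the same genome-size measurements from calendar days onto a sunspot-number axis:

```python
# Each sample: (calendar_day, sunspot_number, genome_size). Values invented.
samples = [
    (1, 30, 29903),
    (2, 11, 29880),
    (3, 25, 29891),
    (4, 5, 29870),
]

# Viewed against calendar time, the sizes jump around with no obvious order.
by_calendar = [(day, size) for day, _, size in samples]

# Re-indexed by sunspot number, the same values line up monotonically.
by_sunspots = sorted((spots, size) for _, spots, size in samples)

print(by_calendar)
print(by_sunspots)
```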
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Implementation Details
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The data pipeline for this analysis required several key components:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;ol&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            Data collection layer
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        
            : Gathering genomic sequences from public repositories and normalizing sequence lengths
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            Time series transformation module
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        
            : Converting between different time scales using astronomical data
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            Feature extraction system
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        
            : Calculating genome size and other relevant metrics
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            Correlation analysis engine
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        
            : Detecting relationships between genomic changes and environmental variables
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
  &lt;/ol&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The superposition of these multiple time-dependent patterns creates what appears to be random behavior when viewed through conventional time measurements. From a data engineering perspective, this highlights the importance of domain knowledge in feature engineering—understanding potential environmental influences guided my choice of time transformations.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Dimensionality Reduction for Pattern Discovery
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           To analyze the yearly adaptation patterns, I applied a Variational Autoencoder (VAE) to K-mer frequency data. This dimensionality reduction technique allowed me to:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;ol&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Organize genomes by latent space coordinates
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Identify sliding patterns in genome fragments
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Discover hotspots for potential mutations
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
  &lt;/ol&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The VAE approach transformed a high-dimensional problem (thousands of K-mer frequencies) into an interpretable latent space where patterns could be visualized and analyzed.
           &#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_XunvL6wp8S0hBkvW.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           K-mer Representation for Sequence Analysis
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           For those unfamiliar with genomic analysis, K-mers are subsequences of length K extracted from a longer sequence. For example, the sequence "ACGTACGT" would yield the following 4-mers: "ACGT", "CGTA", "GTAC", "TACG", and "ACGT" (repeating). K-mer frequency analysis is a powerful technique for representing sequence data in a way that's amenable to machine learning.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           In the pipeline, I implemented a sliding window approach to extract K-mers of varying lengths (1-4) from each genome, creating frequency vectors that captured the distribution of these subsequences. This transformation converted variable-length genomic sequences into fixed-length feature vectors suitable for deep learning.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
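The sliding-window extraction described above can be sketched as follows. This is a minimal illustration rather than the actual pipeline code, and the function name is my own:

```python
from collections import Counter
from itertools import product

def kmer_frequencies(sequence, k_values=(1, 2, 3, 4)):
    """Slide a window across the sequence and count k-mers of each length,
    returning a fixed-length frequency vector over all A/C/G/T k-mers."""
    vector = []
    for k in k_values:
        counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
        total = max(sum(counts.values()), 1)
        # Enumerate every possible k-mer so the vector length is fixed,
        # regardless of the genome's length or content
        for kmer in ("".join(p) for p in product("ACGT", repeat=k)):
            vector.append(counts[kmer] / total)
    return vector

features = kmer_frequencies("ACGTACGTAC")
print(len(features))  # 4 + 16 + 64 + 256 = 340 features
```

Because every possible k-mer gets a slot, genomes of different lengths all map to the same 340-dimensional feature space.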
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_SU6GvBJsDS8MA9cs.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           VAE Architecture and Implementation
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This VAE architecture consisted of several dense layers for both encoder and decoder components, with a latent space dimension experimentally determined to be optimal at 32. This dimensionality provided sufficient capacity to capture genomic variation while constraining the model enough to force meaningful representations.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
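The core VAE mechanics can be sketched in a few lines, assuming a single linear "encoder" for brevity (the real model used several dense layers). The reparameterization trick and the KL term are what distinguish a VAE from a plain autoencoder:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w_mu, w_logvar):
    # Illustrative linear encoder; the actual model stacked dense layers
    return x @ w_mu, x @ w_logvar

def reparameterize(mu, logvar):
    # Sample z = mu + sigma * eps so gradients can flow through mu/logvar
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_divergence(mu, logvar):
    # KL(q(z|x) || N(0, I)): the regularizer that shapes the latent space
    return -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar), axis=1)

latent_dim = 32                       # the experimentally chosen bottleneck
x = rng.random((8, 340))              # batch of k-mer frequency vectors
w_mu = rng.standard_normal((340, latent_dim)) * 0.01
w_logvar = rng.standard_normal((340, latent_dim)) * 0.01
mu, logvar = encode(x, w_mu, w_logvar)
z = reparameterize(mu, logvar)
print(z.shape)  # (8, 32)
```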
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           By training this VAE on our K-mer frequency data, I effectively compressed thousands of dimensions into a manageable latent space where patterns and relationships became apparent. Visualizing this latent space through t-SNE further helped me identify clusters of similar genomes and track evolutionary trajectories over time.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
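The t-SNE projection step looks roughly like this, using scikit-learn on stand-in latent vectors (in the real pipeline the input would be the VAE-encoded genomes):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(2)
latent = rng.standard_normal((50, 32))  # stand-in for 32-d encoded genomes

# Project the latent space to 2-D for plotting; perplexity must stay
# below the number of samples
embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(latent)
print(embedding.shape)  # (50, 2)
```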
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           However, ribosomal frameshifting can return translation to the intended reading frame, preventing early termination. Color-coding the “slippery sequence” that signals frameshifting yields the following.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_E-Xis_m90hd2gdey.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           However, one of these patterns disappears from, or is introduced into, a few of the selected sequences, suggesting that the introduction of this pattern into the genome is rare but possible.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Domain-Specific Feature Extraction
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            From a data engineering perspective, this highlights a critical principle:
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           effective feature engineering requires domain knowledge
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      
           . While general-purpose techniques like K-mer frequency analysis provide a solid foundation, incorporating domain-specific features dramatically enhances model interpretability and performance.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           My feature engineering pipeline included several domain-informed components:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;ol&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            Slippery sequence detection
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        
            : I implemented pattern matching algorithms to identify heptameric sequences matching the pattern X_XXY_YYZ (where X, Y, and Z represent nucleotides) known to cause ribosomal frameshifting.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            Open Reading Frame (ORF) analysis
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        
            : I developed a sliding frame analysis tool that computed potential protein products across all three reading frames, tracking how mutations might alter protein expression.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            Restriction site identification
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        
            : I mapped recognition sequences for specific restriction enzymes across the genome, creating feature vectors that captured the distribution of these sites.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
  &lt;/ol&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           By visualizing these domain-specific features alongside the latent space representations, I gained insights that would have been impossible with statistical approaches alone. For instance, I discovered that certain regions of the genome maintained consistent restriction site patterns despite mutations elsewhere, suggesting functional constraints on evolution.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Data Visualization for Pattern Discovery
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The visualization of these domain-specific features required specialized approaches. I developed a custom visualization pipeline that:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;ol&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Mapped genomic sequences to color-coded representations where each nucleotide received a distinct color
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Aligned sequences along specific axes based on latent space coordinates
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Highlighted domain-specific features with overlay markers
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Generated interactive visualizations that allowed for dynamic exploration of the dataset
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
  &lt;/ol&gt;&#xD;
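Step 1 of the pipeline above, mapping nucleotides to colors, reduces to a small lookup; the specific RGB values here are arbitrary illustrative choices:

```python
import numpy as np

# Each nucleotide gets a distinct RGB color so a genome renders as an
# image strip; ambiguous bases (e.g. N) fall back to gray
COLORS = {"A": (0, 158, 115), "C": (0, 114, 178),
          "G": (230, 159, 0), "T": (213, 94, 0)}

def sequence_to_pixels(seq):
    return np.array([COLORS.get(base, (128, 128, 128)) for base in seq],
                    dtype=np.uint8)

strip = sequence_to_pixels("ACGTN")
print(strip.shape)  # (5, 3)
```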
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This visualization approach revealed subtle patterns in how genomic fragments "slide" relative to each other across variants, providing insights into the mechanics of genome reorganization.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Multi-Output Model Architecture
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           To correlate genomic changes with environmental variables, I modified the model architecture to produce multiple outputs:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;ol&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Primary output: Reconstruction of original K-mer data
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Secondary output: Environmental variable prediction (UVB radiation)
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
  &lt;/ol&gt;&#xD;
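In outline, the two heads share the latent code and their losses combine into a single objective. The weights and loss balance below are illustrative stand-ins, not the tuned values:

```python
import numpy as np

rng = np.random.default_rng(1)
latent_dim, kmer_dim = 32, 340

# Two linear heads sharing the latent code z: one reconstructs the k-mer
# vector, the other regresses the environmental variable (UVB radiation)
w_recon = rng.standard_normal((latent_dim, kmer_dim)) * 0.01
w_uvb = rng.standard_normal((latent_dim, 1)) * 0.01

def heads(z):
    return z @ w_recon, z @ w_uvb

def joint_loss(z, x, uvb, alpha=0.5):
    # Weighted sum of the two objectives; alpha balances the heads
    recon, pred = heads(z)
    recon_loss = np.mean((recon - x) ** 2)
    uvb_loss = np.mean((pred - uvb) ** 2)
    return recon_loss + alpha * uvb_loss

z = rng.standard_normal((8, latent_dim))
x = rng.random((8, kmer_dim))
uvb = rng.random((8, 1))
print(joint_loss(z, x, uvb))
```

Because both heads backpropagate into the same latent code, the environmental target nudges the latent space toward representations that also track UVB, which is what makes the later correlation analysis possible.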
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This approach allowed me to tie specific genomic patterns to environmental conditions by establishing relationships within the latent dimensions.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_b7rUlpj4kb8tPyVf.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The UVB model shows that the learned representation yields a more specific segmentation, as more clusters are obtained. However, the model heavily overfits the training data, suggesting that this particular architecture may not be appropriate for the problem at hand and that further optimization is needed to draw stronger conclusions. Even so, the model can still recover some relationship between the genome and the environmental data.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_QaG2LaL3j3lI7xq_.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Interpretability Through Feature Correlation
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           By correlating specific K-mer patterns with UVB radiation levels, I created interpretable connections between genomic features and environmental variables. Each pattern approximated a yearly dynamical signature, providing actionable insights for potential applications.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           I later expanded this to include atmospheric composition data, allowing the model to differentiate between direct solar radiation effects and atmospheric influences.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_O1BzQliQR5Hm6-L0.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The model overfits its training data, but some patterns emerge from the analysis. The model can differentiate between solar radiation data and atmospheric composition.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_6AIKobvqULAVZxnz.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Further optimization of the different models will enable a better understanding of how the viral genome adapts to the environment, as well as identification of recurrent mutational patterns and how they affect the genome. Identifying recurrent patterns could fast-track the development of new treatments and the design of seasonal or general vaccines. And while most of the analysis is done using SARS-CoV-2 genomic data, different parts of the analysis can likely be repurposed for other viruses, particularly non-fragmented ones.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_w0nms-TXEFDW7iMZ.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Differential Response to Solar vs. Atmospheric Variables
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           One of my most intriguing findings was that the multi-output model learned to differentiate between direct solar radiation effects and those mediated by atmospheric composition. The model developed distinct latent space representations for these different influences.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Certain regions of the latent space corresponded to specific atmospheric conditions, while others were more responsive to direct solar radiation metrics. This finding has profound implications for understanding how different environmental factors drive genomic adaptation.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Actionable Insights from Atmospheric Analysis
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The integration of atmospheric composition data yielded several actionable insights:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;ol&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            Differential susceptibility
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        
            : I identified specific genomic regions more susceptible to atmospheric-mediated vs. direct solar effects.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            Predictive signatures
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        
            : By analyzing the latent space, I could predict which K-mer patterns would be more prevalent under specific atmospheric conditions.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            Geographic implications
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        
            : The model revealed how genomic adaptation might vary across different geographic regions with similar solar exposure but different atmospheric compositions.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
  &lt;/ol&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           These findings demonstrate the value of a comprehensive data engineering approach that integrates multiple data sources and uses sophisticated modeling techniques to uncover hidden relationships. As always, the code for this post can be found by clicking 
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://github.com/wmac131/genome_sequencing" target="_blank"&gt;&#xD;
      
           here
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           .
           &#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Fri, 22 Dec 2023 15:31:58 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/deterministic-methods-for-sars-cov2-classification</guid>
      <g-custom:tags type="string">cluster,cnn,data analysis,bioinformatics,machine learning,classification,Python</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_b7rUlpj4kb8tPyVf.webp">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_b7rUlpj4kb8tPyVf.webp">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>Biden's Executive Order on AI: It's a start</title>
      <link>https://www.williammcnamara3.com/biden-executive-order-on-ai-it-s-a-start</link>
      <description />
      <content:encoded>&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           A good step but not nearly enough
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Last week President Biden signed Executive Order 14110 on the Development and Use of Artificial Intelligence, the most comprehensive framework for the regulation of AI ever to become law in the United States. I should note that the order is not an Act of Congress, so the only people bound to follow it are executive agencies. But what most interested me is that, in contrast to former President Trump’s Executive Order to promote AI development, Biden’s order does the same but with a measured caution to implement AI safely and responsibly.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           I want to unpack what I see as a few critical gaps missed by the Biden Administration’s EO, and why I think these gaps will make it difficult to effectively regulate the areas the EO is trying to cover. I will cover these in a few focus areas.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The Civil Rights Division of the Department of Justice is directed to address algorithmic discrimination in federal technology systems
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           I was really glad to see this. It’s important to first accept that while AI adds efficiency to many systems, that efficiency can also lead to predictive determinations that discriminate against disadvantaged groups. For example, automated rejection from health care programs, criminal sentencing, housing programs, and credit checks; these are all things where everyone needs to be given their fair shot regardless of what an algorithm predicts for them.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The EO provides direction here, but it doesn’t specifically address how algorithmic discrimination will be investigated. In many cases, these algorithms may be high-dimensional black boxes, which makes it difficult to discern what criteria an algorithm considers significant in a given classification. Even AI engineers currently lack the tools to fully understand the assessments these models are making, so in an adversarial regulatory scenario, I think the DOJ would have its work cut out for it to prove causal discrimination in any system.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Private companies will be compelled to share the results of safety tests on AI systems that are tied to national security or critical infrastructure
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This is great, and a unique provision of the EO because I believe it is the only one compelling private companies to do anything. The only reason it is able to do this without being an act of congress is by invoking the Defense Production Act, which is also why this regulation is limited in scope to national security.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The problem here is that private companies will not be compelled to do much of anything else to comply with this EO. Civilian agencies are being asked to develop “voluntary standards” that will effectively be recommendations for the private sector, as opposed to anything they are forced to follow. I think the large majority of AI use cases at private companies that actually impact people’s lives will fall outside the scope of this EO. And that’s a shame.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The Department of Commerce is directed to develop standards for watermarking AI-generated content
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Okay….and? This, I think, is the biggest failure of this Executive Order. It is effectively the provision giving lip service to disinformation and deepfakes becoming more ubiquitous in political discourse. But when the Department of Commerce creates these standards, companies will not be compelled to adopt them, effectively making them toothless. This EO doesn’t go nearly far enough on this issue, and maybe it can’t without Congress, but I do not see this as meaningful progress on what will continue to be a real and existential threat to our democracy.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Which brings me to my thematic conclusion. I am glad to see the Biden Administration recognizing the opportunities offered by artificial intelligence, as well as the threats. Artificial intelligence will touch every area of our lives, and an Executive Order couldn’t possibly cover all the areas that require aggressive regulation. I believe the purpose of this EO isn’t even to fully address these problems within the executive branch, though I hope it will; it is a letter to Congress outlining a few important focus areas that they will ultimately need to legislate on.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Tue, 07 Nov 2023 23:40:13 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/biden-executive-order-on-ai-it-s-a-start</guid>
      <g-custom:tags type="string">policy</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/download.jpeg">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/download.jpeg">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>Dimensionality Expansion for Genome Sequencing</title>
      <link>https://www.williammcnamara3.com/dimensionality-expansion-for-genome-sequencing</link>
      <description>Introductory methods for genome sequencing</description>
      <content:encoded>&lt;h3&gt;&#xD;
  
         Joining my dimensionality expansion methods to my bioinformatics research
        &#xD;
&lt;/h3&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           As data scientists, we've all hit that frustrating wall when working with language models or sequence analysis—the dreaded sequence length limitation. I've spent countless hours optimizing attention mechanisms and tweaking model architectures, only to gain marginal improvements in handling longer texts or genomic sequences. But what if we've been tackling this problem from the wrong angle all along?
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Breaking Free from One-Dimensional Thinking
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Recently, I've been exploring an alternative approach that has transformed how I work with large sequences. Rather than continually optimizing model architectures (which seems to be where most research focuses), I started questioning our fundamental representation of sequential data.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Think about it: why do we insist on representing sequences as one-dimensional arrays? In the physical world, we don't store large amounts of text in a single, unbroken line. Instead, we arrange words on pages and bind pages into books. This dimensional organization isn't just convenient—it's efficient.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This realization led me to a simple but powerful insight: what if we increased the dimensionality of our sequence representation?
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           From Lines to Planes to Volumes
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Let me walk you through the conceptual progression: A single sentence is essentially a one-dimensional representation—a linear sequence of tokens. A page of text is two-dimensional, organizing that sequence into rows and columns. A book takes this further into three dimensions, stacking those pages into a compact volume.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           With each increase in dimensionality, we gain a more efficient way to represent and transport the same information. This isn't just a physical convenience—it suggests a fundamentally different way to encode data for our models.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The Perfect Testing Ground: Biological Sequences
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           While working on a genomics project, I realized biological sequences offer the perfect testing ground for this approach for two key reasons. First, we have access to vast amounts of genomic data; second, biological sequences (particularly DNA and RNA) have a remarkably limited alphabet. The four-letter alphabet of DNA (A, T, G, C) is particularly appealing from an encoding perspective. We can easily one-hot encode each nucleotide, but the magic happens in how we arrange these encodings. Instead of feeding a genomic sequence to a model as a 1D array, I reshape it into a 2D array, essentially creating an "image" where the (x, y) position encodes the sequence order and the channel dimension encodes the specific nucleotide at that position.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
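The encoding step described above can be sketched in a few lines of NumPy. This is a minimal sketch with a toy 16-base sequence; the A/T/G/C channel ordering is my own assumption, not necessarily the ordering used in the project:

```python
import numpy as np

# A toy genomic sequence; real data would come from a FASTA file.
seq = "ATGCGATTACAGGCTA"  # 16 bases, so it reshapes cleanly into a 4x4 grid

# Map each nucleotide to a channel index (assumed A/T/G/C ordering).
lookup = {"A": 0, "T": 1, "G": 2, "C": 3}

# One-hot encode: shape (length, 4), one channel per nucleotide.
onehot = np.zeros((len(seq), 4), dtype=np.float32)
for i, base in enumerate(seq):
    onehot[i, lookup[base]] = 1.0

# Reshape the linear sequence into a 2D "image": (rows, cols, channels).
# The (row, col) position encodes sequence order; the channel encodes the base.
side = int(np.sqrt(len(seq)))
image = onehot.reshape(side, side, 4)
print(image.shape)  # (4, 4, 4)
```

Because NumPy reshapes in row-major order, reading the image row by row recovers the original sequence order exactly.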
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Putting Theory Into Practice: The HIV Genome
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           To test this approach, I downloaded HIV genome sequences from the NCBI database. The average HIV genome is around 9,000 bases long, which conveniently reshapes into a 24×24×16 array (plus our channel dimension for the nucleotide encoding).
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Since the final array has more elements than the typical genome, I applied a small amount of padding. With this transformation complete, I could now leverage architectures designed for image processing!
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
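The padding-and-reshape step might look like the following. This is a rough sketch assuming zero-padding and a random stand-in for a real HIV genome:

```python
import numpy as np

rng = np.random.default_rng(0)
genome = rng.choice(list("ATGC"), size=9000)  # stand-in for a ~9,000-base HIV genome

target = 24 * 24 * 16   # 9,216 positions in the 24x24x16 volume
pad = target - len(genome)  # 216 positions of zero-padding

# One-hot encode with a 4-wide channel axis; padded positions stay all-zero.
lookup = {"A": 0, "T": 1, "G": 2, "C": 3}
onehot = np.zeros((target, 4), dtype=np.float32)
for i, base in enumerate(genome):
    onehot[i, lookup[base]] = 1.0

# Reshape into the 3D volume plus the channel dimension.
volume = onehot.reshape(24, 24, 16, 4)
print(volume.shape)  # (24, 24, 16, 4)
```

The all-zero padding rows are easy for a convolutional model to ignore, and the resulting array can be fed directly to image-style architectures.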
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_aQfdH4chucOdgFrg.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           I trained a variational convolutional autoencoder on this reshaped data, which not only performed dimensionality reduction but also naturally clustered similar sequences together—all without requiring labeled data.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_GRjBX1La5g3DIIWFVgmKJQ.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The most exciting part? The generative component of the model successfully reconstructed sequences, allowing us to compare them against fragments and evaluate reconstruction quality.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_NMO7O9XICYT1maTW.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Expanding to Dengue and SARS-CoV-2
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            ﻿
           &#xD;
      &lt;/span&gt;&#xD;
      
           Encouraged by these results, I applied the same approach to Dengue virus genomes and observed similar clustering by sequence similarity.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_hvHo5YiDVuJg-z9IvnZUkQ.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The flexibility offered by expanding the number of dimensions yields a quick, simple way to analyze large sequence datasets, even when they do not fit into memory: the transformation itself is fast, and the genomic sequences can be loaded and fed to the network in batches. This is exactly the situation when analyzing the virus behind the recent pandemic.
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            ﻿
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_hWilg96Uo-S0SJ0D.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The real test came when working with SARS-CoV-2 data. Given the volume of sequences, I implemented a batch processing approach where genomes were loaded, transformed, and fed to a convolutional autoencoder on the fly.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
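The on-the-fly batch pipeline might look something like this. It is a hypothetical sketch: `encode` and `batches` are illustrative helpers of my own naming, and the FASTA parsing is deliberately minimal:

```python
import numpy as np

lookup = {"A": 0, "T": 1, "G": 2, "C": 3}

def encode(seq, target=24 * 24 * 16):
    """One-hot encode and zero-pad a sequence, then reshape it to a 3D volume."""
    onehot = np.zeros((target, 4), dtype=np.float32)
    for i, base in enumerate(seq[:target]):
        idx = lookup.get(base)  # skip ambiguous bases such as N
        if idx is not None:
            onehot[i, idx] = 1.0
    return onehot.reshape(24, 24, 16, 4)

def batches(paths, batch_size=32):
    """Yield transformed batches on the fly, so the dataset never sits in memory."""
    batch = []
    for path in paths:
        with open(path) as handle:
            seq = "".join(line.strip() for line in handle if not line.startswith(">"))
        batch.append(encode(seq))
        if len(batch) == batch_size:
            yield np.stack(batch)
            batch = []
    if batch:
        yield np.stack(batch)
```

A generator like this can be handed straight to a training loop, so each batch of genomes is loaded, transformed, and consumed without ever materializing the full dataset.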
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_tzZ4_vBQncmp_by-.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_ztJY5zGAsBE6mGonY5GAXA.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This not only clustered the sequences effectively but revealed something fascinating—distinct mutation patterns specific to each cluster. When visualizing the changes across the dimension that sorts the sequences, these patterns became clearly visible, suggesting specific evolutionary pathways the virus follows to adapt to environmental pressures.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_eMioReIuiYQlMhRn.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Why This Matters for Data Scientists
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This dimensional transformation approach offers several practical advantages for anyone working with large sequential data. Perhaps most importantly, it demonstrates how rethinking data representation can sometimes be more productive than endlessly tweaking model architectures.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           While I've focused on genomic applications, this approach has potential anywhere we deal with long sequences, including protein structures, which I'm excited to explore in the near future. The key insight is that dimensionality can be a feature, not a limitation. By reshaping our data thoughtfully, we open up new architectural possibilities and processing efficiencies.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           More broadly, I believe we've only scratched the surface of what's possible with dimensional transformations of sequential data. As we continue to face challenges with increasingly large sequences, rethinking our fundamental data structures may prove more fruitful than continually scaling model architectures.
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Tue, 01 Aug 2023 22:02:31 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/dimensionality-expansion-for-genome-sequencing</guid>
      <g-custom:tags type="string">cluster,cnn,data analysis,bioinformatics,machine learning,classification,Python,expansion</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_ztJY5zGAsBE6mGonY5GAXA.webp">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_ztJY5zGAsBE6mGonY5GAXA.webp">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>The World of Multimodal Foundation Models</title>
      <link>https://www.williammcnamara3.com/the-world-of-multimodal-foundation-models</link>
      <description />
      <content:encoded>&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           A primer on foundation models: what they are, how they've evolved, and where they're going.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            The success of
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://arxiv.org/abs/2108.07258" target="_blank"&gt;&#xD;
      
           foundation models
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            such as
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://arxiv.org/abs/1810.04805" target="_blank"&gt;&#xD;
      
           BERT
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            ,
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://openai.com/blog/gpt-3-apps/" target="_blank"&gt;&#xD;
      
           GPT-3
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            ,
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://arxiv.org/abs/2103.00020" target="_blank"&gt;&#xD;
      
           CLIP
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            , and
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://openai.com/blog/openai-codex" target="_blank"&gt;&#xD;
      
           Codex
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      
           has generated increased interest in models that combine vision and language modalities. These hybrid vision-language models have demonstrated impressive capabilities in challenging tasks, including image captioning, image generation, and visual question answering. More recently, a new paradigm has emerged: video foundation models, which apply the same principles to learning from video data.
           &#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This blog post provides an overview of foundation models, large language and vision-language models, and video foundation models. I'll review the architecture of foundation models as well as their training, fine-tuning paradigm, and scaling laws. Additionally, I'll review how vision-language models combine the power of computer vision and natural language processing and how they are being used to solve complex problems. Finally, I'll introduce video foundation models and how they are revolutionizing the understanding and analysis of video data.
           &#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h2&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Intro to Foundation Models
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h2&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           A foundation model is a type of machine learning model that learns from a wide range of data using self-supervision at scale
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      
           . The idea is to create a model that can be used for many different tasks. By training on lots of data, the model can learn the general patterns in the data. When the model is used for a specific task, it can use this knowledge to quickly adapt.
           &#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Foundation models use
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           deep neural networks
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            , which have been popular since 2012, and
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           self-supervised learning
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      
           , which has been around for almost as long. Recent improvements in both areas have allowed for the creation of larger and more complex models. These models are trained on massive amounts of data, often without explicit labels.
           &#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The result is a model that can learn a wide range of patterns and relationships, which can be used for many tasks. This has led to significant improvements in natural language processing, computer vision, and multimodal AI. With foundation models, we can create one model that can be used for many tasks, rather than creating different models for each task. This will save time, resources and speed up progress in many fields.
           &#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Transfer Learning
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Traditional machine learning (ML) models are trained from scratch (or nearly so) and require large domain-specific datasets to perform well. However, if you only have a small amount of data, you can leverage
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://www.ruder.io/transfer-learning/" target="_blank"&gt;&#xD;
      
           transfer learning
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           . The idea of transfer learning is to take the "knowledge" learned from one task and apply it to another task, so that you don't require as much data as you would if you were to train from scratch. For deep neural networks, pre-training is the dominant approach to transfer learning: you train the model on an original task (e.g., detecting cars on the street) and fine-tune it on a downstream task of interest (e.g., detecting a black Tesla Model 3).
           &#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            As you can imagine, this is a very useful mechanism for computer vision. Most commonly, a practitioner will take a model pre-trained on ImageNet, keep most of its layers, and retrain only the top few layers with newly learned weights. Alternatively, one can fine-tune the model end-to-end. Some of the most popular pre-trained models for computer vision tasks include
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://pytorch.org/hub/pytorch_vision_alexnet/" target="_blank"&gt;&#xD;
      
           AlexNet
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            ,
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://pytorch.org/hub/nvidia_deeplearningexamples_resnet50/" target="_blank"&gt;&#xD;
      
           ResNet
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            ,
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://pytorch.org/hub/pytorch_vision_mobilenet_v2/" target="_blank"&gt;&#xD;
      
           MobileNet
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            ,
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://pytorch.org/hub/pytorch_vision_inception_v3/" target="_blank"&gt;&#xD;
      
           Inception
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            ,
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://pytorch.org/hub/nvidia_deeplearningexamples_efficientnet/" target="_blank"&gt;&#xD;
      
           EfficientNet
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            , and
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://pytorch.org/hub/ultralytics_yolov5/" target="_blank"&gt;&#xD;
      
           YOLO
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           .
           &#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            In natural language processing (NLP), pre-training was initially limited only to the first step:
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           word embeddings
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            . The input to a language model is words. One way to encode each word as a vector is through
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html" target="_blank"&gt;&#xD;
      
           one-hot encoding
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           . Given a large vocabulary, you can learn an embedding matrix that maps each word into a real-valued vector space, typically with a few hundred to a thousand dimensions. Theoretically, those dimensions correspond to semantic notions.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2026-01-27+at+7.54.24-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    
          As an example, consider
          &#xD;
    &lt;span&gt;&#xD;
      
            a model built by
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://jalammar.github.io/illustrated-word2vec/" target="_blank"&gt;&#xD;
      
           Word2Vec
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           :
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      
           it looks at which words frequently co-occur, and its learning objective is to maximize cosine similarity between their embeddings. When you embed the words "king," "man," and "woman," you can do vector arithmetic (king - man + woman) to get a vector that is close to the word "queen" in this embedding space.
           &#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
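This arithmetic can be illustrated with a toy embedding space. The 2D vectors below are hand-picked purely for illustration; real Word2Vec embeddings have hundreds of dimensions learned from co-occurrence statistics:

```python
import numpy as np

# Toy 2D embeddings (axes roughly: "royalty", "gender"), chosen by hand.
emb = {
    "king":  np.array([0.9,  0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1,  0.8]),
    "woman": np.array([0.1, -0.8]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman lands near queen in this space.
target = emb["king"] - emb["man"] + emb["woman"]
best = max(("queen", "man", "woman"), key=lambda w: cosine(target, emb[w]))
print(best)  # queen
```

In this contrived space the analogy is exact; in a learned embedding space the result vector is merely nearest to "queen" among all vocabulary words.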
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            It's useful to see more context to correctly embed words because words can play different roles in a sentence depending on their context. If you do this, you'll improve accuracy on all downstream tasks. In the last few years several models, including
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://arxiv.org/abs/1802.05365" target="_blank"&gt;&#xD;
      
           ELMo
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            ,
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://arxiv.org/abs/1801.06146" target="_blank"&gt;&#xD;
      
           ULMFiT
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            , and
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf" target="_blank"&gt;&#xD;
      
           GPT
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           , have empirically demonstrated how language modeling can be used for pre-training. All three methods employed pre-trained language models to achieve state-of-the-art results on a diverse range of tasks in NLP, including text classification, question answering, natural language inference, coreference resolution, sequence labeling, and many others.
           &#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Transformers: The Underlying Architecture For Foundation Models
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2026-01-27+at+7.56.19-PM.png" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Prior to Transformers, the state of the art in NLP was based on recurrent neural networks (RNNs), such as LSTMs and the widely used
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://arxiv.org/pdf/1409.3215.pdf" target="_blank"&gt;&#xD;
      
           Seq2Seq
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://arxiv.org/pdf/1409.3215.pdf" target="_blank"&gt;&#xD;
      
           architecture
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           , which processed data sequentially – one word at a time, in the order that the words appeared.
           &#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            The innovation delivered by transformers is to parallelize language processing. This allows all the tokens in a given body of text to be analyzed simultaneously, rather than in sequence. Transformers rely on an AI mechanism known as
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           attention
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      
           to support this parallelization. Attention enables a model to consider the relationships between words, even if they are far apart in a text, and to determine which words and phrases in a passage are most important to pay attention to.
           &#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
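The attention mechanism that enables this parallelism can be sketched in NumPy. This is the standard scaled dot-product form, not any specific model's implementation:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention over a whole sequence at once."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # token-to-token relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V   # each output is a weighted mix of all value vectors

rng = np.random.default_rng(0)
seq_len, d = 5, 8                      # 5 tokens, 8-dim representations
Q = rng.standard_normal((seq_len, d))  # queries
K = rng.standard_normal((seq_len, d))  # keys
V = rng.standard_normal((seq_len, d))  # values
out = attention(Q, K, V)
print(out.shape)  # (5, 8)
```

Note that every token attends to every other token in one matrix multiply, which is what makes the computation parallel but also quadratic in sequence length.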
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Parallelization also makes transformers much
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           more computationally efficient than RNNs
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      
           , allowing them to be trained on larger datasets and built with more parameters. Today's transformer models are characterized by their massive size.
           &#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Vision Transformers
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Convolutional Neural Networks have been the dominant architecture in the field of computer vision. However, given the success of Transformers in NLP, researchers started adapting this architecture to image data. Enter the Vision Transformer (ViT) architecture, which applies the encoder block of the Transformer architecture to the image classification problem.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2026-01-27+at+7.57.16-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The idea is to split an image into patches and provide the sequence of linear embeddings of these patches as input to a Transformer. Similar to tokens in the NLP setting, these image patches are treated as inputs. The architecture includes a stem that patches images, a body based on the Multi-Layer Transformer encoder, and a Multi-Layer Perceptron (MLP) head that transforms the global representation into the output label. The end result is that ViT matches or exceeds state-of-the-art results on many image classification datasets while being relatively inexpensive to pre-train.
           &#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
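The patching step can be sketched with a single reshape-and-transpose in NumPy. This is a minimal sketch using ViT-Base-style 16x16 patches on a 224x224 image (the values here are random):

```python
import numpy as np

# Split a 224x224 RGB image into 16x16 patches and flatten each one,
# producing the token sequence a ViT consumes.
image = np.random.rand(224, 224, 3).astype(np.float32)
patch = 16
n = 224 // patch   # 14 patches per side

# (224, 224, 3) to (14, 16, 14, 16, 3) to (14, 14, 16, 16, 3) to (196, 768)
patches = image.reshape(n, patch, n, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(n * n, patch * patch * 3)
print(patches.shape)  # (196, 768)

# A learned linear projection would then map each 768-dim patch vector to the
# model width, and a position embedding would be added per patch.
```

Each row of `patches` is one flattened 16x16x3 patch, so the image becomes a sequence of 196 "tokens" of dimension 768.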
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           But they aren't without their problems. One significant issue is high-resolution images: the cost of self-attention grows quadratically with the number of tokens, so the required compute increases rapidly with image size. Additionally, the fixed-scale tokens in ViTs are not useful for tasks that involve visual elements of varying sizes.
           &#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Transformer Variants
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           A flurry of research work followed the original Transformer architecture, and most of them made enhancements to the standard Transformer architecture in order to address the above-mentioned shortcomings.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2026-01-27+at+7.58.41-PM.png" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;a href="https://arxiv.org/abs/2103.14030" target="_blank"&gt;&#xD;
      
           Swin Transformer
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           s have become very useful in this regard: general-purpose Transformers that can serve as a backbone for a range of vision tasks. The Swin Transformer introduced two key concepts: hierarchical feature maps and shifted window attention.
           &#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;ol&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        &lt;span&gt;&#xD;
          
             The model uses
            &#xD;
        &lt;/span&gt;&#xD;
      &lt;/span&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            hierarchical feature maps
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        &lt;span&gt;&#xD;
          
             to enable advanced techniques for dense prediction. It achieves linear computational complexity by computing self-attention locally within non-overlapping windows that partition an image. This makes Swin Transformers a good backbone for various vision tasks.
            &#xD;
        &lt;/span&gt;&#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        &lt;span&gt;&#xD;
          
             The use of
            &#xD;
        &lt;/span&gt;&#xD;
      &lt;/span&gt;&#xD;
      &lt;strong&gt;&#xD;
        
            shifted windows
           &#xD;
      &lt;/strong&gt;&#xD;
      &lt;span&gt;&#xD;
        
            enhances modeling power by bridging windows of the preceding layer. The strategy is also efficient in terms of real-world latency: all query patches within a window share the same key set, making memory access in hardware easier.
            &#xD;
        &lt;br/&gt;&#xD;
        
            ‍
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
  &lt;/ol&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;a href="https://arxiv.org/abs/2103.03206" target="_blank"&gt;&#xD;
      
           Perceiver
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            is another Transformer variant recently created by DeepMind that
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           takes inspiration from biological systems
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      
           . It uses attention-based principles to process various types of input, including images, videos, audio, and point clouds. It can also handle combinations of multiple types of input without relying on specific assumptions about the domain.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2026-01-27+at+8.01.00-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The Perceiver architecture introduces a small set of latent units that forms an attention bottleneck. This eliminates the problem of all-to-all attention and allows for very deep models. It attends to the most relevant inputs, informed by previous steps. However, in multimodal contexts, it is important to distinguish input from one modality or another. To compensate for the lack of explicit structures, the model associates position and modality-specific features with every input element, similar to the labeled line strategy used in biological neural networks.
           &#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h2&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Large Language Models
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h2&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Following the original Transformer paper, a flurry of innovation occurred as leading AI researchers built upon this foundational breakthrough - starting with the NLP domain.
           &#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;a href="https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf" target="_blank"&gt;&#xD;
      
           GPT
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            and 
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf" target="_blank"&gt;&#xD;
      
           GPT-2
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            came out a few years ago. The name stands for “generative pre-trained Transformer.” They are
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           decoder-only
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            models and use
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           masked self-attention
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      
           . This means that at any point in the output sequence, the model can only attend to the input sequence vectors that came before that point. While GPT embeddings can also be used for classification, the GPT approach is at the core of today’s most well-known LLMs, such as ChatGPT.
           &#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
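To make the masking concrete, here is a minimal NumPy sketch of causal (masked) self-attention for a single head. It is an illustration under simplified assumptions, not GPT's actual implementation:

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Scaled dot-product attention where position t may only attend to
    positions 0..t; scores for future positions are masked out."""
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores[mask] = -np.inf                            # hide the future
    # Row-wise softmax; masked entries contribute exp(-inf) = 0.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights, weights @ V
```

Each row of `weights` sums to 1 and is zero above the diagonal, which is exactly the "only look backwards" constraint described above.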
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           GPT-2 was trained on 8 million web pages, with the largest model having 1.5 billion parameters. Its training task was simply predicting the next word in all of this web text. The authors found that performance improves steadily as the number of parameters grows.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2026-01-27+at+8.02.56-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;a href="https://arxiv.org/abs/1810.04805" target="_blank"&gt;&#xD;
      
           BERT
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
             came out around the same time; the name stands for Bidirectional Encoder Representations from Transformers. With 110 million parameters, it is an encoder-only Transformer designed for predictive modeling tasks and introduces the original concept of
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           masked-language modeling
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      
           . During training, BERT masks out random words in a sequence and has to predict whatever the masked word is.
           &#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
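The corruption step can be sketched in plain Python. This hypothetical `mask_tokens` helper only swaps in [MASK] tokens and skips BERT's actual 80/10/10 replacement scheme:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens; the model must predict the originals."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = corrupted[i]   # what the model should recover
        corrupted[i] = "[MASK]"
    return corrupted, targets
```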
  &lt;p&gt;&#xD;
    &lt;a href="https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html" target="_blank"&gt;&#xD;
      
           T5 (Text-to-Text Transformer)
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
             came out in 2020. The input and output are both text strings, so you can specify the task the model is supposed to perform. T5 has an encoder-decoder architecture. It was trained on
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://www.tensorflow.org/datasets/catalog/c4" target="_blank"&gt;&#xD;
      
           the C4 dataset
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      
           (Colossal Clean Crawled Corpus), which is roughly 100x larger than Wikipedia. The largest T5 variant has around 11 billion parameters.
           &#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h2&gt;&#xD;
    &lt;span&gt;&#xD;
      
           ‍The Rise of Large Vision-Language Models
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h2&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Thanks to the Vision Transformer architecture, there has been increased interest in models that combine vision and language modalities. These hybrid
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           vision-language models
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            have demonstrated impressive capabilities in challenging tasks such as image captioning, image generation, and visual question answering. Typically, they consist of three key elements:
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           an image encoder
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            ,
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           a text encoder
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            , and
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           a strategy to fuse information from the two encoders
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      
           . I'll review here some of the most well-known models in vision-language model research over the past two years.
           &#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            In 2021, OpenAI introduced
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://arxiv.org/abs/2103.00020" target="_blank"&gt;&#xD;
      
           CLIP (Contrastive Language–Image Pre-training)
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            . The input to CLIP is 400 million image-text pairs crawled from the internet. It encodes text using Transformers, encodes images using Vision Transformers, and applies
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           contrastive learning
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            to train the model. Contrastive training matches correct image and text pairs using cosine similarity.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
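To illustrate what the contrastive objective optimizes, here is a toy NumPy sketch of a symmetric InfoNCE-style loss over a batch of matched image/text embeddings. The function and temperature value are illustrative, not CLIP's actual code:

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Row i of each matrix is a matched image/text pair. The loss pushes up
    the cosine similarity of matched pairs and pushes down all mismatches."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature     # pairwise cosine similarities
    labels = np.arange(len(logits))        # correct pairs sit on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```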
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2026-01-27+at+8.39.18-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           With this powerful trained model, you can map images and text using embeddings, even on unseen data. There are two ways to do this. One way is to use a "
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           linear probe
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      
           " by training a simple logistic regression model on top of the features that CLIP outputs after performing inference. Alternatively, you can use a "
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           zero-shot
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      
           " technique that encodes all the text labels and compares them to the encoded image. The linear probe approach is slightly better.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
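The zero-shot technique amounts to a nearest-neighbor lookup in the shared embedding space. A minimal sketch, assuming the image and label embeddings have already been produced by CLIP's encoders:

```python
import numpy as np

def zero_shot_classify(image_emb, label_embs, label_names):
    """Pick the text label whose embedding is most cosine-similar to the image."""
    img = image_emb / np.linalg.norm(image_emb)
    labels = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    sims = labels @ img                    # one cosine similarity per label
    return label_names[int(np.argmax(sims))]
```

In practice each label is first wrapped in a prompt such as "a photo of a dog" before being encoded.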
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           To clarify, CLIP does not directly go from image to text or vice versa. It uses embeddings. However, this embedding space is extremely useful for performing searches across modalities.
           &#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;a href="https://arxiv.org/abs/2205.01917" target="_blank"&gt;&#xD;
      
           CoCa
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            , or
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           Contrastive Captioner
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      
           , is another foundation model by Google that combines contrastive learning (CLIP) and generative learning (SimVLM). It uses an encoder-decoder architecture that has been modified and trained with both contrastive loss and captioning loss. This allows it to learn global representations from unimodal image and text embeddings, as well as fine-grained region-level features from the multimodal decoder.
           &#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            In 2022, DeepMind created a family of visual language models called
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://arxiv.org/abs/2204.14198" target="_blank"&gt;&#xD;
      
           Flamingo
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            . These models can do many different things, even with just a few examples of input and output. They have two parts:
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           a vision model that can understand visual scenes
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            , and
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           a language model that helps with reasoning
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      
           . The models use their pre-training knowledge to work together. Flamingo models can also ingest high-resolution images or videos thanks to a Perceiver-based architecture (discussed in the section on Transformer variants) that can analyze a large number of visual input features and produce a small number of visual tokens.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Thanks to these new architectural innovations, the Flamingo models can connect strong pre-trained models for vision and for language, handle sequences of mixed visual and text data, and easily use images and videos as input. The Flamingo-80B, the biggest version with 80 billion parameters, set a new record in few-shot learning for many tasks that involve understanding language, images, and videos.
           &#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Microsoft, Google, and OpenAI released their own versions of large vision-language models over the past few weeks, further propelling the trend toward multimodal AI.
           &#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;ul&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        &lt;span&gt;&#xD;
          
             Microsoft released
            &#xD;
        &lt;/span&gt;&#xD;
      &lt;/span&gt;&#xD;
      &lt;a href="https://arxiv.org/abs/2302.14045" target="_blank"&gt;&#xD;
        
            Kosmos-1
           &#xD;
      &lt;/a&gt;&#xD;
      &lt;span&gt;&#xD;
        
            , a multimodal language model that can perceive different modalities, learn in context, and follow instructions. The model generates text based on the previous context and handles text and other modalities using a Transformer-based causal language model. It was trained on various types of data and has performed well in different scenarios, including understanding and creating language, recognizing images, and answering questions based on images.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        &lt;span&gt;&#xD;
          
             Google's
            &#xD;
        &lt;/span&gt;&#xD;
      &lt;/span&gt;&#xD;
      &lt;a href="https://palm-e.github.io/" target="_blank"&gt;&#xD;
        
            PaLM-E
           &#xD;
      &lt;/a&gt;&#xD;
      &lt;span&gt;&#xD;
        &lt;span&gt;&#xD;
          
             is an embodied multimodal language model that can handle various reasoning tasks based on observations from different sources and embodiments, spanning internet-scale language, vision, and visual-language domains. The biggest PaLM-E model, PaLM-E-562B, has 562 billion parameters and can perform zero-shot reasoning, such as explaining a joke from an image, as well as robot tasks involving perception, dialogue, and planning.
            &#xD;
        &lt;/span&gt;&#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        &lt;span&gt;&#xD;
          
             Lastly, OpenAI’s
            &#xD;
        &lt;/span&gt;&#xD;
      &lt;/span&gt;&#xD;
      &lt;a href="https://openai.com/research/gpt-4" target="_blank"&gt;&#xD;
        
            GPT-4
           &#xD;
      &lt;/a&gt;&#xD;
      &lt;span&gt;&#xD;
        &lt;span&gt;&#xD;
          
             is a large multimodal model capable of processing image and text inputs and producing text outputs. It scored in the 90th percentile on a simulated bar exam and in the 99th percentile (with vision) on the Biology Olympiad semifinal exam.
            &#xD;
        &lt;/span&gt;&#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
  &lt;/ul&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h2&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Conclusion
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h2&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Foundation models are becoming multimodal. As
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://www.indexventures.com/perspectives/foundation-models-will-eventually-be-the-basis-of-all-ai-powered-software/" target="_blank"&gt;&#xD;
      
           foundation models will eventually serve as the basis of all AI-powered software
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           , developers will increasingly start from pre-trained foundation models and fine-tune them on narrow tasks. The most difficult situations for these models, however, are the "long-tail" events they have not seen before, and these will only get harder to handle in multimodal settings.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h2&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h2&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Sun, 02 Apr 2023 01:55:27 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/the-world-of-multimodal-foundation-models</guid>
      <g-custom:tags type="string">classifier,classification,machine learning,professional</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot-2026-01-27-at-8.39.18-PM.png">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2026-01-27+at+8.39.18-PM.png">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>Fun with Spotify's API</title>
      <link>https://www.williammcnamara3.com/fun-with-spotify-api</link>
      <description />
      <content:encoded>&lt;h3&gt;&#xD;
  
         Solving the tragic expiration of good recommendations with a little Python.
        &#xD;
&lt;/h3&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Like many music enthusiasts, I use Spotify more than any other app on my phone. One of my favorite features is its daily and weekly curated playlists based on your listening tastes. Spotify users can get as many as six curated ‘Daily Mixes’ of 50 songs each, plus a ‘Discover Weekly’ of 30 songs refreshed every Monday. That’s more than 2,000 songs a Spotify user can be recommended in a given week. Assuming an average of 3 minutes per song, even a dedicated user would need to spend more than 15 hours a day to listen to all of that content. That…wouldn’t be healthy.
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          But Spotify’s recommendations are good! And I always feel like I’m losing something when these curated playlists expire before I can enjoy all or even most of the songs they contain. Or at least I did, until I found a way around it.
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          In this article, I’m going to take you through Spotify’s API and show how you can solve this problem with some beginner-to-intermediate Python skills.
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;b&gt;&#xD;
      
           Introduction to Spotify’s API
          &#xD;
    &lt;/b&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          Spotify has made several public APIs for developers to interact with their application. Some of the marketed use cases are exploring Spotify’s music catalogue, queuing songs, and creating playlists.
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          You can credential yourself using this
          &#xD;
    &lt;a href="https://developer.spotify.com/documentation/web-api/tutorials/getting-started" target="_blank"&gt;&#xD;
      &lt;font&gt;&#xD;
        
            documentation guide
           &#xD;
      &lt;/font&gt;&#xD;
    &lt;/a&gt;&#xD;
    
          . I’d walk you through it myself but I don’t work for Spotify and I want to get to the interesting stuff.
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          In the remainder of this article I will be leveraging
          &#xD;
    &lt;a href="https://spotipy.readthedocs.io/en/2.22.1/#" target="_blank"&gt;&#xD;
      &lt;font&gt;&#xD;
        
            Spotipy
           &#xD;
      &lt;/font&gt;&#xD;
    &lt;/a&gt;&#xD;
    
          , an open-source library that lets Python developers access Spotify’s Web API.
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;i&gt;&#xD;
      
           NOTE
          &#xD;
    &lt;/i&gt;&#xD;
    
          : At the time of writing, Spotipy’s active version was 2.22.1; later versions may not have all of the same functionality available.
         &#xD;
  &lt;/div&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;b&gt;&#xD;
        
           Creating a Playlist
          &#xD;
      &lt;/b&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           The first step is one of the easier things you can do with this library: creating a new playlist. We’ll use this playlist to permanently hold all of the songs that get recommended to me (or at least until I decide what to do with them). For this you can use the user_playlist_create() function, passing in a few parameters such as the playlist name and whether you want it to be public or collaborative.
           &#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
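A minimal sketch of this step, assuming `sp` is an already-authenticated spotipy.Spotify client with the playlist-modify scope; the helper name and playlist details are my own:

```python
def create_archive_playlist(sp, user_id, name="Discover Archive"):
    # sp is an authenticated spotipy.Spotify client.
    # user_playlist_create returns the new playlist as a dict.
    playlist = sp.user_playlist_create(
        user=user_id,
        name=name,
        public=False,
        collaborative=False,
        description="Tracks rescued from expiring curated playlists",
    )
    return playlist["id"]
```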
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;b&gt;&#xD;
        
           Accessing Curated Playlists
          &#xD;
      &lt;/b&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The next step is to pull a curated playlist and get a list of the tracks it contains. For this we will use Spotipy’s search() function, which takes a query string and the type of entity you are searching for. The result comes back as JSON, so you’ll have to unpack it to get to the details.
           &#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
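A sketch of this step under the same assumptions, also using Spotipy's playlist_items() to pull the tracks once the playlist has been found:

```python
def get_playlist_track_ids(sp, playlist_name):
    # sp is an authenticated spotipy.Spotify client.
    # search() returns nested JSON; drill down to the first matching playlist.
    result = sp.search(q=playlist_name, type="playlist", limit=1)
    playlist_id = result["playlists"]["items"][0]["id"]
    # Each returned item wraps its track object, which carries the track ID.
    tracks = sp.playlist_items(playlist_id)
    return [item["track"]["id"] for item in tracks["items"]]
```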
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;b&gt;&#xD;
        
           Adding Tracks to Playlist
          &#xD;
      &lt;/b&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           The final step is to add the tracks to your newly created playlist. For this we will collect the track IDs from the curated playlist into a list, then add them to our playlist using Spotipy’s playlist_add_items() function. Feel free to use my method below:
           &#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
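A sketch of that final loop, assuming the same authenticated client; playlist_add_items() accepts at most 100 tracks per call, so this hypothetical helper adds them in batches:

```python
def archive_tracks(sp, playlist_id, track_ids, batch_size=100):
    # sp is an authenticated spotipy.Spotify client.
    # Send the collected track IDs to the playlist in API-sized chunks.
    for i in range(0, len(track_ids), batch_size):
        sp.playlist_add_items(playlist_id, track_ids[i:i + batch_size])
```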
&lt;/div&gt;</content:encoded>
      <pubDate>Sun, 19 Mar 2023 01:07:47 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/fun-with-spotify-api</guid>
      <g-custom:tags type="string">personal,Python</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/spotify-icon-ios.webp">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/spotify-icon-ios.webp">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>A New Scaling Paradigm from Google</title>
      <link>https://www.williammcnamara3.com/a-new-scaling-paradigm-from-google</link>
      <description />
      <content:encoded>&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Google's DeepMind researchers have proposed a new framework for scaling AI models that could be a game changer for the space.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h2&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Moore’s Law
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h2&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Generally, scaling laws predict continued improvement in model quality as we scale up the computational budget (e.g., bigger models or more data). OpenAI investigated the
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://arxiv.org/abs/2001.08361" target="_blank"&gt;&#xD;
      
           scaling laws of Transformer language models
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            a few years ago and showed that scaling laws are predictive of future performance. Their findings showed that
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           performance is a function of data size, number of parameters, and compute size
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      
           .
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           More specifically, the experiments revealed that the test loss follows a power law with respect to the model size, dataset size, and compute used for training, spanning trends over seven orders of magnitude. This suggests that the relationships between these variables can be described by simple equations, which can be used to optimize training configurations for large language models. Additionally, the experiments indicate that other architectural details, such as network width or depth, have minimal effects within a wide range.
           &#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Based on the experiments and derived equations, larger models are significantly more sample efficient. In other words, optimal compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
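To illustrate the shape of such a power law, here is a toy sketch; the constants are purely illustrative, not the paper's fitted values:

```python
def power_law_loss(n_params, n_c=1e13, alpha=0.07):
    """Toy scaling law: test loss falls as a power of model size.
    n_c and alpha are illustrative stand-ins for empirically fitted constants."""
    return (n_c / n_params) ** alpha
```

Scaling the model up by a fixed factor gives a predictable, diminishing reduction in loss, which is what makes these laws useful for planning training runs.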
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2026-01-27+at+8.08.20-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Since the publication of that Scaling Laws paper, there has been significant interest in scaling up language models.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://openai.com/blog/gpt-3-apps/" target="_blank"&gt;&#xD;
      
           GPT-3
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            was one of the state-of-the-art models in 2020. It was 100 times larger than GPT/GPT-2, with 175 billion parameters. Due to its size, GPT-3 exhibits unprecedented capabilities in few-shot and zero-shot learning.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           The more examples you give the model, the better its performance will be. And the larger the model, the better its performance gets
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      
           .
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h2&gt;&#xD;
    &lt;span&gt;&#xD;
      
           DeepMind's Discovery
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h2&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Last year, DeepMind proposed
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://arxiv.org/abs/2203.15556" target="_blank"&gt;&#xD;
      
           the "Chinchilla" scaling laws
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            to create compute-optimal models.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           This is a more accurate scaling law formula than the original one proposed by OpenAI
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      
           .
           &#xD;
      &lt;br/&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;ul&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        
            They trained over 400 language models with 70 million to 16 billion parameters on 5 billion to 500 billion tokens. By predicting the optimal amount of data given the number of model parameters, they derived formulas for the model and training set size. Most large language models are "undertrained," meaning they haven't seen enough data.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        &lt;span&gt;&#xD;
          
             To verify this, they trained another large model,
            &#xD;
        &lt;/span&gt;&#xD;
      &lt;/span&gt;&#xD;
      &lt;a href="https://arxiv.org/abs/2112.11446" target="_blank"&gt;&#xD;
        
            Gopher
           &#xD;
      &lt;/a&gt;&#xD;
      &lt;span&gt;&#xD;
        
            , with 280 billion parameters and 300 billion tokens. With Chinchilla, they reduced the number of parameters to 70 billion while increasing data fourfold to 1.4 trillion tokens. Despite fewer parameters, Chinchilla exceeded Gopher's performance, suggesting that model size and training tokens are equally important.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
  &lt;/ul&gt;&#xD;
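The Chinchilla result is often summarized by a rough rule of thumb of about 20 training tokens per model parameter. As a minimal sketch (the constant is an approximation of the paper's fitted law, not an exact formula, and the function name is mine):

```python
# Rough Chinchilla rule of thumb: compute-optimal training uses
# roughly 20 tokens per model parameter (an approximation of the
# paper's fitted scaling law, not an exact formula).
TOKENS_PER_PARAM = 20

def compute_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal number of training tokens."""
    return TOKENS_PER_PARAM * n_params

# Chinchilla: 70B parameters -> 1.4T tokens, matching the paper.
print(compute_optimal_tokens(70e9))   # 1.4e12

# Gopher trained 280B parameters on only 300B tokens -- far fewer
# than this heuristic suggests, i.e. "undertrained".
print(compute_optimal_tokens(280e9))  # 5.6e12
```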
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;a href="https://arxiv.org/abs/2106.04560" target="_blank"&gt;&#xD;
      
           Scaling Vision Transformers
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            from Google shows that
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           scaling laws apply not only to NLP tasks but also to computer-vision tasks
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      
           . The authors conducted experiments with Vision Transformer models ranging from 5 million to 2 billion parameters, datasets ranging from 1 million to 3 billion training images, and compute budgets ranging from less than 1 TPUv3 core day to more than 10,000 core days. Their findings show that simultaneously scaling total compute and model size is effective. Increasing a model's size when additional compute is available is optimal.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h2&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Emergent Abilities of Large Language Models
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h2&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           ‍Google recently published an important paper titled "
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://arxiv.org/pdf/2206.07682.pdf" target="_blank"&gt;&#xD;
      
           Emergent Abilities of Large Language Models
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           ," which explores the emergent abilities that are present in larger models but not in smaller ones. The paper examines research that analyzes the influence of scale, comparing models of different sizes trained with varying computational resources. For many tasks, the behavior of the model either predictably grows with scale or surges unpredictably from random performance to above random at a specific scale threshold (for instance, more than 70 billion parameters).
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2026-01-27+at+8.09.04-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           ‍
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Since the formal and empirical analysis of scaling laws, many more large language models (LLMs) have been released. These models have achieved state-of-the-art few-shot results on many tasks by scaling model size, using sparsely activated modules, and training on larger datasets from more diverse sources. Notable examples include
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://arxiv.org/abs/1909.08053" target="_blank"&gt;&#xD;
      
           Megatron-LM
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            (8.3B params),
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html" target="_blank"&gt;&#xD;
      
           GLaM
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            (64B params),
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://ai.googleblog.com/2022/01/lamda-towards-safe-grounded-and-high.html" target="_blank"&gt;&#xD;
      
           LaMDA
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            (137B params),
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://arxiv.org/abs/2201.11990" target="_blank"&gt;&#xD;
      
           Megatron-Turing NLG
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            (530B params), and
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html" target="_blank"&gt;&#xD;
      
           PaLM
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      
           (540B params).
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           These discoveries will continue to grow the field and deliver ever more impressive results. I'm especially excited to see how this scaling will better equip large models for scientific tasks like image labeling, genome sequencing, and protein folding.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Tue, 17 Jan 2023 15:14:18 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/a-new-scaling-paradigm-from-google</guid>
      <g-custom:tags type="string">classifier,classification,machine learning,professional</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot-2026-01-27-at-8.09.04-PM.png">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2026-01-27+at+8.09.04-PM.png">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>Evolutionary feature engineering</title>
      <link>https://www.williammcnamara3.com/evolutionary-feature-engineering</link>
      <description>Evolutionary strategies for feature engineering</description>
      <content:encoded>&lt;h3&gt;&#xD;
  
         Feature engineering is an important step in training any classification model, but can I train my model to iteratively create its own features using evolutionary strategies? Hmm....
        &#xD;
&lt;/h3&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           A dataset contains different samples from a population, and each sample carries several features that distinguish it from every other sample. Those features can be used to train and evaluate a machine learning model. Often, however, the original features are not enough to make an accurate prediction, so new features are needed to build a better and more accurate model. In this post, I will show you how to create new features from a dataset for a classification task.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The data
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The data consists of several features derived from audio files spanning two genres of music: pop and classical. Counting the labels of each genre shows that the dataset is balanced.
           &#xD;
      &lt;span&gt;&#xD;
        
            ﻿
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_5JvtSrSLPDA5UHllEirnbw.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Displaying the histogram of each feature in the dataset, we can observe that features like rmse, spectral bandwidth, and mfcc2, to mention just a few, appear to have two distinct distributions of values. That characteristic could be useful for the classification task.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_XTeaadYrweS5YdZF8str7A.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The correlation among features shows that tempo, beats, mfcc18, and mfcc2 have the lowest correlation with the other features in the dataset.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_V6erAkR5rgzFEjXLcnjjWQ.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Now that we have a little bit of knowledge of the data set, we can begin to develop the evolutionary method for feature engineering.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The Feature strategy
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           As the dataset contains only numerical features, we can use a simple ratio as the feature skeleton; that function computes each new feature from two existing ones.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
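A minimal sketch of such a ratio skeleton (the function name, NumPy usage, and the epsilon guard against division by zero are my assumptions, not the post's exact code):

```python
import numpy as np

def ratio_feature(data: np.ndarray, i: int, j: int) -> np.ndarray:
    """New feature: the ratio of column i to column j.

    A tiny epsilon guards against division by zero.
    """
    eps = 1e-12
    return data[:, i] / (data[:, j] + eps)

# Example: two samples, three original features.
X = np.array([[1.0, 2.0, 4.0],
              [3.0, 6.0, 2.0]])
print(ratio_feature(X, 0, 1))  # both ratios are approximately 0.5
```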
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Since each ratio needs two features, the evolutionary method tries to find the best combination of column indexes from which to compute them. For that, we create a function that generates a list of variable-size lists filled with random column indexes; because the lists vary in size, the search can also find a smaller model with better performance. We then split each index list into pairs of values to calculate the new features.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
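A sketch of the index generator (the function names, the even-length convention that makes pairing trivial, and the size cap are my assumptions):

```python
import random

def random_index(n_columns: int, max_pairs: int = 8) -> list:
    """A variable-length list of column indexes. The length is even,
    so the list can always be split into (i, j) pairs for ratios."""
    n_pairs = random.randint(1, max_pairs)
    return [random.randrange(n_columns) for _ in range(2 * n_pairs)]

def to_pairs(index: list) -> list:
    """Turn [a, b, c, d, ...] into [(a, b), (c, d), ...]."""
    return list(zip(index[::2], index[1::2]))

# A population of candidate index lists over, say, 28 columns.
population = [random_index(n_columns=28) for _ in range(20)]
print(to_pairs(population[0]))
```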
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           With the index created, we calculate the new features and process them for training. We fit a min-max scaler from the sklearn library and split the new data into training and test sets. Finally, we train a random forest classifier from sklearn using the default parameters and calculate the ROC AUC as the measure of performance.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
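The training-and-scoring step could look roughly like this (a sketch using sklearn defaults; the helper name, split ratio, epsilon guard, and random seeds are my choices, not the post's exact code):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

def evaluate_index(data, labels, pairs):
    """Build the ratio features for the given (i, j) column pairs,
    then train and score a default random forest."""
    # One new feature per pair of original columns.
    feats = np.column_stack(
        [data[:, i] / (data[:, j] + 1e-12) for i, j in pairs])
    # Min-max scale, then split into training and test sets.
    feats = MinMaxScaler().fit_transform(feats)
    X_tr, X_te, y_tr, y_te = train_test_split(
        feats, labels, test_size=0.25, random_state=0)
    # Default random forest, scored by ROC AUC on the test set.
    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

The returned AUC is the fitness value the evolutionary search tries to maximize.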
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           With all the functions in place to train the model and calculate the new features, we can apply evolution strategies to optimize model performance. In this example we use two strategies: mutation and recombination. The mutation strategy follows two rules: if the column index list is above a certain size, a random value is deleted from it; otherwise, a random element in the list is changed. This helps reduce the model size. For the recombination strategy, a random column index list is selected from the population, and a random value from that list is inserted into another list.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           With the strategies defined, we can optimize the model. We first train on a population of column index lists and measure the performance. Then we apply the evolution strategies to the population and measure the performance again. Finally, we update the population, keeping only the index lists that perform better. Each round of applying the evolution strategies is called a generation.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
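Putting the two strategies and the generation loop together, a minimal sketch might look like this (the pair-based representation, the MIN_PAIRS threshold, and the 50/50 choice between strategies are my assumptions; `fitness` stands in for the model-scoring function):

```python
import random

MIN_PAIRS = 2  # assumption: below this size we mutate in place

def mutate(pairs, n_columns):
    """Rule 1: if the index is large enough, delete a random (i, j)
    pair, shrinking the model. Rule 2: otherwise, re-roll one of the
    column indexes in a random pair."""
    child = [tuple(p) for p in pairs]
    if len(child) > MIN_PAIRS:
        child.pop(random.randrange(len(child)))
    else:
        k = random.randrange(len(child))
        i, j = child[k]
        child[k] = (random.randrange(n_columns), j)
    return child

def recombine(population):
    """Insert a random pair taken from one member into a copy of
    another member."""
    donor, target = random.sample(population, 2)
    child = list(target)
    child.insert(random.randrange(len(child) + 1), random.choice(donor))
    return child

def evolve(population, fitness, n_columns, n_generations=20):
    """Elitist update: a child replaces its parent only when it
    scores strictly better. `fitness` maps a list of (i, j) pairs
    to a score, e.g. the ROC AUC of a model trained on the ratios."""
    scores = [fitness(ind) for ind in population]
    for _ in range(n_generations):
        for k, parent in enumerate(population):
            child = (mutate(parent, n_columns) if random.random() < 0.5
                     else recombine(population))
            s = fitness(child)
            if s > scores[k]:
                population[k], scores[k] = child, s
    return population, scores
```

Because a child only replaces its parent when it scores better, the population's scores never decrease across generations.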
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Applying the evolutionary method to the classification task, we find a high-performing model after only five generations. After twenty generations, about one in five trained models performs well. Increasing the number of generations also raises the average performance of the population.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_DGKtpQj6DukS-j4Bx5bPVg.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_gaysxYwwNbwWv1lUjr4L-w.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           You can now define a basic workflow for automated feature engineering using evolutionary strategies, along with some basic techniques for exploratory analysis. The complete code can be found on my GitHub by clicking 
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://github.com/wmac131/evolutionary_feature_engineering" target="_blank"&gt;&#xD;
      
           here
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           . See you in the next one.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Thu, 08 Dec 2022 19:25:19 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/evolutionary-feature-engineering</guid>
      <g-custom:tags type="string">data analysis,classifier,machine learning,Python,professional</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_V6erAkR5rgzFEjXLcnjjWQ.webp">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_V6erAkR5rgzFEjXLcnjjWQ.webp">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>COVID 19 Adaptability: A Review of Analytical Methods</title>
      <link>https://www.williammcnamara3.com/covid-19-adaptability-a-review-of-analytical-methods</link>
      <description>Another recap of analytical methods</description>
      <content:encoded>&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           A virus is a biological agent that can only replicate inside a host. The host can be either a cell inside a multicellular organism or a single cell of a unicellular organism. By a series of mechanisms, the virus hijacks the molecular machinery inside the cell to replicate itself. The specifics of how the virus hijacks the cellular machinery depend on the virus itself. Even so, the main strategy is to mimic cellular components, tricking the cell into thinking it is doing its regular job.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Viruses consist mostly of two components: nucleic acids (DNA or RNA) and proteins. The nucleic acids contain the information needed to synthesize the components of a new viral particle (the proteins), and the viral particle contains structural and non-structural elements. Structural elements make up the main body of the virus; they protect the nucleic acids and aid entry into the host cell. Non-structural elements aid in hijacking the molecular machinery of the host cell.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Once the virus hijacks the machinery, it manipulates it to make copies of the virus and release them into the extracellular space. This allows the virus to spread to different hosts, in the case of a unicellular host, or to disseminate to other cell types and through the entire body in the case of a multicellular host. Eventually, the virus escapes and infects another susceptible host, and the cycle repeats. The release of viral particles often destroys or lyses the cell, resulting in the death of a unicellular organism, or in illness and disease in a multicellular one.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Why the emergency?
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Because of their simplicity, viruses can spread through different hosts with ease and cause illness and disease. The nature of the disease caused by the virus and its transmissibility depend on the virus itself. However, sustained transmission within a human population raises the alarm of a virus with probable pandemic potential.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Viruses that could pose a threat to society due to their epidemic potential are included in a list curated by the World Health Organization. The list contains a subset of known viruses, and it aims to prioritize research into highly pathogenic ones. It also includes "Disease X," referring to an unknown pathogen. According to the WHO, the last prioritization exercise was conducted in 2018 and includes viruses such as Ebola, SARS, and MERS, among others.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The first cases of COVID-19 were diagnosed in Wuhan, the capital of Hubei province in the People's Republic of China, in December 2019. By 7 January 2020, the full sequence of a virus isolated from a patient was already available on different 
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://virological.org/t/novel-2019-coronavirus-genome/319" target="_blank"&gt;&#xD;
      
           databases
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           . By January 30, the WHO had declared a Public Health Emergency of International Concern due to the virus's rapid spread, and on March 11 the WHO declared a pandemic.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           In a couple of months, COVID-19 traveled across different countries and populations. Increases in the number of cases also increase the chance for the virus to adapt to its new host. These two features, cases and adaptation, are the two main sources of information during this pandemic: cases in the form of case counting, and adaptation in the form of genetic surveillance. The following describes a series of analyses that can be applied to case and sequence data sources. It can also be seen as a blueprint for other outbreaks driven by viruses, or as a method to analyze endemic diseases through genomic surveillance.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Cases and time scales
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The rise in cases in different countries led to the development of a series of data dashboards intended to track the course of the pandemic. One of the main metrics was the pandemic peak: the particular day when the number of cases reaches its largest value, followed by a continuous decline. The rise and fall in the number of cases constitutes a pandemic wave. Several waves have been reported, with a spacing between peaks of around six months.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           However, when the data is grouped by latitude, a series of pandemic peaks appears to be shifted by a couple of days. This phenomenon can be seen in the northern hemisphere, while in the southern hemisphere it is hard to tell. Nonetheless, a geographical dependence of pandemic peaks has been present in the news and other information sources: news of a new pandemic wave from a country at a different latitude continues to be a recurrent theme.
           &#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Different environmental variables change with latitude, but one that cycles roughly every six months is sunshine duration. From the end of the year to around mid-year, sunshine duration increases, while during the second half of the year it decreases. These continuous changes create two stationary points of maximal and minimal sunshine. Using sunshine duration as a measure of time results in COVID-19 cases clustering near the stationary points.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_N901KMBLEfGiRnZu.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Dependence on solar features can be investigated further by approximating the solar flux, which represents the amount of solar radiation that reaches a particular location throughout the day. Using custom solar features as a measure of time results in a series of shorter COVID-19 waves. Changes in daily solar flux could reflect temperature changes; custom solar and temperature features also cluster the cases into periods with high case counts.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_yKdb8ZH6U5xxDw1M.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The number of cases and solar-related features appear to be related. Changes in solar radiation through time are known as the solar cycle. The solar cycle has an approximate duration of eleven years and is characterized by changes in solar activity, which increase or decrease the amount of solar radiation that reaches the planet. Other examples of changing solar radiation are the yearly changes in sunshine duration and the daily changes in sunlight. The term solar cycle will here encapsulate all of these phenomena. If there is a relationship between the solar cycle and COVID-19, other aspects of the disease should be correlated with different frequencies of the solar cycle.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            ﻿
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Detection efficiency is also affected by the solar cycle. McNaughton reported a peak efficiency in SARS-CoV-2 PCR detection around the middle of the day, and the authors raised the importance of such timing, as it could impact public health measures. A small report from India by Baruah shows an increase in the mortality of COVID-19 patients around noon; the authors noted the small sample size and the inability to draw an accurate conclusion from the data. Yet official information obtained from death certificates in Mexico shows some synchronization: around 20% of official COVID-19 deaths in 2020 happened near noon.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_jw_7ME6VLrzuhxs-.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Yearly and daily components of the solar cycle show some correlation with periods of high susceptibility to COVID-19. On a longer time scale, the rapid increase in COVID-19 cases happened at a low level of solar activity: the 25th solar cycle started around 2020 with a minimum of solar activity. The Omicron variant, with the alleged hallmark of generating “mild disease,” was detected in South Africa in November 2021. Solar cycle estimation shows that by November 2021 solar activity had entered a period of linear increase, a characteristic shared with low COVID-19 cases on the yearly solar scale. Changes in solar activity could be an indirect measure of susceptibility to COVID-19: changes inside the host that correlate with solar activity could ease the infection process through an unknown mechanism.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Dependence on different frequency components of the solar cycle can lead to practical advice. Avoiding large crowds or closed spaces at times near the stationary points of the yearly solar cycle could lower the odds of infection, while on the daily solar cycle, nighttime activities might be safer than daytime ones. These recommendations could be corroborated by measuring viral load and viral shedding throughout the day, which could also help develop more effective treatments: if the virus follows a specific schedule for its synthesis, the timing of antiviral treatment can be optimized for the best benefit.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Changes in susceptibility through the year can also support better long-term assessments of vaccine efficacy or natural immunity: comparisons are most meaningful between periods with similar levels of susceptibility.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Sequences and how to represent them
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Rapid isolation and sequencing of SARS-Cov2 led to a particular discovery: the fast adaptation of SARS-Cov2. This started the continuous surveillance of SARS-Cov2 and the detection of different variants. However, working with SARS-Cov2 sequences in large numbers is a challenging task. The average SARS-Cov2 genome is around 30,000 bases, and as the number of isolated sequences grew, the number of pairwise comparisons grew at an even faster rate.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Analysis of biological sequences relies mostly on pairwise sequence alignment and multiple sequence alignment. Multiple sequence alignment is an NP-hard problem, so many heuristics have been developed to find approximate solutions. Nonetheless, the rising number of sequences constrained the ability to perform large analyses, particularly because of the vast amounts of computational resources required.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Computationally, sequences can be represented as a continuous string in which each letter represents a single nucleotide. This simple representation is the most widely used way to share biological sequences. Another option is to define an encoding. One-hot encoding is a popular technique for encoding categorical information in data science: each position in a biological sequence becomes a vector whose length equals the number of nucleotide types, with a one marking the nucleotide present and zeros elsewhere. This encoding results in a 2D array of size (4 × sequence length).
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
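As a minimal sketch of this encoding (not the exact code used in the analysis), a nucleotide sequence can be one-hot encoded into a 4 × length NumPy array; the A, C, G, T row order here is an arbitrary choice:

```python
import numpy as np

def one_hot_encode(seq):
    """One-hot encode a nucleotide string into a (4, len(seq)) array.

    Row order A, C, G, T is an arbitrary choice for this sketch."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    encoded = np.zeros((4, len(seq)), dtype=np.int8)
    for position, base in enumerate(seq):
        encoded[index[base], position] = 1
    return encoded

print(one_hot_encode("ACGT"))  # 4x4 array with a single 1 per column
```

Each column sums to one, so the total count of ones equals the sequence length.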
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           One-hot encoding has already been used to train machine-learning models that classify new SARS-Cov2 variants; classification of SARS-Cov2 pangolin lineages uses one-hot encoded sequences. Representing sequences as a continuous string or with one-hot encoding preserves the structure and order of the sequence, which makes these the most widely used sequence representations.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           A compressed sequence representation can also be obtained by splitting the sequence into a series of fragments and using the frequency of each fragment as a small-size representation of the sequence. To extract as much variability as possible, the sequence is fragmented in a sliding manner, so that each k-size fragment overlaps the next by k-1 characters or nucleotides.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
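A minimal sketch of this sliding fragmentation on a toy sequence, assuming plain Python:

```python
from collections import Counter

def kmer_frequencies(seq, k):
    """Overlapping k-mer frequencies: consecutive windows share k - 1 characters."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

print(kmer_frequencies("ACGACG", 3))  # {'ACG': 0.5, 'CGA': 0.25, 'GAC': 0.25}
```

A sequence of length n yields n - k + 1 overlapping windows, so even a 30,000-base genome compresses to at most 4^k frequency values.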
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Stacking the fragment frequencies from k=1 to k=5 results in a series of high-dimensional datasets. Pairwise comparisons can be done on these datasets, but the number of comparisons grows with the number of samples. To look for patterns more easily, a dimensionality reduction technique is applied, yielding a low-dimensional dataset that can be readily analyzed. Applying a common technique like PCA to the dataset results in a series of clusters.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
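A hedged sketch of the stacking-plus-PCA step, using toy stand-in sequences and an SVD-based PCA rather than whatever library the original analysis used (the post stacks k = 1..5; k = 1..3 here keeps the toy small):

```python
import numpy as np
from collections import Counter

def kmer_vector(seq, kmax=3):
    """Concatenate normalized k-mer frequencies for k = 1..kmax."""
    feats = {}
    for k in range(1, kmax + 1):
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        total = sum(counts.values())
        for kmer, n in counts.items():
            feats[kmer] = n / total
    return feats

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

seqs = ["ACGTACGT", "ACGTTTTT", "GGGGACGT"]   # toy stand-ins for full genomes
vecs = [kmer_vector(s) for s in seqs]
vocab = sorted(set().union(*vecs))            # shared feature ordering
X = np.array([[v.get(k, 0.0) for k in vocab] for v in vecs])
coords = pca_project(X)                       # low-dimensional view for clustering
print(coords.shape)
```

With real data, each row of `X` would be one surveillance genome, and the 2D `coords` are what gets plotted and inspected for clusters.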
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_5LfmZTVrd7ftVaGE.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This simple observation points towards a particular pattern inside the SARS-Cov2 sequences. However, the PCA projection discards almost all of the variability in the original dataset. Variational autoencoders (VAEs) are deep learning models that also perform dimensionality reduction, with the additional property that the low-dimensional space contains a learned representation whose axes each capture a pattern inside the data. VAEs are also generative models and can reconstruct the original dataset, so the model can predict changes in the frequency of fragments.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_21NAD1jJIZk0LXHx.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Trained VAEs cluster the SARS-Cov2 sequences. The similarity in color points towards a time-wise pattern inside the data, and the VAE's ability to retrieve temporal information is independent of the isolation year.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_kbhPK-TdCZ3cr-xe.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Traditional time series techniques can also recover temporal patterns from fragment frequency data. Seasonal and trend components can be retrieved by calculating the moving average of the frequency data over time. Although the match is not perfect, it does point towards a fixed upward trend for some nucleotides or nucleotide combinations. This characteristic could aid the design of seasonal treatments or a long-term treatment strategy: as the amount of a nucleotide in the sequence increases, susceptibility to the corresponding nucleotide analogs should increase. In the early days of the pandemic, the use of remdesivir, an adenine analog, was put into question several times (Jiang et al.), while more recent reports show a better picture for remdesivir treatment (Gottlieb et al.). Seasonal and trend components in adenine content could explain such discrepancies.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
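The exact decomposition procedure is not specified in the post, so as an illustrative version only, a classical moving-average decomposition on synthetic data looks like this:

```python
import numpy as np

def trend_and_seasonal(series, period):
    """Classical additive decomposition sketch: moving-average trend, then the
    mean deviation from trend at each phase of the period as the seasonal part."""
    kernel = np.ones(period) / period
    trend = np.convolve(series, kernel, mode="same")
    detrended = series - trend
    seasonal = np.array([detrended[i::period].mean() for i in range(period)])
    return trend, seasonal

# Synthetic monthly-like series: slow upward drift plus a 12-step cycle,
# standing in for a nucleotide frequency tracked over time.
t = np.arange(48, dtype=float)
series = 0.001 * t + 0.05 * np.sin(2 * np.pi * t / 12)
trend, seasonal = trend_and_seasonal(series, 12)
print(trend.shape, seasonal.shape)
```

Applied to adenine content over time, the `trend` array is what would reveal the fixed upward direction mentioned above.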
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_pZp0mYAaJIGJv_K1.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Furthermore, small fragments can also be used to classify SARS-Cov2 lineages. The top 30 most reported variants can be classified from fragment frequency information with around 95% accuracy. This small model offers an alternative for variant classification and can easily scale to a much larger number of sequences and variants.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
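The post does not specify the classifier, so purely as an illustrative stand-in, a nearest-centroid model over toy k-mer profiles shows the general shape of the approach (the lineage sequences and labels here are invented):

```python
import numpy as np
from collections import Counter

KMERS = ["".join((a, b)) for a in "ACGT" for b in "ACGT"]  # 16 dinucleotides

def kmer_vec(seq, k=2):
    """Normalized overlapping 2-mer frequency vector in a fixed feature order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return np.array([counts.get(m, 0) / total for m in KMERS])

def fit_centroids(X, y):
    """One mean k-mer profile per lineage label."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict(centroids, x):
    """Assign the lineage whose mean profile is closest."""
    return min(centroids, key=lambda label: np.linalg.norm(x - centroids[label]))

# Toy 'lineages' with distinct composition biases.
X = np.array([kmer_vec(s) for s in ["ACACACAC", "ACACACGT", "GTGTGTGT", "GTGTGTAC"]])
y = np.array(["A", "A", "B", "B"])
centroids = fit_centroids(X, y)
print(predict(centroids, kmer_vec("ACACACAA")))  # → A
```

Because each new sequence only costs one k-mer count plus a handful of distance computations, this style of model scales easily to many more sequences and variants.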
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_M3-rlE_bkRXRxkbo.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Changes in fragment frequency can also point toward mutational hot spots: periods where the SARS-Cov2 genome composition moves from one cluster to another, since the number of new variants isolated near those transition points increases. Determining those transition points in different geographical regions could help establish public health measures or support personal risk assessment.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_Ol9NcKa9FLQ8SaYE.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Even with the different patterns that fragment frequency alone can reveal, the specific location and impact of changes inside the SARS-Cov2 genome are lost, so specific amino acid changes and immune evasion cannot currently be addressed by fragment frequency. However, the overlapping fragmentation resembles a De Bruijn graph, a kind of graph used for genome assembly. Extending the graph idea by sub-fragmenting the original fragments leads to a collection of graphs that make up the sequence. The overlap in the sub-fragmentation yields a series of graphs that encode the sequence by connections rather than nodes.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_gIvIyH5aHfqqsYhC2pzgDg.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           If the sequence is split into 4-nucleotide fragments, and each fragment is sub-fragmented into two non-overlapping 2-nucleotide halves, there are 16 possible dinucleotides. Each fragment then defines an edge between two dinucleotides, and the resulting graph can be represented as a 16x16 adjacency matrix. Splitting the sequence into 16 non-overlapping segments, one matrix per segment, yields a 16x16x16 array. This new encoding captures part of the location information and, by expanding the number of dimensions, brings distant parts of the sequence closer together.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
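A sketch of this graph encoding as I read the description above; the segment count and the edge definition (first dinucleotide half of each 4-mer pointing to its second half) are my interpretation, not code from the original analysis:

```python
import numpy as np
from itertools import product

DINUC = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=2))}  # 16 dinucleotides

def graph_encode(seq, segments=16):
    """Stack one 16x16 adjacency matrix per segment: every overlapping 4-mer in
    a segment adds an edge from its first dinucleotide half to its second."""
    out = np.zeros((segments, 16, 16))
    step = len(seq) // segments
    for s in range(segments):
        chunk = seq[s * step:(s + 1) * step]
        for i in range(len(chunk) - 3):
            head, tail = chunk[i:i + 2], chunk[i + 2:i + 4]
            out[s, DINUC[head], DINUC[tail]] += 1
    return out

enc = graph_encode("ACGT" * 40)   # toy 160-base sequence
print(enc.shape)  # → (16, 16, 16)
```

The segment axis keeps coarse positional information that plain fragment frequencies throw away.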
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_4E0APqoq3qAuzsisnIxiVA.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           VAEs trained with the new encoding retrieve a similar temporal pattern, but they also reveal the predominant location of the changes. In particular, the SARS-Cov2 genomic regions coding for structural elements contain the most information regarding temporal adaptation.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This alternative encoding also suggests that expanding the dimensions of the encoding can produce a more efficient data structure, since the dimensional expansion brings together fragments that are far apart in the sequence. To test this idea, the full SARS-Cov2 sequence is one-hot encoded and reshaped into a 32x32x32x4 array.
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
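The reshaping step can be sketched as follows; the random stand-in genome and the zero padding out to 32³ positions are assumptions made for illustration:

```python
import numpy as np

genome_len = 29903                           # reference SARS-Cov2 genome length
rng = np.random.default_rng(0)
bases = rng.integers(0, 4, size=genome_len)  # random stand-in for the real bases

# One-hot encode into 32**3 = 32768 rows, padding the tail with zeros so the
# array reshapes cleanly into the 32x32x32x4 volume described above.
one_hot = np.zeros((32 ** 3, 4), dtype=np.int8)
one_hot[np.arange(genome_len), bases] = 1

volume = one_hot.reshape(32, 32, 32, 4)
print(volume.shape)  # → (32, 32, 32, 4)
```

The reshape costs nothing computationally; it only changes which positions end up adjacent along each axis of the volume.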
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_Dl_XwjGb_H3bZK_oUCtBpg.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Ordering the samples by different time measures results in a pattern that appears cyclical at the larger time scale. Solar features did not show a meaningful pattern, but the number of days since the Wuhan outbreak shows some order. This specific order might hint that the SARS-Cov2 genome is reverting towards an arrangement like the ancestral strain. If this trend is real, it might explain the disappearance of pandemic diseases: reversion to an already-known pathogen, whether through infection or vaccination immunity, results in more efficient removal by the host.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_3BWEGgeOfkKH50ZNr70lGg.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Confidence in full-length sequence prediction drops as the sequence approaches more variable regions. Low probabilities show that the model cannot confidently predict the most likely nucleotide, which could be due to poor model performance or the need for additional information. Expanding the dimensionality of the encoding could help bring distant sequence regions closer together, or could mimic the 3D structure of the genome, making it easier for the network to understand the data.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Why does this happen?
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The ability to find a series of fixed patterns inside the different sequence encodings suggests that adaptation of the SARS-Cov2 genome follows a deterministic mechanism, with random mutations being minimal or resulting in minimal changes to the genome. Moreover, the environmental conditions the virus tracks also show a deterministic pattern.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Bringing the host-virus interaction together into a plausible molecular mechanism would provide information for designing better treatments. Yet the virus-host interaction is not the only factor involved.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The host and the virus
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           One possible explanation for the fast spread of SARS-Cov2 is high susceptibility derived from changes in the solar cycle. The fast spread of SARS-Cov2 aligns with low solar activity in 2019–2020, and NASA data also shows minimum radiation at biologically relevant wavelengths. If changes in solar activity correlate with susceptibility to respiratory diseases, then different outbreaks should also follow solar activity. Historical records show that outbreaks of respiratory diseases occur near stationary points in the solar cycle.
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            ﻿
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_5L4a7WxRNKwAisab.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Susceptibility points towards a faulty or poor immunological response driven by environmental conditions. How the environment, and specifically solar radiation, can influence the immune system is anyone's guess. But biologically relevant wavelengths are also used in the lab to detect and quantify DNA/RNA and proteins, the main components of viruses. Thus, if the host has some detection mechanism that relies on specific wavelengths to detect pathogenic organisms, it would struggle under both low- and high-radiation conditions.
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            ﻿
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_RGoPjpty6CiM6-d6.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Another possibility is a resource-constrained mechanism. Upon infection, one of the host's first reactions is the degradation of nucleotides, which lowers the resources available to the virus and reduces the number of viral particles synthesized.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_BHnB8CPvKMfvnKcY.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Specific environmental conditions upregulate a series of genes with a nucleotide content similar to SARS-Cov2. This secures the nucleotide resources for RNA synthesis, and also the amino acids for protein synthesis. For such a mechanism to be possible, SARS-Cov2 would need to sense the available genes or nucleotides inside the cell. One possibility is to use the RNA structure itself as a logical gate, as secondary structures inside DNA or RNA can behave 
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://www.youtube.com/watch?v=GgPdRKqcRTE&amp;amp;t=235s" target="_blank"&gt;&#xD;
      
           logically
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           . Hybridization between RNA regions and small fragments of RNA will open or linearize the RNA. The combination of the different hybridization fragments can result in an if-and-only-if behavior, where RNA is linearized only when all the fragments are available.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Hybridization between the SARS-Cov2 genome and host mRNA or mRNA fragments could work as a gene detection system. It could also be responsible for the development of autoimmunity: selecting fragments that better match between viral and host RNAs yields highly similar RNA, so viral proteins in turn carry fragments with high similarity to the host.
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            ﻿
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_Ea8fNzxzFe5_oJ1QDj0AWg.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Comparing the reference genome with the SARS-Cov2 sequence retrieves a series of genes with compositional similarity to SARS-Cov2, particularly a subset of genes involved in the Vitamin D pathway, a vitamin synthesized under solar radiation whose low levels correlate with severe COVID-19 disease. If the host uses a long-term starvation strategy, then highly similar genes could be downregulated, leading to a wide range of secondary effects.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Neither mechanism excludes the other, and a combination of both is also plausible. If SARS-Cov2 can infect under conditions of high host susceptibility, then it will adapt to better identify such conditions.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           What about the environment?
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Another option is an external worldwide factor. Small changes in temperature can pass relatively unnoticed by us, but as the scale goes down the impact could be catastrophic. For example, a degree of latitude on the surface of the earth spans around 100 km, and the distance is even greater for satellites or the International Space Station.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Changes in the environment of the microscopic world could already have driven to extinction the natural predators of viruses: microorganisms with which we had a symbiotic relationship we were not even aware of. Other non-host organisms are also known to remove viruses; in marine ecosystems, sponges play an important role in virus removal, and other organisms could have a similar role in other ecosystems. Urbanization removes most of the natural environment, and with it the organisms in charge of removing viruses, increasing the overall viral load in urban environments.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Environmental changes could also have a small impact on the way the immune system works, as it relies on well-defined environmental patterns to be fully competent. Molecular cues used to measure time might start to drift relative to one another. The immune system's malleability lets it carry on and adapt, but cracks might be showing up, and pathogens passing through those gaps might be one reason herd immunity is not being achieved even at high rates of infection.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h2&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Final remarks
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h2&gt;&#xD;
  &lt;h2&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/h2&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Some background
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The preceding is an attempt to provide a comprehensive and logical analysis of two of the most common phenomena of the COVID-19 pandemic: the rise and fall of COVID-19 cases, and the discovery of different variants over time. The order in which the ideas were found or proposed differs from the order in this post. Sequence analysis was the primary driver, and you can see how the analysis evolved over time, with other pieces of information added across the different COVID-19-related posts.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The main findings can be summarized in two: an environmental correlation, and a sort of molecular clock inside the viral sequence. Although the accuracy of full-sequence prediction remains low, current computational resources constrain the kind of search I am able to do. Nevertheless, all the analysis can be done with fairly minimal resources, which enables the use and customization of such models for local conditions. Variants could also be geographically constrained, meaning that specific SARS-Cov2 genome rearrangements could be specific to certain locations.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Even with a large number of sequences, most of the data is from the USA and may not reflect the full range of possible SARS-Cov2 sub-variants. Data leakage is also likely, and without prior knowledge of the molecular clock it is hard to address. If two sequences with equal patterns are isolated a couple of days apart and end up in the training and validation datasets respectively, the validation loss could drop, giving a false sense of generalization.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The main method used to prevent data leakage was to shuffle the different ids and check whether, in the different folds, the individual batches ended up sorted by the day of the year; no such leakage was found. Also, ordering the sequences by days since the outbreak results in a similar temporal pattern.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
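Beyond shuffling ids, one hedged way to guard against this kind of near-duplicate leakage is a time-blocked split with a buffer; the gap width below is an arbitrary illustrative choice, not the check actually used here:

```python
import numpy as np

def time_blocked_split(days, gap=14, frac=0.8):
    """Hold out the latest block of samples and drop a `gap`-day buffer so
    near-identical sequences isolated days apart cannot straddle the split."""
    order = np.argsort(days)
    cut = int(frac * len(order))
    cutoff_day = days[order[cut]]
    train = [i for i in order if days[i] <= cutoff_day - gap]
    val = list(order[cut:])
    return train, val

days = np.arange(100)           # toy collection days, one sample per day
train, val = time_blocked_split(days)
print(len(train), len(val))     # buffer days fall in neither set
```

The buffer guarantees every validation sample is at least `gap` days newer than every training sample, which is a stricter condition than checking batch ordering after the fact.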
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Other applications
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Small-fragment frequencies, or k-mers, are the sequence analog of the n-gram model and are currently used as a method to encode large sequences. Seasonality analysis and variant classification are two applications of those encodings, but they might not be the only ones.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Mean SARS-Cov2 composition, with sunshine duration as the measure of time, remains fairly constant. Sunshine duration might work as a control variable for SARS-Cov2 genome adaptation: mechanisms inside the virus could measure the current time and then adapt how the sequence is constructed. The SARS-Cov2 genome might sense the fragments available at the current time, other viral components might measure the discrepancies between the resources and the genome, and this information could then be passed to the component that synthesizes the final genome. Such a mechanism would ensure the generation of genomes better adapted to the host. Applying control theory to the different components of SARS-Cov2 can help us better understand how all the parts work together.
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            ﻿
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_oHkuptvLqCisMfS3.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Genomic surveillance sequences can also be treated as independent snapshots of different disease states; genomic surveillance then becomes a series of snapshots of how the disease progresses in the upper respiratory tract. The VAE architecture can be modified to also retrieve a dynamical system that describes the interaction.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_68RnsE_3t7WiPIzC.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The dynamical VAE retrieves a dynamical system with a transitory phase whose duration is similar to the incubation period. This does not mean the course of the disease can only be the one from the model, as it represents the infection in a single tissue or system, but it shows that the models can be extended to other tasks. Extending the different encodings to other tasks could help improve pandemic preparedness. The application of computational approaches to the study of viral diseases could also remove or lower the amount of wet-lab experimentation, adding a new layer of safety before performing dangerous experiments.
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            ﻿
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_c6e0S88KanSBSqfA.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Unsupervised clustering can also be applied to different viruses. The use of small-fragment frequencies appears to be more effective with single-strand viruses, while viruses with segmented genomes, such as influenza viruses, collapse into a single cluster. However, it is hard to determine whether there is a temporal pattern inside the sequences, as most of the metadata is lost.
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            ﻿
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_iLntUSnBjxvFXERm.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The use of small-fragment frequencies can also be applied to experimental data, particularly single-cell expression data. Multiplying the gene-count matrix by the fragment frequencies of each gene in the experiment yields a relative quantification of the number of fragments in each cell. Applying this technique to the data from “
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;strong&gt;&#xD;
      
           Open Problems — Multimodal Single-Cell Integration” 
          &#xD;
    &lt;/strong&gt;&#xD;
    &lt;span&gt;&#xD;
      
           shows that the cells end up ordered by the observation day.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
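The matrix multiplication described above, sketched with random toy matrices in place of the real single-cell count matrix and the per-gene fragment frequencies computed from reference transcripts:

```python
import numpy as np

rng = np.random.default_rng(1)
counts = rng.integers(0, 10, size=(3, 5)).astype(float)      # cells x genes (toy)
gene_kmer_freq = rng.random((5, 16))                         # genes x dinucleotide freqs (toy)
gene_kmer_freq /= gene_kmer_freq.sum(axis=1, keepdims=True)  # each gene's freqs sum to 1

# cells x fragments: each cell's expression-weighted fragment composition
cell_fragments = counts @ gene_kmer_freq
print(cell_fragments.shape)  # → (3, 16)
```

Because each gene's frequency row sums to one, each cell's row in `cell_fragments` sums to that cell's total counts, so the result is a relative rather than absolute quantification.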
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_mtQhvLtTTian4qh0.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Also, an imperfect match is obtained for immune cell classification on data from the immune cell census dataset. Applied to other single-cell datasets, the technique generates several clusters with no clear meaning. Hence this technique might not be appropriate for single-cell data, or the number of measured genes might not be enough to describe the cell. Also, sequences are taken from the reference genome, and SNPs exclusive to each sample could change the fragment frequency.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_v47aeqxENSgaOm-q.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Susceptibility and adaptation
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           In the context of case data, susceptibility refers to a condition within the host that eases the infection process. Susceptibility sets a baseline risk of infection and can rise with specific host conditions. Host behavior, such as an unhealthy lifestyle, can raise susceptibility, but it cannot lower it below the baseline.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           In the case of sequence data, adaptation refers to how the virus adjusts its sequence to match host conditions, while evolution is more of a long-term trajectory that requires more information to understand accurately. The average course of a COVID-19 infection is around 5 days, corresponding to roughly 12 viral generations, the equivalent of about 300 human years. That is more time than the age of the oldest democracy in the world, so planning for a more benign habitation of the host seems far-fetched. Nevertheless, I do think it is more plausible that old SARS-CoV-2 complains that millennial SARS-CoV-2 eats too much avocado toast.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Perceived viral attenuation could be the result of a resource crisis (nucleotides, amino acids, energy). Cells and tissues express a subset of genes specific to their function, which leads to a selection of codons by the host, a phenomenon known as codon usage bias. The virus mimics the host genomic components by modifying its genome to match the codon bias of the target cells. Those changes increase the infectivity of the virus and are silent to protein-based surveillance. Codon bias optimization generates synonymous sequences for the same protein that better match the available resources; synthesis becomes faster and more efficient, but the resources are also consumed at a faster rate.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
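Codon usage can be quantified directly from a coding sequence; the sketch below (with made-up mini-sequences) counts in-frame codons and returns their relative frequencies:

```python
from collections import Counter

def codon_usage(seq):
    """Relative frequency of each codon in an in-frame coding sequence."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    total = len(codons)
    return {c: n / total for c, n in Counter(codons).items()}

# Hypothetical example: two synonymous sequences for the same peptide
# (Leu-Leu-Lys) that differ only in codon choice, i.e. in usage bias.
usage_a = codon_usage("CTGCTGAAA")
usage_b = codon_usage("CTACTAAAG")
print(usage_a)
```

Both sequences encode the same protein, yet their codon-frequency vectors differ, which is exactly the signal codon-bias optimization works on.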
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Fast resource consumption could deplete cellular resources, leaving the assembly of viral particles or genome replication unfinished. The hijacking of molecular machinery by the infection can also shut down “garbage collection” processes. Consuming resources quickly generates waste at a high rate and eventually shuts down essential processes, triggers apoptosis, or creates other kinds of problems.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           What about long COVID?
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Viral persistence could trigger a long-term starvation strategy, and the heterogeneity of symptoms might be due to different genes being downregulated. Persistence would also select for viruses better adapted to the host that copy specific fragments more efficiently, leading to autoimmunity. As the host becomes a viral factory, RNA and protein synthesis produce large amounts of pyrophosphate. High pyrophosphate could trigger the formation of clots and lower the amount of ATP. Physical activity and exercise would push pyrophosphate back up to toxic levels, leading to exercise intolerance. Low ATP could also lower adenine levels, altering adenine-mediated circadian signaling such as blood pressure control and the need for sleep.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Although not exhaustive, the hypotheses above are a good starting draft for working out different mechanistic explanations of post-infection sequelae.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            As always, the complete code for this series of analyses can be found on my GitHub by clicking
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://github.com/wmac131/genome_sequencing" target="_blank"&gt;&#xD;
      
           here
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           .
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Fri, 02 Dec 2022 21:17:40 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/covid-19-adaptability-a-review-of-analytical-methods</guid>
      <g-custom:tags type="string">cluster,cnn,data analysis,regression,bioinformatics,machine learning,classification,Python</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_RGoPjpty6CiM6-d6.webp">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_RGoPjpty6CiM6-d6.webp">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>Dimensionality Expansion for Environmental Modeling</title>
      <link>https://www.williammcnamara3.com/dimensionality-expansion-for-environmental-modeling</link>
      <description>Another experiment with NASA data</description>
      <content:encoded>&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Dimensionality expansion
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           As described before, dimensionality expansion is a simple technique for creating a high-dimensional, dense data structure for machine learning applications. It is useful for large data samples, such as the long sequences presented in the previous example. Large image data could also be transformed to bring related attributes close together.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           To test that hypothesis, a dataset built from NASA's AIRS data is used to check whether the technique can find an accurate representation. The data consist of a series of scans of temperature, pressure, ozone, cloudiness, and other variables, sampled at 1-degree precision, resulting in an array of shape 180x360. Each scan (64,800 values) can then be padded and reshaped into an array of shape (32, 32, 64).
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
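Since 180 x 360 = 64,800 values while 32 x 32 x 64 = 65,536, a direct reshape does not work; the sketch below assumes zero padding of the flattened scan, which may differ from the preprocessing actually used:

```python
import numpy as np

# A 1-degree global scan: 180 x 360 = 64,800 values.
scan = np.random.rand(180, 360)

# The target shape 32 x 32 x 64 holds 65,536 values, so the flattened
# scan is padded (here with zeros, an assumption) before reshaping.
flat = scan.ravel()
target = 32 * 32 * 64
padded = np.pad(flat, (0, target - flat.size))
cube = padded.reshape(32, 32, 64)
print(cube.shape)
```

The cube form gives the convolutional autoencoder a dense, image-like input while keeping every original value.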
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Applying a simple convolutional variational autoencoder results in the ability to encode and decode the data. To better understand the learned representation and how it might affect the output, a latent walk is applied to the model, resulting in a series of images that reconstruct the data.
           &#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
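A latent walk is just a sequence of decoded points linearly interpolated between two latent codes. The decoder below is a toy stand-in, not the trained VAE decoder:

```python
import numpy as np

def latent_walk(decode, z_start, z_end, steps=8):
    """Decode points linearly interpolated between two latent codes."""
    alphas = np.linspace(0.0, 1.0, steps)
    return np.stack([decode((1 - a) * z_start + a * z_end) for a in alphas])

# Stand-in decoder (an assumption; a trained VAE decoder would go here):
# maps a 2-d latent code to a small 16x16 "image".
def toy_decode(z):
    return np.outer(np.sin(np.linspace(0, z[0], 16)),
                    np.cos(np.linspace(0, z[1], 16)))

frames = latent_walk(toy_decode, np.array([0.0, 1.0]), np.array([3.0, 2.0]))
print(frames.shape)  # (8, 16, 16)
```

Plotting the frames in order shows how the reconstruction morphs as the code moves through latent space.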
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_uEol6xbFz3SUjNWnEWnPkA.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           However, no specific pattern or cluster can be found in the learned dimensions; this is the case for the pressure data.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_lSdA9nU5eJdFyS-JOQ-pUQ.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Autoencoding other data sources resulted in weak clustering of the different samples within the learned representation.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_qLBLMgVpTWllJr34dQ3pWw.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Also, the latent walk shows recognizable patterns that change in the same direction as the clustering axis.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_uv6TmxYV7MUeetKsZFyfhQ.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The previous two examples show how a single data source can be used to train a simple autoencoder and obtain a small representation of the data. Also, the learned representation captures specific time-related changes that could be used in further applications.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The specific time scale obtained from this analysis could be used as a general time scale to improve weather and environmental modeling. Yet the precise identity of such a scale is not investigated here.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Now you have an example of how to use a simple technique to analyze climate data, and how to extend it with minimal changes to the code. As always, the complete code for this post can be found on my GitHub by clicking 
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://github.com/wmac131/random_forest_blog" target="_blank"&gt;&#xD;
      
           here
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           .
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Thu, 22 Sep 2022 23:24:21 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/dimensionality-expansion-for-environmental-modeling</guid>
      <g-custom:tags type="string">cnn,data analysis,machine learning,Python,expansion</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_lSdA9nU5eJdFyS-JOQ-pUQ.webp">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_lSdA9nU5eJdFyS-JOQ-pUQ.webp">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>Creating Synthetic Data with Random Forest Modeling</title>
      <link>https://www.williammcnamara3.com/creating-synthetic-data-with-random-forest-modeling</link>
      <description>creating synthetic data for incomplete NASA dataset</description>
      <content:encoded>&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Using machine learning to extrapolate missing data in large datasets.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           One of the main problems with many datasets is missing data: data whose existence is indicated by some annotation but whose values are absent. In time series, for example, missing data appear as gaps in the middle of the series. The values could most likely be inferred by just looking at the graph, yet approximating them programmatically generates a new and more complete dataset.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            A univariate time series can be divided with a sliding window, creating a series of features that can later be fed to a machine learning model to approximate the missing values. This standard approach can approximate the missing values with great accuracy. However, when the dimensionality of the time series increases, processing the data is not as straightforward. Take, for example, the data from the AIRS/Aqua L3 Daily Standard Physical Retrieval. This dataset consists of a series of features sampled across the entire planet, stored as individual files with daily information on the different features, and can be downloaded from
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://disc.gsfc.nasa.gov/datasets/AIRS3STD_006/summary"&gt;&#xD;
      
           here
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           . If a day is missing from the dataset, that file is absent, and due to the sampling process there are also consistent gaps in some geographical regions.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
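The sliding-window idea for a univariate series can be sketched as follows (the window width and data are illustrative):

```python
import numpy as np

def sliding_windows(series, width):
    """Split a 1-d series into (features, target) pairs with a sliding
    window: each row holds `width` consecutive values, and the target
    is the value that immediately follows them."""
    X = np.stack([series[i:i + width] for i in range(len(series) - width)])
    y = series[width:]
    return X, y

series = np.arange(10, dtype=float)
X, y = sliding_windows(series, width=3)
print(X.shape, y.shape)  # (7, 3) (7,)
```

A regressor trained on (X, y) can then predict a gap value from the window of observations preceding it.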
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_Z4rmXAxouLS5KXrLuGRgwg.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           These gaps slide through time, so after a while the entire globe is sampled. There are therefore two sources of missing data: missing days, where the data do not exist, and missing locations. To overcome the missing days, a day index is first created and a file name is attached to each index entry; if the file is missing, the previously sampled file is attached to that day instead. This forward filling handles small gaps, but if a gap spans several days the data will look as if it freezes for a brief period.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
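The forward-filling day index can be sketched like this; the file names are hypothetical:

```python
import datetime as dt

# Hypothetical file listing: one file per available day, some days missing.
available = {
    dt.date(2022, 1, 1): "AIRS.2022.01.01.hdf",
    dt.date(2022, 1, 2): "AIRS.2022.01.02.hdf",
    dt.date(2022, 1, 5): "AIRS.2022.01.05.hdf",
}

# Build a continuous day index; a missing day reuses the last sampled file.
start, end = dt.date(2022, 1, 1), dt.date(2022, 1, 6)
index = {}
day, last = start, None
while day != end:
    last = available.get(day, last)
    index[day] = last
    day += dt.timedelta(days=1)

print(index[dt.date(2022, 1, 4)])  # forward-filled: AIRS.2022.01.02.hdf
```

This is where the "frozen" stretches come from: a multi-day gap maps several consecutive days to the same file.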
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This index also facilitates filling the location gaps. One simple way to approximate the missing data is a 2D moving average, which can easily be computed by loading the data in fragments in the same order as the file index created above.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
      
           However, this approach also smooths the data, losing some information; still, the windowing idea behind the moving average provides enough context to fill the location-wise missing data.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
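A minimal version of that gap-filling moving average, assuming NaN marks the missing locations (real AIRS files use a fill value instead):

```python
import numpy as np

def moving_average_2d(grid, k=3):
    """Fill missing (NaN) points with the mean of the valid values
    inside a k x k neighborhood; valid points are left untouched."""
    valid = np.isfinite(grid)
    filled = np.where(valid, grid, 0.0)
    pad = k // 2
    out = np.full_like(grid, np.nan)
    rows, cols = grid.shape
    for i in range(rows):
        for j in range(cols):
            r0, r1 = max(i - pad, 0), min(i + pad + 1, rows)
            c0, c1 = max(j - pad, 0), min(j + pad + 1, cols)
            window_ok = valid[r0:r1, c0:c1]
            if window_ok.any():
                out[i, j] = filled[r0:r1, c0:c1].sum() / window_ok.sum()
    return np.where(valid, grid, out)

grid = np.array([[1.0, np.nan, 3.0],
                 [4.0, 5.0, np.nan],
                 [7.0, 8.0, 9.0]])
print(moving_average_2d(grid))
```

The Python loops are for clarity; on full 180x360 grids a vectorized or convolution-based version would be preferable.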
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Each file contains a masked array with information on each feature. This makes selecting data easy: simply keep the values that differ from the fill value. The array locations are also retrieved, yielding an array of locations and an array of values. A dummy time variable is then added to the location data to complete the first fragment of the training dataset. To complete the training data, the same procedure is applied to all the files inside a fixed-size window.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
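Extracting locations and values from a masked array, plus the dummy time column, can be sketched as follows (the tiny grid is illustrative):

```python
import numpy as np
import numpy.ma as ma

# Hypothetical daily grid: a masked array where masked cells are the
# unsampled locations (real AIRS files mask by fill value instead).
grid = ma.masked_invalid(np.array([[10.0, np.nan],
                                   [np.nan, 14.0]]))

# Keep only the observed cells: their (lat, lon) indices and values.
lat_idx, lon_idx = np.where(~grid.mask)
values = grid.compressed()

# Prepend a dummy time variable for this file to build one training fragment.
day = 0
features = np.column_stack([np.full(values.size, day), lat_idx, lon_idx])
print(features)
```

Stacking these (day, lat, lon) fragments over all files in the window yields the full training set.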
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           A random forest regressor is then trained on this dataset, and the last known time step inside the window is predicted. Although this is an in-sample prediction time-wise, it is out of sample location-wise. To reconstruct the complete set of locations, a mesh is created to evaluate every latitude and longitude in the data.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
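Training the regressor and predicting over a full latitude-longitude mesh might look like this; the toy signal and 10-degree mesh are assumptions, not the actual setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training fragment: (day, lat, lon) features and a toy
# temperature signal, mimicking the windowed dataset described above.
rng = np.random.default_rng(1)
days = rng.integers(0, 5, 200)
lats = rng.uniform(-90, 90, 200)
lons = rng.uniform(-180, 180, 200)
temps = 280 + 0.1 * lats + 0.01 * lons + days

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(np.column_stack([days, lats, lons]), temps)

# Mesh over 10-degree lat/lon cells for the last day in the window.
lat_g, lon_g = np.meshgrid(np.arange(-90, 90, 10), np.arange(-180, 180, 10))
mesh = np.column_stack([np.full(lat_g.size, 4),
                        lat_g.ravel(), lon_g.ravel()])
pred = model.predict(mesh).reshape(lat_g.shape)
print(pred.shape)
```

Because the mesh covers every cell, the prediction fills the location gaps the sampling pattern left behind.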
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This approach yields an accurate prediction of the missing data location-wise, while time-wise the reconstruction freezes during periods with large runs of missing data points.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Now you have an example of how to process large 2D time series data and some ideas on how to train a model to predict missing data. As always, the complete code can be found on my GitHub by clicking 
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://github.com/wmac131/random_forest_blog" target="_blank"&gt;&#xD;
      
           here
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
            and see you in the next one.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Sat, 17 Sep 2022 21:43:47 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/creating-synthetic-data-with-random-forest-modeling</guid>
      <g-custom:tags type="string">data analysis,regression,classifier,machine learning,Python</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_Z4rmXAxouLS5KXrLuGRgwg.webp">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_Z4rmXAxouLS5KXrLuGRgwg.webp">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>Impact of Genetic Mimicry on COVID-19 Adaptability</title>
      <link>https://www.williammcnamara3.com/impact-of-genetic-mimicry-on-covid-19-adaptability</link>
      <description>An exploration of different symptoms experienced after the acute illness</description>
      <content:encoded>&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The main idea behind
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Cyclical components inside the SARS-CoV-2 sequence hint at a deterministic pattern. That pattern must be driven by deterministic factors, which could be the environment, the host, or host adaptations to the environment. By following some of those variables, SARS-CoV-2 is able to adapt and continuously infect the host. Through this adaptation process, small changes in the sequence occur, leading to new variants. Although the process may appear random, that could simply be because we are unaware of how viruses adapt to the host.
           &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_SVGPUAkEN5JkWlI7bGVGSQ.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_9u6_mSys39BCWsPpuGHHQw.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           If we look at SARS-CoV-2 as a foreign invader trying to colonize a new location, its ability to colonize the new land depends on the available natural resources. Since the virus is a parasite that relies on the host's molecular machinery to make copies of itself, two main resources are needed: nucleotides, to copy the genetic material, and amino acids, to synthesize the different proteins needed for its assembly.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This dependence on resources could explain the cyclical pattern inside the sequence. If we assume host adaptations to the environment are the main driving force for viral adaptation, then the virus will try to mimic the host's gene-pool or transcript-pool composition at any given time. Alternatively, the host will be more susceptible to viral infection at times when the nucleotide pool matches the genetic composition of the virus.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The previous scenarios offer some level of explanation for a previously described phenomenon: endemicity. Tightly following the nucleotide pool can lead to a sustained level of susceptibility that reaches a plateau after some time, a hypoendemic scenario. Bursts of high susceptibility, meanwhile, line up with periods of high similarity between the virus and the nucleotide pool, a hyperendemic scenario.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Recapitulating: the environment changes, and this change generates a response in the host, turning some seasonal genes on and off. Changes in gene expression patterns in turn change the availability of the resources needed to synthesize viral particles. These changes make a cell susceptible to infection when particular environmental conditions are met.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           A possible mechanism
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_Ea8fNzxzFe5_oJ1QDj0AWg.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           When the SARS-CoV-2 virus reaches the cell, it is internalized. Once inside, replication of the viral genetic material generates a -RNA sequence, and this negative strand is used as a template to synthesize +RNA copies. Another process that takes place during infection is the shutdown of host gene expression, achieved by degrading cytoplasmic mRNA; in the case of SARS-CoV-2, the nuclease Nsp15 crops the different host mRNAs into pieces. There is a lot more going on during infection, but I am just going to focus on these two processes.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           On the host side, the first action against the infection is the degradation of nucleotides. This lowers the pool of free nucleotides, aiming to starve the virus: without nucleotides, the virus cannot replicate its genetic material unless it uses an alternative source. If free nucleotides are unavailable or scarce, the only remaining source of nucleotides is the mRNA fragments.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Capturing the fragments could follow a mechanism like the annealing of primers in a PCR reaction. Local changes in temperature linearize the viral -RNA template, and the different fragments then hybridize with it. Alternatively, as the -RNA template is being synthesized, the available fragments start to hybridize with it. This leads to a series of gaps that are later filled in with the available nucleotides.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This possible replication mechanism could explain the cyclical patterns and offer an explanation for the seasonality of viral diseases. It could also explain why analyzing the sequences with fragment frequencies captures such behavior. Recombination and mutation could be intermediate steps between two environmental conditions: as the available fragment pool changes, different kinds of annealings and gaps become possible.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           As a whole, the sequence contains the same components, but different arrangements lead to new amino acid changes. These result in new variants, immune response, and evasion, not because the virus is aiming for them, but because those are the only viral constructs that survive long enough to infect another host.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This mechanism could also explain the development of autoimmunity after viral infection. When a large enough host fragment is incorporated into a structural element of the virus, that fragment is recognized as foreign. Later, when the infection is cleared, the immune system can still recognize such fragments and react to them.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Some evidence of mimicry
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Reconstructing the SARS-CoV-2 sequence from a series of fragments is a complicated task. On one side, as SARS-CoV-2 evolves it changes its sequence, which in turn changes the number and kind of similar fragments, so a single sequence comparison leads to a biased approximation. Another problem is the lack of seasonal gene expression databases (at least to my knowledge), which increases the number of comparisons needed. Thus, on the host side, the best option is to compare the SARS-CoV-2 sequences with the reference transcripts, approximating the fragment and nucleotide pool available inside the cell.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Currently, I have tried only two approaches to compare the reference transcripts with the SARS-CoV-2 sequences. Both use autoencoders to obtain a more general comparison.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The first one relies on autoencoders and distance computations to select similar transcripts. The mean SARS-CoV-2 composition offers a fixed point to compare against the reference transcripts. A first round of selection keeps the transcripts whose distance is below a threshold; the selected transcripts are then compared with different samples from a latent walk. This ensures that the selected transcripts contain fragments similar to those of SARS-CoV-2.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
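A sketch of that first selection round, with randomly generated embeddings standing in for the real autoencoder outputs; the 5th-percentile distance threshold is an assumption:

```python
import numpy as np

# Hypothetical embeddings: rows are latent codes from an autoencoder.
rng = np.random.default_rng(2)
transcript_z = rng.normal(size=(1000, 8))   # reference transcript embeddings
viral_z = rng.normal(size=(50, 8))          # SARS-CoV-2 sample embeddings
anchor = viral_z.mean(axis=0)               # mean viral composition

# First round: keep transcripts whose Euclidean distance to the
# anchor falls below a threshold (here, the 5th percentile).
dist = np.linalg.norm(transcript_z - anchor, axis=1)
threshold = np.quantile(dist, 0.05)
selected = np.flatnonzero(threshold > dist)
print(selected.size)
```

The surviving transcripts would then be re-scored against samples drawn along a latent walk, as described above.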
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This approach selects around 507 transcripts. Of those, around 54 have experimental evidence of a role in COVID-19. At this point the screening becomes more difficult, primarily because of the scarcity of information about some of the transcripts: scraping UniProt yields 135 records out of 507, lowering the odds of clustering the information by similarity.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           But a subset of the retrieved transcripts showed some agreement with solar radiation, an environmental variable that is tightly correlated with COVID-19 waves. In particular, some selected transcripts are related to vitamin D, whose synthesis depends on solar radiation, and low levels of vitamin D have been correlated with COVID-19 complications. LCOR, similar to SARS-CoV-2, associates with HDAC6, which in turn has a regulatory role for the vitamin D receptor (VDR). SLC2536A, similar to SARS-CoV-2, increases its mitochondrial expression after vitamin treatment. CDS1, similar to SARS-CoV-2, is regulated by VDR. SLC16A10, similar to SARS-CoV-2, has been proposed as a VDR response element.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The second approach tries to automate things a little further. The transcript frequencies are fed into an autoencoder and selected by reconstruction: sequences with a low reconstruction error and a low z-score are kept. This excludes outliers and selects sequences that are reconstructed correctly and whose reconstruction lies within the domain of the learned representation.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
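The selection rule above can be sketched in a few lines of Python. The function name and the z-score threshold here are illustrative, not the exact values used:

```python
import numpy as np

def select_low_error(X, X_hat, z_thresh=-0.5):
    """Return indices of sequences whose reconstruction-error z-score is
    below z_thresh, i.e. sequences the autoencoder reconstructs well.
    X is the original encoding matrix, X_hat the autoencoder's output."""
    errors = np.mean((X - X_hat) ** 2, axis=1)   # per-sequence MSE
    z = (errors - errors.mean()) / errors.std()  # standardize the errors
    return np.where(z < z_thresh)[0]             # keep the low-error tail
```

In practice `X_hat` would come from the trained autoencoder's forward pass over the frequency encodings.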
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            ﻿
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Using this approach with an autoencoder trained on the full SARS Cov2 sequences results in 358 sequences selected from the reference transcripts, with most of them located on chromosomes 6, Y, and 4.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_d2WbXrATtVxMXZE5cY5jQg.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Yet, due to its size, some fragments inside the SARS Cov2 sequence might be neglected. To get a better segmentation, the sequences are split into two fragments: one that contains the non-structural genes, and another with the structural genes. This allows finding fragment-specific similarities, if there are any.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
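A minimal sketch of that split, assuming a boundary near the end of ORF1ab in the reference genome; the exact position varies slightly by isolate, so the default here is illustrative:

```python
def split_genome(seq, boundary=21555):
    """Split a SARS-CoV-2 genome string into its non-structural (ORF1ab)
    and structural segments at an approximate boundary position."""
    return seq[:boundary], seq[boundary:]
```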
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The non-structural segment showed an almost continuous set of transcripts similar to the SARS Cov2 sequence. The selected transcripts showed similarity to SARS Cov2 sequences isolated at different points in time. Most of the selected transcripts were located on chromosomes 3, 5, and 2.
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            ﻿
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_7u1dLuLDqq1iKNCKgVCX3w.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The structural segment, in contrast, resulted in the selection of transcripts that clustered together at a specific part of the learned representation, with most of the selected transcripts located on chromosomes 4, 2, and 3.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_Iv49srLMi6NANCHLh0NYZg.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           In the second approach, there was no manual review of the selected transcripts, as the selection increased around 10-fold, making it harder to find information on each one. Yet the segmentation of the SARS Cov2 sequence appears to capture more information about the similarity between the transcripts and the viral sequences. And since the comparison is made using the autoencoder, the result is a global, population-level comparison.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Some consequences
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           If susceptibility is driven by genetic mimicry, and one of the first actions of the infected cell is to starve the virus by restraining the available nucleotides, then a long-term solution could be to lower or shut down the expression of a series of genes with similar compositions. This shutdown could lead to the different symptoms experienced during the illness, or could drive the different symptoms experienced after the acute phase.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           If that is the case, then different degrees of similarity to different genes could lead to a series of apparently unrelated symptoms. As the virus adapts to the new cellular resources, it will change how similar it is to a new subset of genes, and this change will result in different symptoms or different post-acute sequelae.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This proposed mechanism alone cannot explain the complete set of changes that happen during SARS Cov2 infection, in both the acute and post-acute phases. However, within the small subset of changes it tries to explain, it is a plausible explanation. Continued research is needed to establish the accuracy of the presented hypothesis.
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Sat, 20 Aug 2022 20:47:13 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/impact-of-genetic-mimicry-on-covid-19-adaptability</guid>
      <g-custom:tags type="string">cluster,data analysis,bioinformatics,machine learning,classification,reduction,Python</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_7u1dLuLDqq1iKNCKgVCX3w.webp">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_7u1dLuLDqq1iKNCKgVCX3w.webp">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>Remarks at UVA Graduation</title>
      <link>https://www.williammcnamara3.com/remarks-at-uva-graduationc322b5a9</link>
      <description>Speech given at the University of Virginia on July 31st, 2022</description>
      <content:encoded>&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;b&gt;&#xD;
      
                      
      Speech given at the University of Virginia July 31st, 2022
    
                    &#xD;
    &lt;/b&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    
                    
    "Well, here we are.
  
                  &#xD;
  &lt;/p&gt;&#xD;
  &lt;b&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/b&gt;&#xD;
  &lt;p&gt;&#xD;
    
                    
    Let me start by saying how happy I am for everybody here today. It seems like only yesterday our class was meeting for the first time at the Arlington campus. It was in the throes of the pandemic, so we were still wearing masks. Which did a good job hiding our terrified faces when our finance professor kicked off the program by cold calling Amanda on the difference between Net Income and Cash Flow.
  
                  &#xD;
  &lt;/p&gt;&#xD;
  &lt;b&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/b&gt;&#xD;
  &lt;p&gt;&#xD;
    
                    
    5 whole modules we’ve survived since then…
  
                  &#xD;
  &lt;/p&gt;&#xD;
  &lt;b&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/b&gt;&#xD;
  &lt;p&gt;&#xD;
    
                    
    In Mod 1, we learned how to ANALYZE, a word I’ve since committed to having tattooed on my body. Eric learned to hedge investments not himself; and unfortunately we all learned that UVA football, well, next year will be our year.
  
                  &#xD;
  &lt;/p&gt;&#xD;
  &lt;b&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/b&gt;&#xD;
  &lt;p&gt;&#xD;
    
                    
    In Mod 2, we learned linear regression, we learned just how many people aren’t paying their mortgages, and we all learned the hard way that trying to compete with Dan Gogue’s machine learning models is futile. Still not convinced he’s not a tenured professor undercover as the Reese’s Guy.
  
                  &#xD;
  &lt;/p&gt;&#xD;
  &lt;b&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/b&gt;&#xD;
  &lt;p&gt;&#xD;
    
                    
    In Mod 3 we…what did we do in Mod 3, does anyone remember? I was face deep in Christmas cookies at the time. What I do know I learned is that, contrary to what one might assume, a SQL query will not run just because you yell obscenities at it. And that knowledge alone is apparently enough to qualify you to consult for Hilton. Michelle Tansey enjoyed it so much she decided to go work there!
  
                  &#xD;
  &lt;/p&gt;&#xD;
  &lt;b&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/b&gt;&#xD;
  &lt;p&gt;&#xD;
    
                    
    In Mod 4 we did some genuinely pretty crazy stuff. I don’t know if you all realize how insane it is that we went straight from writing basic SELECT queries in SQL to neural networks, image recognition, and creating deepfakes of Shelly’s dog. I feel like Mod 4 was the MSBA version of a tequila shot. I mean, we learned that Songyuan and Allen are apparently the missing members of Van Halen, and one of our professors almost punched Ben Fishburn in the face. By the way Aidan, just so you know, Tequila is like water that makes you make bad decisions. That’s what our professors call a testable hypothesis, and anyone who wants to help test that hypothesis, Wes says the job to be done is at Boylan Heights tonight.
  
                  &#xD;
  &lt;/p&gt;&#xD;
  &lt;b&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/b&gt;&#xD;
  &lt;p&gt;&#xD;
    
                    
    But crazy is how the world of data is, everybody here knows it comes at you fast. It’s as scary as it is exciting. Demanding as it is liberating.
  
                  &#xD;
  &lt;/p&gt;&#xD;
  &lt;b&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/b&gt;&#xD;
  &lt;p&gt;&#xD;
    
                    
    I have loved my brief time here at UVA. And by UVA I mean my zoom room provided by UVA. In large part because of the wonderful people I’ve spent it with, but also because of the fascinating things we’ve learned. Things that would have been unthinkable to me 10 years ago.
  
                  &#xD;
  &lt;/p&gt;&#xD;
  &lt;b&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/b&gt;&#xD;
  &lt;p&gt;&#xD;
    
                    
    It’s odd to remember now, but growing up, I did not like math. Not one bit. Until one day in college, I had to take a complex systems course for my research track. It finally connected math to the world around me. And helped me realize that what I was struggling with wasn’t math’s importance, rather a difficulty with the premise that anything in our real world can be simply explained on a single page.
  
                  &#xD;
  &lt;/p&gt;&#xD;
  &lt;b&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/b&gt;&#xD;
  &lt;p&gt;&#xD;
    
                    
    But Complex systems. Complex Systems don’t have right answers. They are sprawling webs of interconnected variables that don’t always behave the way you expect them to, and most of the time you won’t even know why.
  
                  &#xD;
  &lt;/p&gt;&#xD;
  &lt;b&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/b&gt;&#xD;
  &lt;p&gt;&#xD;
    
                    
    This is the philosophy we’ve found woven through every lesson we have learned in the MSBA program over the last year. In Data Science, we must dismiss the idea that we will ever be right. Because our professors taught us that 100% predictive confidence means something is probably wrong with your data, and you should go back and yell some more obscenities at your SQL query. Then buy Dustin a beer and ask him to fix it for you.
  
                  &#xD;
  &lt;/p&gt;&#xD;
  &lt;b&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/b&gt;&#xD;
  &lt;p&gt;&#xD;
    
                    
    Side note: my very favorite memory from this program was our Python professor reassuring Laurel that her faltering model was actually the best model because as long as you did the opposite of what her model said you should do, you’d be right.
  
                  &#xD;
  &lt;/p&gt;&#xD;
  &lt;b&gt;&#xD;
    &lt;br/&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/b&gt;&#xD;
  &lt;p&gt;&#xD;
    
                    
    We know to some extent we will always be wrong. That’s the humbling nature of the world we live in. But every morning we wake up, take stock of what we’ve learned, and do everything we can before the sun goes down to be less wrong than the day before.
  
                  &#xD;
  &lt;/p&gt;&#xD;
  &lt;b&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/b&gt;&#xD;
  &lt;p&gt;&#xD;
    
                    
    I’ve since adopted this as the best framework I’ve found for approaching data science, and all things that are complex, in truth for approaching life.
  
                  &#xD;
  &lt;/p&gt;&#xD;
  &lt;b&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/b&gt;&#xD;
  &lt;p&gt;&#xD;
    
                    
    The world is changing rapidly every second of every day, and no analytical model, no AI, or supercomputer will ever be able to make full sense of it. But each of us can make decisions in our day, big and small, to do what we can to create less wrong. And as in any complex system, we will see emerge an undeniable trend, a greater direction in what seem like little things.
  
                  &#xD;
  &lt;/p&gt;&#xD;
  &lt;b&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/b&gt;&#xD;
  &lt;p&gt;&#xD;
    
                    
    Nowhere will this be more important than in the careers we have ahead of us. Some of the most important ethical questions of our time relate to how we use data.
  
                  &#xD;
  &lt;/p&gt;&#xD;
  &lt;b&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/b&gt;&#xD;
  &lt;p&gt;&#xD;
    
                    
    Where are the boundaries of public and private data? Is it ethical to use machine learning to replicate an individual’s image? Or predict their actions? Or will prerealizing our expectations of each other diminish our capacity to exceed them?
  
                  &#xD;
  &lt;/p&gt;&#xD;
  &lt;b&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/b&gt;&#xD;
  &lt;p&gt;&#xD;
    
                    
    We didn’t learn the right answers to these questions, because there aren’t any. There is only less wrong. Each of us will be co-authors of our future, whether you’re an AI engineer at Tesla or a crash test dummy at Tesla. And my wish for you, my friends, is that we will all meet every tomorrow less wrong than the day before.
  
                  &#xD;
  &lt;/p&gt;&#xD;
  &lt;b&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/b&gt;&#xD;
  &lt;p&gt;&#xD;
    
                    
    I do not believe there is such a thing as a perfect data scientist, nor are there perfect humans. But it has been my great privilege to learn from and alongside amazing people, dedicating what little time they have to be better at both.
  
                  &#xD;
  &lt;/p&gt;&#xD;
  &lt;b&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/b&gt;&#xD;
  &lt;p&gt;&#xD;
    
                    
    Thank you, and congratulations to you all!"
  
                  &#xD;
  &lt;/p&gt;&#xD;
  &lt;br/&gt;&#xD;
  &lt;p&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Mon, 01 Aug 2022 00:00:00 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/remarks-at-uva-graduationc322b5a9</guid>
      <g-custom:tags type="string">personal</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/WILLIAMS_002.jpg">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/WILLIAMS_002.jpg">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>Methods for Genetic Classification of Covid (Part 3)</title>
      <link>https://www.williammcnamara3.com/methods-for-genetic-classification-of-covid-part-3</link>
      <description>Part 3 - a recap of methodology</description>
      <content:encoded>&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           In a series of posts, I used some sampling schemes to preprocess large biological sequences, particularly SARS Cov2 sequences, mainly due to their availability from the NCBI SARS Cov2 resources site.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           I used two representation schemes: a frequency-based and a graph-based scheme. In the frequency-based scheme, the sequences were divided into overlapping fragments, and the frequency of those fragments was used to find a low-dimensional representation of the sequences. As the overlapping fragments resembled the construction of a de Bruijn graph, I extended the idea by using different graph construction schemes.
           &#xD;
      &lt;span&gt;&#xD;
        
            ﻿
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
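The frequency-based encoding described above amounts to counting overlapping k-mers and normalizing. A minimal sketch (the function name and the fixed ACGT ordering are illustrative):

```python
from collections import Counter
from itertools import product

def kmer_frequencies(seq, k=4):
    """Frequency vector of overlapping k-mers in a DNA sequence.
    Positions are ordered over all 4**k possible k-mers so that
    vectors from different sequences are directly comparable."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts.get(km, 0) / total for km in kmers]
```

For k = 1 this reduces to nucleotide content (the "composition" defined below); for k = 4 it gives the 256-dimensional encoding used throughout these posts.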
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_gIvIyH5aHfqqsYhC2pzgDg.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Both schemes create a small representation of the sequence, though at the current stage it is not possible to recreate the original sequence from it. However, it is possible to get a general overview of the sequence with scarce computational resources.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Applying a PCA or a variational autoencoder (VAE) to those representation schemes results in a series of clusters with a strong temporal component.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            ﻿
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      
           (From this point on, and in the following posts, I will refer to sequence encodings as either the frequency-based or the graph-based sequence representations. The learned representation will refer to the bottleneck in the VAE or another network. And composition will refer to the frequency-based representation of the sequence. This distinction is made because the single-element frequency matches the content of the different nucleotides in the sequence; in that case the value has a well-defined physical meaning, while the meaning of the remaining values is less clear.)
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-05-09+at+3.35.19-PM.png" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Thus the SARS Cov2 sequences contain some sort of seasonal clock. This seasonal clock could be a side effect of sampling bias, as the number of isolates taken for sequencing is about 10 to 20 times higher in the second year of the pandemic. However, removing that sampling bias by subsampling the sequences showed similar results: representations with a strong temporal component.
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            ﻿
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_dIgYh0RmWfwQ-jH7.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           A VAE is constructed from an encoder and a decoder network: the encoder yields the learned representation, while the decoder returns an approximation of the original data point. The decoder network also works as a generative model and offers a way to approximate changes inside the input. Thus, the changes or properties that yield the temporal component can be traced back by analyzing selected points inside the learned representation rather than the whole dataset. Specific patterns can be obtained by analyzing the characteristics of a VAE latent walk.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
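A latent walk can be sketched as decoding points interpolated between two coordinates in the learned representation; here `decoder` stands in for the trained VAE decoder, and the linear interpolation is a simplifying assumption:

```python
import numpy as np

def latent_walk(decoder, z_start, z_end, steps=10):
    """Decode points linearly interpolated between two latent coordinates.
    Comparing consecutive outputs shows how the reconstructed encodings
    change across the learned representation."""
    alphas = np.linspace(0.0, 1.0, steps)
    path = [(1 - a) * z_start + a * z_end for a in alphas]
    return np.array([decoder(z) for z in path])
```

Walking between clusters separated in time is what lets the temporal signal be traced back to specific encoding dimensions.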
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_nfa4-oAq1wkesudp.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The clock inside the sequences is encoded by the change in the frequency of different 4-base fragments inside the SARS Cov 2 genome. The temporal information is encoded mainly in the structural components of the SARS Cov 2 genome. This does not mean that the other parts of the viral genome cannot change, but rather that those “constant” regions might follow another kind of pattern, or that the sequence encoding is unable to provide enough information to characterize such regions.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_4E0APqoq3qAuzsisnIxiVA.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Plotting the frequency of those 4-bases combinations through time results in a wave-like pattern inside the plots.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_MKH965GnWllG_mCM.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           However, when I use day duration (day length) instead of the isolation date as a measure of time, this wave-like behavior disappears.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_oHkuptvLqCisMfS3.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The use of day duration as a measure of time was the result of several attempts to merge environmental information and the learned representations. Previous attempts showed an agreement between environmental variables with a wave-like pattern.
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            ﻿
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_uqZKwA2tMwN3h1E8.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Using day duration as a temporal scale rather than the Julian calendar day started to show some particularly useful characteristics. Most of the cases were confined to the extremes, at the minimum and maximum day duration for a particular location.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
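Day length can be approximated from the day of year and latitude with the standard solar-declination formula. This is a generic textbook approximation, not necessarily the exact calculation used for these plots:

```python
import math

def day_length_hours(day_of_year, latitude_deg):
    """Approximate day length (sunrise to sunset) in hours.
    Uses the common declination approximation; adequate at mid-latitudes."""
    decl = math.radians(-23.44) * math.cos(
        2 * math.pi / 365.0 * (day_of_year + 10))
    lat = math.radians(latitude_deg)
    x = -math.tan(lat) * math.tan(decl)
    x = max(-1.0, min(1.0, x))          # clamp for polar day / polar night
    return 24.0 / math.pi * math.acos(x)
```

Differencing this value between consecutive days gives the rate of change of day duration discussed next.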
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_N901KMBLEfGiRnZu.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           It also showed that the rate of change in day duration between consecutive days offers a way to approximate the start and end of a COVID-19 wave at a particular location. This can be used to establish the relative transmission risk of COVID-19, joining an environmental change to viral transmissibility, similar to the link between abrupt changes in temperature and the flu and other winter illnesses.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Why does the SARS Cov2 virus follow such a scale? That is a question to which I have no concrete answer. Nevertheless, the SARS Cov2 genome is similar in composition to a series of genes expressed through the action of VDR, the vitamin D receptor, and vitamin D is produced on exposure to solar radiation. Yet it is also similar to a series of other genes with apparently little involvement with solar radiation. Nonetheless, temperature is correlated to the learned representation and also correlated to solar radiation. Day duration appears to work as a control variable by keeping sequence composition constant, and day duration is correlated to solar radiation. And some genes similar to SARS Cov2 are regulated by solar radiation. Thus I think it is safe to assume that solar radiation has a role in COVID-19 temporal adaptation. It might not be the complete picture, but it is an important part of it.
           &#xD;
      &lt;span&gt;&#xD;
        
            ﻿
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Thu, 02 Jun 2022 20:28:23 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/methods-for-genetic-classification-of-covid-part-3</guid>
      <g-custom:tags type="string">cluster,cnn,data analysis,bioinformatics,machine learning,classification,reduction,Python</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-05-09+at+4.30.55-PM.png">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-05-09+at+4.30.55-PM.png">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>Proportional Hazard in Online Gaming</title>
      <link>https://www.williammcnamara3.com/proportional-hazard-in-online-gaming</link>
      <description>Online gaming communities need to work harder to close the gap for their female users.</description>
      <content:encoded>&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Online gaming communities need to work harder to close the gap for their female users
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            I'm not a particularly avid online gamer, but I dabble. I've written elsewhere about my love for strategy games, in particular
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="/the-mind-blowing-math-behind-chess"&gt;&#xD;
      
           chess
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            . But one thing that's definitely apparent in the online gaming community at large is that there is a gender imbalance, particularly at the higher levels where the game might be monetized. One gaming community had read my
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="/what-do-hospitals-and-retailers-have-in-common"&gt;&#xD;
      
           previous project
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            on survival analysis, and asked if I could take a look at their user data and assess what might be causing this discrepancy in female gaming professionals (at least on their platform).
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            I was excited to tackle this problem, but at the same time it was different in some meaningful ways from what I did for the retailer. First, the event we're trying to model isn't a sale, but rather a specific user reaching what we will call "elite status". My first thought was to use a Kaplan-Meier Estimator for the men and for the women separately. I've talked in depth about this estimator
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="/what-do-hospitals-and-retailers-have-in-common"&gt;&#xD;
      
           before
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            so I won't revisit the math; suffice to say it's a cumulative model for the probability an event hasn't occurred after x time intervals. I partition the data into male and female subsets, feed the data into the estimator, and get the following Kaplan-Meier curves:
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
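For reference, the product-limit estimate behind those curves can be computed from scratch. In practice a library such as lifelines does this for you, fit once per gender subset; the sketch below is illustrative:

```python
import numpy as np

def kaplan_meier(durations, events):
    """Product-limit estimate of S(t): the probability the event
    (here, reaching elite status) has NOT occurred by each event time.
    `events` marks which users actually reached the event (1) vs were
    censored (0)."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)
    times = np.unique(durations[events])
    surv, s = [], 1.0
    for t in times:
        at_risk = np.sum(durations >= t)        # users still under observation
        d = np.sum((durations == t) & events)   # events occurring at time t
        s *= 1.0 - d / at_risk                  # multiply in this step's survival
        surv.append(s)
    return times, np.array(surv)
```

Running it once on the male subset and once on the female subset yields the two curves plotted above.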
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2023-08-03+at+10.08.14+AM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The graph can be interpreted along the x axis as the number of days that have elapsed, and along the Y axis as the probability they haven't reached elite status. We see that after 1500 days (4 years) of gaming, there is a greater than not chance male users have reached elite status (58%), but the outlook for female users is less optimistic (40%). So there is a noticeable difference, but the weakness of this estimator is that it does nothing for answering if the difference is significant or what factors are driving it. We want to examine the relationship of the event distribution to covariates.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
            For this we need a regression model. There are a few popular models in survival regression; my favorite is Cox's model. The idea behind Cox's proportional hazards model is that the log-hazard of an individual is a linear function of their covariates plus a population-level baseline log-hazard that changes over time. Mathematically:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2023-08-03+at+10.28.12+AM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
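The key consequence of this formula is that the ratio of two individuals' hazards does not depend on t, which is the "proportional" part of the name. A small numeric sketch, with an arbitrary baseline hazard and illustrative coefficients (the 0.44 is chosen only because exp(0.44) is about 1.55, the gender hazard ratio reported below; it is not the fitted value):

```python
import math

def cox_hazard(t, x, beta, baseline):
    """Cox model: h(t | x) = h0(t) * exp(beta . x)."""
    return baseline(t) * math.exp(sum(b * xi for b, xi in zip(beta, x)))

baseline = lambda t: 0.001 + 1e-6 * t   # arbitrary time-varying baseline h0(t)
beta = [0.44, 0.12]                     # illustrative coefficients: [gender, premium]

male, female = [1, 0], [0, 0]
for t in (100, 1000):
    hr = cox_hazard(t, male, beta, baseline) / cox_hazard(t, female, beta, baseline)
    print(f"t={t}: hazard ratio = {hr:.2f}")   # constant in t: exp(0.44) ~ 1.55
```

Whatever the baseline hazard does over time, it cancels in the ratio; that cancellation is also why the model can be fit without ever specifying h0(t).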
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
            The summary output describes the formula used to fit the regression model and the sample, and reports, for each covariate, a beta coefficient, its standard error, test statistics, and a p-value. I picked out the two variables with the highest hazard ratios:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
            The quantities exp(coef) are called hazard ratios (HR). A coefficient greater than zero, or equivalently a hazard ratio greater than one, indicates that as the value of the covariate increases, the event hazard increases and the expected survival time decreases. Put another way, a hazard ratio above 1 indicates that a covariate is positively associated with reaching elite status, and thus negatively associated with how long it takes to get there.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           In summary,
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;ul&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        
            HR = 1: No effect
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        
             HR &amp;lt; 1: Lower hazard (slower to reach elite status)
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        
             HR &amp;gt; 1: Higher hazard (faster to reach elite status)
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
  &lt;/ul&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
             It seems that a user's being a premium member of the gaming community (meaning they pay for additional features) carries a high hazard ratio, but a p-value of 0.29 indicates it probably isn't statistically significant. However, we do see that gender has a statistically significant impact on how long it takes to reach elite status.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
            Similar to logistic regression, the way we interpret exp(coef) depends on the variable type. For numerical variables, the hazard is multiplied by a factor of exp(coef) for each one-unit increase in the variable. For the gender variable, exp(coef) is equal to 1.55. This means that at any given time t, the hazard of reaching elite status for male users (gender = 1) is 1.55 times that of female users (gender = 0).
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
            Recall that male users were more than 50% likely to reach elite status within four years, with access to the kinds of audiences needed to monetize, while female users were not. That is a huge gap, and my recommendation was that the gaming community do everything it can to close it. Unfortunately, I didn't have the additional data to recommend what those strategies could effectively be. One thought is that the gap may be related to the interactions female users have on the gaming platform, driven by entrenched bias, conscious and unconscious, within the community. I'm hopeful I will have the opportunity to revisit their data with a wider lens.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Sun, 06 Mar 2022 16:23:25 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/proportional-hazard-in-online-gaming</guid>
      <g-custom:tags type="string">data analysis,regression,survival,machine learning,Python,professional</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/12055_2020_1108_Fig1_HTML-3059810d-34fc16ef-43b7381f-05c2ded8.png">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/12055_2020_1108_Fig1_HTML-3059810d-34fc16ef-43b7381f-05c2ded8.png">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>What do retailers have in common with hospitals?</title>
      <link>https://www.williammcnamara3.com/what-do-hospitals-and-retailers-have-in-common</link>
      <description>Hospitals hold the key to predicting how long a product will be on the shelf.</description>
      <content:encoded>&lt;div data-rss-type="text"&gt;&#xD;
  
         I spent some time working on one of the most interesting problems I've come across in my career.
         &#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          It started off simple: a retailer is planning to put a new product on its shelves. But as anybody who works in retail will tell you, there is a cost to keeping an item on the shelf. On top of the actual cost of goods sold, there are the costs of maintaining the location, as well as the opportunity cost of not having a different item in that spot. Every day a product sits on your shelf, it eats into your margin. Sadly for me, somebody else had already done that analysis, and their conclusion was that if a product is going to stay on the shelf longer than 30 days, it should be offered at a discount to get it sold sooner.
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          So my challenge: how likely is this product to sell within 30 days?
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          My first thought was to use a logistic regression algorithm like I have for other use cases (see:
          &#xD;
    &lt;font&gt;&#xD;
      &lt;a href="/building-a-better-prediction-engine"&gt;&#xD;
        
            building a better prediction engine
           &#xD;
      &lt;/a&gt;&#xD;
    &lt;/font&gt;&#xD;
    
          ). But the thing about classification algorithms is they are used to predict how likely an event is to occur at all, so one would be useful in telling me if this product is going to sell
          &#xD;
    &lt;i&gt;&#xD;
      
           eventually
          &#xD;
    &lt;/i&gt;&#xD;
    
          . But if I want to know if this product can sell in 30 days, I'll need an algorithm that can recognize what features relate to the amount of time it takes an event to occur.
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          You know who's really good at stuff like this?
          &#xD;
    &lt;b&gt;&#xD;
      
           Hospitals
          &#xD;
    &lt;/b&gt;&#xD;
    
          .
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          Hospitals need to be able to predict how long it will be before a tumor recurs, how long a patient will spend in surgery, or how long it will be before a vital machine fails. They use a technique called survival analysis, which is a means of estimating the probability that something takes longer than a given amount of time. 
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          The time on the shelf T may be thought of as a random variable with a probability density function f(t) and cumulative distribution function F(t) = Pr{T &amp;lt;= t}; F(30) gives the probability that the item has been sold within 30 days. It is often more useful to work with the complement of F(30), which will be: 
         &#xD;
  &lt;/div&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2023-08-02+at+5.40.21+PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           which gives the probability of the item being on the shelf just after 30 days, or more generally, the probability that the item has not been sold in 30 days. There are several ways to represent the distribution of T: The most familiar is likely the probability-density function.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The simplest parametric model for survival data is the exponential distribution, with probability density function and single rate parameter λ in the following form:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2023-08-02+at+5.43.14+PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
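Under this exponential model the survival function has the closed form S(t) = exp(-λt). A quick sketch with a hypothetical rate λ (roughly, the fraction of remaining stock that sells each day; not fit to any real data):

```python
import math

lam = 0.03   # hypothetical daily sale rate

def density(t):
    """f(t) = lambda * exp(-lambda * t)"""
    return lam * math.exp(-lam * t)

def survival(t):
    """S(t) = 1 - F(t) = exp(-lambda * t)"""
    return math.exp(-lam * t)

# Probability the item is still on the shelf after 30 days:
print(round(survival(30), 3))   # exp(-0.9) ~ 0.407
```

The single parameter makes this model easy to fit but also assumes the daily chance of selling never changes, which is exactly the assumption the nonparametric approach below avoids.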
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           I was really excited to try out a methodology I learned in grad school called Kaplan-Meier estimation. It involves computing the conditional probability of surviving each interval of time, then multiplying these successive probabilities together with any earlier computed probabilities to get the final estimate.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The total probability of a product still being on the shelf after 30 days is calculated by multiplying the probabilities of the product still being on the shelf at every time interval before 30 days (applying the multiplication law of probability to get a cumulative probability). For example, the probability of a product still being on the shelf after two days is the probability of it still being there after the first day multiplied by the probability of it being there after the second day given that it was there after the first; that second probability is a conditional probability. Although the probability calculated at any single interval is not very accurate because of the small number of events, the overall probability of lasting to each point is more accurate.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
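That chain of conditional probabilities is just a running product. A toy illustration with made-up daily counts (items unsold entering each day, items sold during the day):

```python
# Hypothetical shelf data: (items unsold entering the day, items sold that day).
daily = [(100, 4), (96, 3), (93, 5), (88, 2)]

surviving = 1.0
for at_risk, sold in daily:
    surviving *= 1 - sold / at_risk   # P(unsold today | unsold so far)

print(round(surviving, 3))   # 0.86
```

With no censoring the product telescopes (86 of the original 100 remain unsold), but as soon as items leave the risk set for other reasons, the interval-by-interval product is what keeps the estimate honest.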
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            As usual, I can count on scikit-survival to have an
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://scikit-survival.readthedocs.io/en/stable/api/generated/sksurv.nonparametric.kaplan_meier_estimator.html" target="_blank"&gt;&#xD;
      
           estimator
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            I can use. I plugged in all the data about how long the product has historically been on the shelf and got the following Kaplan-Meier curve:
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/kaplan-meier.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This can be interpreted to mean that there is a less than 50% chance a product is still on the shelf after 30 days, or in other words, greater than 50% confidence that the product will be sold within 30 days. My recommendation here is for the retailer to define what probability threshold it would like to see before offering a discount. Maybe it's fine with better-than-even confidence, or maybe it would like to get to 80% confidence.
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            With the right amount of historical data we could do a comparative analysis at different price points and see how much of a discount should be offered to get to that degree of confidence. Perhaps we can come back and do that later.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This was a particularly fun analysis because it allowed me to experiment with cumulative probability over a defined time period, which is a concept I think can be applied to a lot of commercial challenges beyond healthcare.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Tue, 15 Feb 2022 23:14:39 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/what-do-hospitals-and-retailers-have-in-common</guid>
      <g-custom:tags type="string">data analysis,regression,survival,machine learning,Python,professional</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/kaplan-meier.png">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/kaplan-meier.png">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>Methods for Genetic Classification of Covid (Part 2)</title>
      <link>https://www.williammcnamara3.com/methods-for-genetic-classification-of-covid-part-2</link>
      <description>Part 2 - Extending methods with variational autoencoders</description>
      <content:encoded>&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Extending previous methods with variational autoencoders
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Since the virus was first identified and sequenced, the number of available SARS-Cov-2 sequences has continued to grow. This continuous surveillance has made possible the fast detection of different variants. Depending on the number and importance of changes to their genetic information, such variants are classified into three groups.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           A variant under monitoring (VUM) is a “variant with genetic changes that are suspected to affect virus characteristics with some indication that it may pose a future risk, but evidence of phenotypic or epidemiological impact is currently unclear”.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           A variant of interest (VOI) is a “variant with genetic changes that are predicted or known to affect virus characteristics such as transmissibility, disease severity, immune escape, diagnostic or therapeutic escape. And identified to cause significant community transmission or multiple COVID-19 clusters, in multiple countries with increasing relative prevalence alongside an increasing number of cases over time, or other apparent epidemiological impacts to suggest an emerging risk to global public health”.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           And a variant of concern (VOC) meets all the criteria to be defined as a VOI but with one or more of the following characteristics: “Increase in transmissibility or detrimental change in COVID-19 epidemiology. Increase in virulence or change in clinical disease presentation. Or decrease in the effectiveness of public health and social measures or available diagnostics, vaccines, therapeutics”.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           These variants’ definitions were taken from the WHO website and are adjusted periodically. Each variant is assigned to a lineage or established as a new one using different methods. Then an expert panel discusses the available information and presents it to the public.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Computational Methods for Variants Identification and Classification
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           There are two main tools used to find the particular lineage of a SARS-Cov-2 sequence. The first one is Nextstrain, an open-source project that aims to track pathogenic genome data. On its website, we can find the latest SARS-Cov-2 sequence analysis. The main component of this analysis is the SARS-Cov-2 phylogenetic tree, making it the central tool for understanding the evolution of the virus and detecting new variants.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           To build the phylogenetic tree, Nextstrain uses a tool named TreeTime. It finds an approximate maximum-likelihood configuration of the phylogenetic tree, with large sequence alignments as input data. But this process is computationally expensive and time-consuming, and it will only be helpful if the sequence to analyze is novel.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           To bypass the novelty question, the Pangolin tool provides a machine learning model that classifies unknown SARS-Cov-2 sequences into an already known lineage, aiming to filter the high volume of sequences produced by genetic surveillance. This classifier uses a one-hot encoding of the SARS-Cov-2 sequence as features and the lineages as labels.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Sequence Representation
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            ﻿
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Inputs used by the different tools represent some of the many ways a sequence can be represented for computational applications. A one-hot encoding of a sequence is perhaps the simplest of the representations. In the case of a biological sequence, each base is changed to a vector of size four, with the kind of base encoded by a 1 at its respective position. Encoding each base in the sequence with the same procedure results in an array of shape (4, sequence length).
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
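A minimal sketch of that encoding in plain Python (it ignores ambiguous bases such as N, which a real pipeline would have to handle):

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA sequence as a (4, len(seq)) matrix of 0s and 1s."""
    return [[1 if base == b else 0 for base in seq] for b in BASES]

# One row per base, one column per position in the sequence.
for b, row in zip(BASES, one_hot("ACCTG")):
    print(b, row)
```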
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Another common representation scheme is the use of k-mers, k-sized fragments of the sequence. These k-mers can be used to define a new one-hot encoding, or the k-mer frequencies can be used as another form of sequence representation.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The frequency of each k-mer depends on how the sequence is divided. If the sequence is divided into non-overlapping fragments, there will be (sequence length // k-mer size) fragments, while a sliding window will produce (sequence length - k-mer size + 1) fragments. If the sliding scheme is used, an interesting property arises: consecutive fragments share k-1 bases, exactly the overlap structure of a De Bruijn graph. By using the De Bruijn graph connectivity scheme, we can add the connectivity relationship between consecutive k-mers to the representation.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
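Both fragmentation schemes, and the k-1 overlap that gives rise to the De Bruijn structure, can be sketched in a few lines:

```python
def kmers(seq, k, sliding=True):
    """Split a sequence into k-mers, either sliding by 1 or non-overlapping."""
    step = 1 if sliding else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]

seq = "ACGTACGT"
frags = kmers(seq, 3)                  # sliding: len(seq) - k + 1 = 6 fragments
print(frags)
print(kmers(seq, 3, sliding=False))    # non-overlapping: len(seq) // k = 2 fragments

# De Bruijn property: consecutive sliding k-mers share k - 1 bases.
print(all(a[1:] == b[:-1] for a, b in zip(frags, frags[1:])))
```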
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_jyAJa_2Mo_sGVFuG.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Another way to encode the connectivity relationship is to divide the sequence into 2k-mers, split each fragment into two k-mers, and add a link between them. Under this scheme the order of the connections is lost, but the frequency of each connection is added to the representation, and larger k-mer fragments can be encoded.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_23PeOq_zllEM8buJ.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Both connectivity schemes aim to encode the relational information between the different symbols or k-mers that exist on a sequence rather than the symbol or k-mer itself.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           By using graphs to numerically define a sequence, we can use many of the different matrix representations of graphs. The node degree of the undirected De Bruijn graph will be equal to two times the frequency of each k-mer, while the adjacency matrix will show the frequency of each connection in the sequence.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
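A sketch of the adjacency-matrix construction for the sliding scheme (a 4^k x 4^k matrix, so k is kept tiny here). One caveat on the degree relationship: it is exact only for k-mers away from the sequence ends, since the first and last fragments have one neighbor instead of two.

```python
from collections import Counter
from itertools import product

def debruijn_adjacency(seq, k):
    """Symmetric adjacency matrix counting consecutive sliding k-mer pairs."""
    nodes = ["".join(p) for p in product("ACGT", repeat=k)]
    idx = {n: i for i, n in enumerate(nodes)}
    adj = [[0] * len(nodes) for _ in nodes]
    frags = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    for a, b in zip(frags, frags[1:]):        # each consecutive pair is one edge
        adj[idx[a]][idx[b]] += 1
        adj[idx[b]][idx[a]] += 1
    return nodes, adj, Counter(frags)

nodes, adj, freq = debruijn_adjacency("ACGTACGT", 2)
i = nodes.index
print(adj[i("AC")][i("CG")], freq["AC"])   # the AC -> CG transition occurs twice
```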
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Low Dimensionality Representation
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           From the graph representation of a sequence, several datasets can be created. The first and simplest might be a dataset containing the k-mer frequencies of each sequence. Stacking the frequencies of all k-mers up to 4-mers for each sequence, then projecting the result down to two dimensions, gives four visible clusters in the plot.
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            ﻿
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-05-09+at+3.29.36-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Using this dimensionality reduction technique results in clear clusters in the data. But a consequence of drastically decreasing the dimensionality of the data is the amount of lost information. Another option to reduce the dimensionality of the data is a variational autoencoder (VAE). In a VAE, the low dimensional representation will be constrained to behave as a normal distribution. Also, the learned representation might contain biologically relevant information.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
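The two ingredients that distinguish a VAE from a plain autoencoder, sampling the latent code via the reparameterization trick and penalizing it towards N(0, 1) with a KL term, can be sketched per latent dimension without any deep learning framework (in a real model, the reparameterization is what lets gradients flow through the sampling step):

```python
import math, random

random.seed(42)

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps, with eps ~ N(0, 1)."""
    eps = random.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ) = 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2)."""
    return 0.5 * (math.exp(log_var) + mu * mu - 1.0 - log_var)

# Hypothetical encoder outputs (mu, log_var) for one latent dimension:
print(reparameterize(0.3, -1.0))
print(kl_to_standard_normal(0.3, -1.0))
```

The KL penalty is what constrains the latent space to behave like a normal distribution, which is why nearby points in it tend to decode to similar sequences.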
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-05-09+at+3.30.29-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           VAE encoding of the k-mer data results in a series of clusters aligned along the x-axis. A couple of the clusters are clearly separated, while others are merged.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            ﻿
           &#xD;
      &lt;/span&gt;&#xD;
      
           To leverage the graph representation of the sequences, a new dataset can be constructed by calculating the difference between the adjacency matrices from both connectivity schemes and rearranging the different matrices into a single one.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_EzlUldpXJLHa5rjF.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This creates a two-dimensional array that can be used as a single-channel image, allowing the connectivity information, as well as larger k-mers, to be encoded. From this dataset, a convolutional variational autoencoder can be used to find a better low-dimensional representation of the data.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-05-09+at+3.31.21-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The low-dimensional space of this new autoencoder behaves similarly to the previous one, except that the clusters are well defined along the x-axis.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Latent Representation
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The main advantage of a VAE is the small latent representation that can be used for other tasks. Clusters obtained from the previous autoencoders could represent a single variant or encode another kind of information. Also, the latent space encodes some meaning inherent to the data. For example, in a faces data set the latent space could encode face expression (happy/sad) or face pose (up, down, left, right).
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Both k-mer frequency and k-mer connectivity-based representations model a sequence as a whole, preventing reconstruction of the sequence itself. However, biases towards specific codons or some other encoded meaning can still be found. In a previous post, PCA analysis of k-mer frequency data resulted in clusters containing sequences from a single geographical origin. Hence it could be that the latent dimensions encode some sort of geographical information, while another option could be some sort of time encoding.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Although those are not the only possible options, they are the ones that can be tested with the available sequence metadata. To visually encode the different features I'm presenting a color encoding scheme. In the case of geographical locations, equal colors represent the same location; however, similar colors will not share any kind of geographical similarity. That will not be the case for time encoding, where similar colors represent closer periods. Time will be encoded as the number of the week, regardless of whether it's the first or the second year of the pandemic, based on the isolation date of the sample.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Also, the different models will be renamed: the PCA model will be referred to as Frodo, the simple variational autoencoder as Sam, and the convolutional autoencoder as Bilbo.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Geographical encoding
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           In terms of geographical encoding neither Frodo, Sam, nor Bilbo can find any specific pattern. This could be in part due to the heavily biased nature of the dataset. Although it contains sequences from different parts of the world, about 85% of the data is from the USA population.
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Time encoding
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Time-wise, Frodo can separate the pandemic into two specific periods. There is some mixing, which could be due to the cyclical time encoding: under this scheme the last weeks of the year sit close to the first ones.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
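Frodo is plain PCA, so a minimal NumPy-only sketch looks like the following (the one-hot preprocessing and the fixed sequence length are assumptions; the post does not specify them):

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq, length):
    # Fixed-length one-hot encoding; ambiguous bases (e.g. N) stay all-zero.
    arr = np.zeros((length, 4), dtype=np.float32)
    for i, base in enumerate(seq[:length].upper()):
        j = BASES.find(base)
        if j >= 0:
            arr[i, j] = 1.0
    return arr.ravel()

def pca_2d(X):
    # Plain PCA via SVD on mean-centered data: project onto the top 2 components.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T
```

Each scatter plot in this post is then just the 2-D projection colored by a metadata field.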
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-05-09+at+3.35.19-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Meanwhile, Sam can also separate the data into specific periods, while adding structure and constraints to the representation. The x-axis of this latent space appears to encode some form of time.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-05-09+at+3.35.57-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Bilbo is also able to encode some kind of temporal or seasonal information into the representation. However, each cluster contains a mixture of both periods; the trained network cannot separate the two time periods into distinct clusters. Clusters at the extremes of the x-axis do begin to separate into low and high time encodings, so further improvement of this model could lead to an alternating pattern of low and high time encoding.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-05-09+at+3.36.51-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Cluster Analysis
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Performing K-means clustering on Frodo confirms the initial observations about the time encoding: one cluster contains sequences sampled in the first half of the year and the other contains the remaining second half. The two clusters begin to merge around the middle of the year.
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
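A minimal K-means (Lloyd's algorithm) over the 2-D latent points can be sketched as follows; the value of k, the initialization, and the iteration count are assumptions, since the post does not state them:

```python
import numpy as np

def kmeans(X, k=2, iters=50, seed=0):
    """Lloyd's algorithm: alternate point assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels, centers
```

With k=2 on the Frodo projection, the per-cluster isolation-date histograms described above fall out directly from the returned labels.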
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-05-09+at+3.37.45-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Sam, by adding structure to the data, shows each cluster moving through time, with two transition steps between the most separated and most populated clusters.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-05-09+at+3.38.26-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Although Bilbo is unable to separate the data into time-dependent clusters, the histograms of each cluster confirm the time-encoding visualization: each cluster within Bilbo contains sequences from both periods.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-05-09+at+3.39.13-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Variants
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Up to this point, the models appear to encode time or seasonal information within the SARS-CoV-2 sequence. However, the remaining dimension could also encode information about the variants. Two approaches are used to test this idea: in the first, each Pangolin lineage is assigned its own color; the second is a binary encoding in which the A and B lineage branches are represented by different colors, since the dataset consists mostly of sequences from the A or B lineages.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
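The two schemes can be sketched like this (the color names and helper functions are illustrative, not the post's actual code):

```python
def branch_color(lineage):
    # Binary scheme: color by main Pangolin branch (dataset is mostly A.*/B.*).
    return "orange" if lineage.upper().startswith("A") else "blue"

def lineage_palette(lineages):
    # Per-lineage scheme: give each distinct lineage its own palette index,
    # which can then be mapped through any categorical colormap.
    unique = sorted(set(lineages))
    return {name: i for i, name in enumerate(unique)}
```

The binary scheme collapses the fine-grained lineage labels into the A-vs-B contrast that dominates the dataset.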
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Using that simple color encoding, all three models can encode the variants regardless of the coloring scheme, and this encoding closely resembles the temporal encoding found before.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This would suggest that each branch in the Pangolin classification represents a variant with better evolutionary performance at a particular time of the year. If that is the case, there should be a difference in the proportions of specific nucleotides used in the sequence. Codon usage bias and GC content are examples of features used to test for biological adaptation, but they need to be calculated from the open reading frames to give accurate results, and constraining how the sequence is analyzed might discard many of the features encoded in it. A simpler approach is therefore to check the frequency of each base within the sequence.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Calculating the histogram of nucleotide usage for one cluster within Sam and comparing it to the remaining ones reveals slight shifts in frequency: SARS-CoV-2 sequences collected in the second half of the year contain less cytosine than those from the rest of the year, while the opposite is true for thymine/uracil.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
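The base-frequency check is straightforward to sketch (a hypothetical helper, not the post's actual script):

```python
from collections import Counter

def base_frequencies(seq):
    """Fraction of A, C, G, T in a sequence, ignoring ambiguous bases."""
    counts = Counter(c for c in seq.upper() if c in "ACGT")
    total = sum(counts.values()) or 1  # guard against empty input
    return {base: counts[base] / total for base in "ACGT"}
```

Computing these fractions per cluster and histogramming them is all the comparison above requires.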
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Performing a similar analysis on Bilbo reveals similar shifts in cytosine and thymine/uracil, although the shifts appear greater than those observed in Sam. This still leaves the x-axis without a clear interpretation of what kind of information it encodes.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           As the data contains a high number of sequences from US Covid-19 cases, perhaps the x-axis encodes some geographical feature specific to that location. Plotting the geolocation of each cluster from the first and second half of the year produces an alternating pattern. In the figures, each panel represents a cluster, colors represent elevation, and the x-axis and y-axis represent longitude and latitude.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           However, this alternating pattern does not match the clusters within Bilbo. Still, this simple observation hints that the x-axis within Bilbo encodes some kind of geographical or environmental variable.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Data biases
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Even though the three models can encode temporal and environmental information, their ability to generalize and support broader conclusions may be hindered by the biases present in the data. Data from the first year of the pandemic represents only 10% of the total, which skews the analysis heavily towards the second year.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_tg8izUpzjHLMLdFOBJBWkg.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Geographically, the data clusters into heavily populated areas inside the US. I was also unable to download the geolocation of some of the cities in the metadata. This limits the predictive value of Bilbo, the only model to show some resemblance to an environmental encoding.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Even so, these models can improve our understanding of how the Covid-19 pandemic developed, and they describe the pandemic in the US with reasonable accuracy.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           From the three models, three key insights emerge: a seasonal component, perhaps encoded as the shift from the A lineage to the B lineage; a shift in cytosine use from high content in the first half of the year to low content, mirrored by thymine/uracil; and a hint of geographical or environmental encoding.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Seasonality and adaptation
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Seasonal trends are a hallmark of many infectious diseases. Specific to the COVID-19 outbreak, Caetano-Anollés and collaborators showed the existence of a cyclical pattern of mutations in the RBD domain of the spike protein. This cyclical pattern is also reflected inside the SARS-CoV-2 sequence.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Another cyclical mutation pattern can be inferred from the genomic surveillance reports in Mexico. For the first half of the year most of the sampled sequences belonged to the B main lineage, while in the second half the B lineage began to be displaced by the A lineage.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The reasons behind the shift from one lineage to another, and behind seasonal trends in general, are topics of ongoing research. One hypothesis is seasonal adaptation of the virus to the host. In lung tissue, around 2000 different genes are upregulated by night and 1500 by day, and this behavior also varies across the seasons: lung tissue differentially expresses around 20% of its genes in at least one season. The virus, by adapting to the resources available for copying its genetic material, ends up moving from one lineage to another. This leaves a period of adaptation to the newly available resources; once that adaptation is complete, the virus can spread more efficiently.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           If this adaptation process takes place, it could occur within a single host or in small increments across different hosts, and could explain the oscillatory behavior of the different pandemic waves, making it a natural, cyclical part of the virus population cycle. A necessary condition for this model is the presence of coinfections inside a single host. Cases where coinfections were detected remain scarce, but reports of coinfections in different parts of the world are becoming more common.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Cattoir reported a single case of coinfection in a Belgian patient, and Mohammed Baqur S. Al-Shuhaib reported several cases of coinfection in Babylon, Iraq. Akimkin described a patient whose samples were taken at different times during the infection; both samples were sequenced, and the main lineage differed between them.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Coinfection with different strains of the virus raises concern among scientists, as recombination events could take place, leading to new and different strains. Wu and collaborators collected sequence reads and analyzed them for signs of coinfection, finding samples with evidence of coinfection by two or three different lineages. Yet the rate of coinfection events drops drastically in the middle of the year.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Even though that finding could contradict some of the previous ideas, reanalyzing sequence reads for signs of coinfection could help establish whether a yearly dynamic adaptation process takes place.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Environmental and seasonal treatments
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Nucleoside analogs are among the most common pharmacological treatments for viral diseases. In the case of Covid-19, Remdesivir is perhaps the most widely known antiviral. In a meta-analysis, Hsueh showed that Remdesivir is associated with a better clinical outcome, although the numerical reduction in mortality was not statistically significant. If we analyze the overall content of adenine, the nucleoside that Remdesivir mimics, we can see a slight deviation in the histogram, pointing to a bimodal distribution.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_f9cGDQmRHr6Dri33.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           But if we look at the adenine distribution within the different clusters in Bilbo, only one cluster contains sequences with both high and low adenine content. This could indicate that a subset of Covid-19 sequences is more susceptible to Remdesivir, with that susceptibility specific to a particular environment.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_y4swD4sujsYtbSk5wbQTyA.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           A clear difference in nucleotide usage between the different clusters is perhaps the most useful piece of information, as this shift could be exploited for treatment as described before. Another treatment option may lie in the shift in cytosine or thymine/uracil. Literature on the use of cytosine analogs for Covid treatment is scarce, but a reduced hospitalization risk among HIV patients hints at the possibility of using such antivirals: studies in Spain and Germany found a lower risk of hospitalization, with most cases classified as mild, and in both studies the participants were on antiviral treatment containing at least one cytosine analog. In the case of thymine/uracil, Sofosbuvir is the main antiviral being studied; a meta-analysis by Chih-Cheng Lai showed that Sofosbuvir increased the recovery rate and decreased mortality.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Seasonal and environmental patterns inside the SARS-CoV-2 sequence can be taken into account to further test the efficacy of different treatments. Clinical trials and continued research are the only way to fine-tune possible treatments.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Model reproducibility
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Throughout this post, a series of models were analyzed to understand how a low-dimensional representation of sequence data encodes biologically relevant information. Each model can be trained without highly specialized hardware, and the models have already shown their ability to provide insights from the sequences. For the two variational models, only one latent dimension was more or less characterized, which leaves plenty of room to add more metadata to newly sequenced SARS-CoV-2 strains or to the existing ones. Improvement or development of new generative models could reveal patterns that lead to actionable decisions, from outcome prediction to resource management.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Scripts used to filter the sequences, create the datasets, and train the neural networks can be found on my GitHub by clicking 
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://github.com/wmac131/genome_sequencing" target="_blank"&gt;&#xD;
      
           here
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           .
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Mon, 20 Dec 2021 19:52:32 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/methods-for-genetic-classification-of-covid-part-2</guid>
      <g-custom:tags type="string">cluster,cnn,data analysis,redcuction,bioinformatics,machine learning,classification,Python</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_23PeOq_zllEM8buJ.webp">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/0_23PeOq_zllEM8buJ.webp">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>The Mind-Blowing Math Behind Chess</title>
      <link>https://www.williammcnamara3.com/the-mind-blowing-math-behind-chess</link>
      <description>The game you're playing has probably never been played before.</description>
      <content:encoded>&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2023-08-03+at+12.06.11+AM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;h3&gt;&#xD;
  
         The game you're playing has probably NEVER been played before.
        &#xD;
&lt;/h3&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;div&gt;&#xD;
    
          When I was little, my cousins and I would bond with our grandfather through chess. He was very good, and he challenged us to think not just about the next move, but the one after that. It's a valuable lesson for anyone fortunate enough to play this ages-old game. But as I got older, I thought about how formulaic it is. Games like Tic-Tac-Toe and Connect Four have been solved: with perfect play, the first player always wins Connect Four, and Tic-Tac-Toe is at worst a draw. Surely somebody could do that with chess if they haven't already, right?
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          Well let's break it down.
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          In order to "solve" chess, you'd need to find a strategy that wins (or at least never loses) in every game. How many possible games of chess are there? In 1950 the American mathematician Claude Shannon came up with a way to answer this question. Shannon estimated that there are on average about 30 legal moves from any position in a game of chess; your opponent then has about 30 possible replies, so each move-and-response pair creates about 900 combinations. (In chess terminology a single move by one player is a "ply", so a full move is two plies.)
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          Shannon then estimated that a typical game runs about 40 moves per player, which is 80 plies. So the equation is simple: roughly 30 options per ply raised to the 80th power, which comes out to a
          &#xD;
    &lt;b&gt;&#xD;
      
           staggering 10^120 possible games of chess
          &#xD;
    &lt;/b&gt;&#xD;
    
          , a figure known as Shannon's Number.
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          This number is massive, billions of times greater than the number of atoms in the observable universe (about 10^80). Say one were to build a chess computer and ask it to consider all outcomes and report the best first move. It would still take that computer over 10^100 years just to settle the first move, by which time most physicists agree the universe will have long since ended.
         &#xD;
  &lt;/div&gt;&#xD;
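These estimates are easy to check with Python's exact integer arithmetic (note that 30^80 works out nearer 10^118; Shannon's often-quoted 10^120 comes from rounding each move-pair up to about 1,000 options):

```python
# Shannon's back-of-the-envelope numbers, checked with exact integers.
MOVES_PER_PLY = 30           # ~30 legal moves from a typical position
PLIES_PER_GAME = 80          # ~40 moves per player

games = MOVES_PER_PLY ** PLIES_PER_GAME
atoms = 10 ** 80             # rough count of atoms in the observable universe

assert games == 900 ** 40    # 900 combinations per move-pair, 40 pairs
print(len(str(games)) - 1)   # order of magnitude: 118
```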
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          This opened up another question: how many moves into a game can I reasonably assume that this game has never been played before? To answer it, we can borrow parts of Professor Shannon's methodology. Let's build a simple model where Pr(u) is the probability that you're playing a unique game after m moves have been played, such that:
         &#xD;
  &lt;/div&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2023-08-02+at+10.10.12+PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Here the function P(m) represents the number of games that have been played to move m, and T(m) the number of possible games. Their ratio is X, and for all values of X less than a threshold z we will consider the game unique.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Chess is about 500 years old. Let's assume that 1% of all the people who have lived since then played a game of chess every other day for 50 years. That means around 10 trillion games have been played (10^13). For starters, that means an unimaginably small fraction of the total playable games of chess has ever been played (over 100 zeroes after the decimal point). But that's when m gets up to 80. How about the first few moves?
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           If we model how many potential combinations there are for, say, the first 10 moves, we get the following table.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Next we have to model how many games have been played for each value of m. Let's use a normal distribution for this function, such that:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2023-08-02+at+10.54.58+PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           For N we can use our previously calculated estimate of around 10 trillion played games. For the distribution we can use Professor Shannon's mean of 80 plies and a standard deviation of 20. From this we can calculate the probability that your game is unique:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
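Putting the whole model together as a sketch (T(m) = 30^m uses Shannon's branching estimate; N, the mean, and the standard deviation are the estimates above; the function names are mine):

```python
from math import erf, sqrt

N_PLAYED = 10 ** 13    # ~10 trillion games ever played (estimate above)
MEAN, SD = 80, 20      # assumed normal distribution of game length, in plies

def games_reaching(m):
    # P(m): games at least m plies long survive to move m
    # (normal-tail probability via the error function).
    tail = 0.5 * (1 - erf((m - MEAN) / (SD * sqrt(2))))
    return N_PLAYED * tail

def possible_games(m):
    # T(m): ~30 legal options per ply.
    return 30 ** m

def prob_unique(m):
    # Pr(u): crude estimate; once played games are a sliver of possible
    # games, the position you reach is almost certainly new.
    return max(0.0, 1 - games_reaching(m) / possible_games(m))
```

At m = 10 this gives T(10) = 30^10, about 590 trillion possible games against roughly 10 trillion played, matching the numbers discussed below.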
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           As you can see, nearly all of the 10 trillion games ever played have made it to move 7, so the likelihood that your game has already been played is high. But notice the ballooning denominator: it will pass that 10 trillion mark soon. So what if we look at the next 7 moves?
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            By the time you get to move 10 there are 590 trillion possible games, against 10 trillion played all time, so there's now a high likelihood that your game is unique.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Once you reach the 11th and 12th moves of the game, it is a virtual certainty that your game of chess has never been played before
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      
           !
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Now obviously this isn't entirely true: if you're playing at high levels of competitive chess, you may know theoretical openings and defenses that extend well past the 11th and 12th moves. But probabilistically, games that reach that point have an extremely high likelihood of being unique.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            I decided to test my statistics by using
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="http://chess.com"&gt;&#xD;
      
           chess.com
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            's
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://www.chess.com/analysis?tab=analysis" target="_blank"&gt;&#xD;
      
           analysis engine
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            . The engine has over a billion stored games from its players, and you can see how many times your specific position has been reached and what the most common next move is. I decided to play the most theoretically common sequence of moves, the Sicilian Defense, and see how far I got before reaching something new. I got 36 moves in before the most common line became something the engine had never seen, so that is an extreme upper bound. When I went back and chose even the second most common next move, it took only 12 moves to reach a unique game, confirming that even
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      
           slightly
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            varying from the common theoretical line almost always puts a chess player in the realm of unique games by move 12.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           So for all you non-grandmasters out there: the next time you're playing chess and you're 15 moves in, stop to think for a moment that you're probably facing a position that has never been seen before, and may well never be seen again! It kind of makes the moment more interesting.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Sat, 05 Jun 2021 04:07:53 GMT</pubDate>
      <author>183:920405337 (William McNamara)</author>
      <guid>https://www.williammcnamara3.com/the-mind-blowing-math-behind-chess</guid>
      <g-custom:tags type="string">chess,data analysis,personal</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2023-08-03+at+12.06.11+AM.png">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2023-08-03+at+12.06.11+AM.png">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>Methods for the Genetic Classification of Covid</title>
      <link>https://www.williammcnamara3.com/methods-for-the-genetic-classification-of-covid</link>
      <description>Part 1 - Exploring and classifying more covid genome data</description>
      <content:encoded>&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           An exploratory analysis
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           About a year has passed since the first SARS-CoV-2 sequence was made publicly available, and new sequences from outbreaks around the globe have since been published. GISAID and NCBI are two of the most widely used services for uploading and accessing SARS-CoV-2 sequences for scientific research.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Basic Characteristics
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The sequences were downloaded from the NCBI COVID-19 resources on February 19 in FASTA format. They were then loaded with Biopython, and only complete genomes were selected using regular expressions: the description line of each FASTA record was scanned for the phrase “complete genome”, with ‘(.*)’ used to ignore the surrounding characters.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
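That filtering step can be sketched as follows. The accession numbers and descriptions below are made up for illustration; in the post they come from the records loaded with Biopython.

```python
import re

# Hypothetical FASTA description lines standing in for the real
# NCBI records loaded with Biopython's SeqIO.
descriptions = [
    "MT000001.1 SARS-CoV-2 isolate A, complete genome",
    "MT000002.1 SARS-CoV-2 isolate B, partial cds",
    "MT000003.1 SARS-CoV-2 isolate C, complete genome",
]

# Scan each description for the phrase "complete genome",
# with (.*) ignoring the surrounding characters.
pattern = re.compile(r"(.*)complete genome(.*)")
complete = [d for d in descriptions if pattern.match(d)]
print(len(complete))  # 2
```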
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           From that selection, the majority of the sequences have a length of around 30,000 bp (base pairs). Removing outliers, defined as sequences shorter than 10,000 bp, shows that most sequences fall between 29,500 and 30,000 bp.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-03-23+at+7.26.36-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Those selected genomes contain the DNA sequence of SARS-CoV-2, which should use only four letters: A for adenine, C for cytosine, G for guanine, and T for thymine. Nevertheless, FASTA sequences sometimes contain additional characters, each denoting a subset of the possible DNA bases. For example, the letter N in a FASTA sequence means any of the four bases can occur at that position. To check which sequences use the standard alphabet, each sequence is split into a list of single characters and the unique elements of that list are collected. If that set of unique elements has exactly four members, the sequence is said to use the canonical alphabet.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
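A minimal sketch of that check, assuming the sequence is held as a plain string:

```python
def is_canonical(seq):
    # Canonical: the set of unique characters is exactly the four DNA bases.
    return set(seq) == {"A", "C", "G", "T"}

print(is_canonical("ACGTTGCA"))   # True
print(is_canonical("ACGTNNGCA"))  # False: 'N' is an ambiguity code
```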
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           To accelerate the process, the iteration was parallelized with the multiprocessing package, returning an index with the locations of the canonical and non-canonical sequences. That analysis yields 30,211 canonical and 13,460 non-canonical sequences. Because a non-canonical sequence with even a couple of ambiguity characters can expand into a set of sequences far larger than the canonical set, the non-canonical sequences will not be analyzed further. The canonical set will therefore be taken as the ground truth.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
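The parallel scan might look like the sketch below. It uses `multiprocessing.dummy` (the thread-backed Pool with the same API) so the snippet runs anywhere; the post uses the process-based `multiprocessing` package.

```python
from multiprocessing.dummy import Pool  # same API as multiprocessing.Pool

def is_canonical(seq):
    # True when the sequence uses only the four standard DNA letters.
    return set(seq).issubset({"A", "C", "G", "T"})

# Toy sequences; the middle one contains the ambiguity code 'N'.
sequences = ["ACGT" * 10, "ACGTN" * 8, "AACCGGTT" * 5]

with Pool(2) as pool:
    flags = pool.map(is_canonical, sequences)

# Index of the canonical sequences, as described in the post.
canonical_index = [i for i, ok in enumerate(flags) if ok]
print(canonical_index)  # [0, 2]
```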
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Most of the canonical sequences come from the USA or Australia, which together account for about 80% of the total. Of the remaining locations, the Netherlands and France contribute the most new SARS-CoV-2 genomes. It’s important to point out that for many sequences the location in the description is that of the outbreak rather than the country, or uses a different abbreviation for the country; all such sequences were classified as Non-Standard.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-03-23+at+7.26.43-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The complete genome size reported from each country has a mean of around 29,800 bp, with the USA, Australia, and the Non-Standard locations showing the highest variability in size, skewed towards smaller complete genomes.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-03-23+at+7.26.50-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           K-mer
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Like the n-gram model of early computational linguistics, a k-mer is a sub-sequence of k elements within a biological sequence. The division scheme used to obtain the k-mers affects the number and kind of k-mers found in a sequence. The non-overlapping scheme divides the sequence into k-size fragments, so the total number of k-mers is the length of the sequence divided by k. The sliding scheme instead takes a k-size fragment and slides it along the sequence one character at a time, yielding a number of k-mers equal to the length of the sequence minus k plus one. The sliding scheme will be used to obtain the k-mers here.
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
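The sliding scheme itself is a one-liner:

```python
def sliding_kmers(seq, k):
    # One window per starting position: len(seq) - k + 1 k-mers in total.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(sliding_kmers("ACGTAC", 3))  # ['ACG', 'CGT', 'GTA', 'TAC']
```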
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Now that the k-mers can be obtained for each sequence, the next step is to determine the appropriate k-mer length. Plotting the number of unique k-mers of a given size generated by the data set shows that the 6-mer is still around 80% away from the theoretical maximum number of k-mers, and from that point forward the number of unique k-mers increases rapidly, reaching the theoretical maximum. To prevent generating a data set with more k-mer features than samples, the analysis will go up to at most the 5-mer.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-03-23+at+7.26.56-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
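The curve behind that decision can be reproduced by counting the unique k-mers observed against the theoretical maximum of 4 to the power k. A toy sketch with two short made-up sequences:

```python
def unique_kmer_fraction(seqs, k):
    # Fraction of the 4**k theoretically possible k-mers seen in the data.
    seen = set()
    for seq in seqs:
        seen.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return len(seen) / 4 ** k

# Toy sequences standing in for the genome data set.
seqs = ["ACGTACGTGGTTAACC", "TTGGCCAACGTACGTA"]
print(unique_kmer_fraction(seqs, 1))  # 1.0: all four bases appear
print(unique_kmer_fraction(seqs, 3))  # well below 1.0 for short toy data
```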
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           To analyze the unique k-mers, the frequency of each k-mer is measured across the data set. First, a function is defined to count the frequency of each k-mer in a sequence; that function is then parallelized for fast k-mer frequency calculation.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
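The counting function can be a thin wrapper around `collections.Counter`; the post then parallelizes it across sequences with the multiprocessing package.

```python
from collections import Counter

def kmer_frequencies(seq, k):
    # Frequency of every sliding k-mer in one sequence.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

freqs = kmer_frequencies("ACGTACGT", 2)
print(freqs["AC"], freqs["TA"])  # 2 1
```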
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Now, with the k-mer dataset created, it’s time to look for patterns in the data. A quick way to look for patterns is a simple scatter plot; however, since the resulting data set is multidimensional, a PCA projection is used instead. The following shows the PCA projection of the 1-mer, then the 1-mer plus the 2-mer, and so on for the available data. Notice that as k increases, the data points start to form distinct clusters.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-03-23+at+7.27.03-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
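A minimal version of that projection, assuming scikit-learn and a random toy count matrix in place of the real k-mer data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for the k-mer data set: rows are sequences,
# columns are k-mer counts (16 columns, e.g. all 2-mers).
rng = np.random.default_rng(0)
counts = rng.poisson(lam=20, size=(50, 16)).astype(float)

# Project the multidimensional counts onto two components for plotting.
projected = PCA(n_components=2).fit_transform(counts)
print(projected.shape)  # (50, 2)
```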
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           To analyze the clusters, labels are obtained with the DBSCAN algorithm, using the elbow method to determine the minimum distance between neighbors (DBSCAN’s eps parameter).
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-03-23+at+7.27.13-PM.png" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-03-23+at+7.27.19-PM.png" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-03-23+at+7.27.25-PM.png" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
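A sketch of that clustering step. Reading the elbow of the k-distance curve is done by eye in practice; here a simple percentile of nearest-neighbor distances stands in for it, on toy 2-D blobs.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Two toy blobs standing in for the PCA-projected k-mer data.
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0.0, 0.3, (40, 2)),
                    rng.normal(5.0, 0.3, (40, 2))])

# k-distance curve: each point's distance to its nearest neighbor.
# The bend (elbow) of the sorted curve suggests a value for eps.
dists, _ = NearestNeighbors(n_neighbors=2).fit(points).kneighbors(points)
eps = float(np.percentile(dists[:, 1], 90))

labels = DBSCAN(eps=eps, min_samples=5).fit_predict(points)
print(sorted(set(labels.tolist())))  # label -1, if present, marks outliers
```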
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Each color represents a well-defined cluster, while black represents outlier samples. Of those clusters, two are selected and analyzed for differences between them. The particular reason these two clusters were chosen will be explained later.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The clusters
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The first cluster, cluster one, contains 69 sequences, while cluster two contains 7,978. The mean sequence size of cluster one is around 29,520 bp, with few outliers, while the size distribution of cluster two appears to be bimodal, with peaks around 29,810 and 29,840 bp.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-03-23+at+7.27.30-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           To compare the samples within each cluster pairwise, the Euclidean distance, correlation, cosine similarity, and city block distance are used as similarity measures. Of the four, only the correlation shows a single-peaked distribution; the remaining three are bimodal, although the second peak is several orders of magnitude smaller than the main one.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-03-23+at+7.27.35-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
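All four measures are available through SciPy's `cdist`; a sketch on toy vectors:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Toy k-mer count vectors standing in for one cluster's samples.
rng = np.random.default_rng(2)
vectors = rng.random((10, 8))

# Pairwise matrices for the four measures used in the post.
metrics = ("euclidean", "correlation", "cosine", "cityblock")
pairwise = {m: cdist(vectors, vectors, metric=m) for m in metrics}
for name in metrics:
    print(name, pairwise[name].shape)  # each is a 10 x 10 matrix
```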
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           With the same approach, both clusters are compared with the Wuhan outbreak reference genome to assess their dissimilarity from it. Cluster one is about as dissimilar to the reference genome as it is to cluster two, since most of the similarity measures fall in a similar range to those from the previous analysis.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-03-23+at+7.27.40-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Cluster two is more similar to the reference genome: its mean Euclidean distance, city block distance, and cosine similarity are around half the values found for cluster one. However, the correlation drops to a mean of around 0.4, which could mean that although the samples are close, there is some variation between the different k-mers.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-03-23+at+7.27.44-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           K-mer and reading frames
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The k-mer model can extract information with biological and functional meaning from sequences. To better understand the intersection between the k-mer model and biological sequences, let’s cover the basics of DNA and gene (protein) expression.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           A DNA sequence contains the information to make the different proteins that form our cells, tissues, and body. DNA is transcribed into messenger RNA (mRNA), which is then read in the ribosome in fragments of three bases, or triplets. The non-overlapping 3-mer therefore has a well-defined biological meaning; moreover, a 4-mer contains two overlapping sets of triplets, and following the same idea, a 5-mer contains three. Each triplet within a 4-mer or 5-mer corresponds to a different way the sequence can be read; those possibilities are known as reading frames. Thus, analyzing up to the 5-mer, or the sliding 3-mer, effectively encodes information for up to three different reading frames.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Aside from methionine and tryptophan, which are each encoded by a single triplet, every amino acid is encoded by more than one triplet, or codon. Different codons are used at different frequencies, a phenomenon called codon usage bias (CUB). CUB is a natural phenomenon that has been studied in many model organisms, and changes in the CUB of a pathogen such as a virus reflect some form of adaptation by that pathogen.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           To analyze CUB it’s important to have the open reading frames of the SARS-CoV-2 virus. An open reading frame (ORF) is a part of a sequence bounded by a start codon and a stop codon. As simple as that sounds, finding ORFs is difficult, especially in a virus, where space limitations in the genome push the virus toward overlapping reading frames. SARS-CoV-2 contains several overlapping ORFs, and more may well be found with time. Hence, to work around the overlapping ORFs, the sliding 3-mer will be used to explore a possible change in CUB. Since that analysis is not equivalent to a true CUB analysis, it will be called pseudo-CUB.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The distribution of the different triplets found in cluster one and cluster two shows no extreme deviations, with only around a 10% deviation from the mean across triplets.
           &#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-03-23+at+7.27.50-PM.png" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Comparing cluster one with cluster two on pseudo-CUB shows greater usage of ‘T’-containing codons in cluster one, while cluster two appears to favor ‘C’- or ‘G’-containing codons. A ‘GC’-biased CUB is often associated with greater stability of the resulting DNA strand, since G–C pairs form three hydrogen bonds; however, ‘GC’ content is not the only factor that contributes to DNA stability.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-03-23+at+7.35.28-PM.png" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
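The pseudo-CUB comparison boils down to normalized sliding 3-mer profiles per group. A toy sketch with made-up T-heavy and GC-heavy sequences (not the real clusters):

```python
from collections import Counter

def triplet_profile(seqs):
    # Normalized sliding 3-mer frequencies pooled over a group of sequences.
    counts = Counter()
    for seq in seqs:
        counts.update(seq[i:i + 3] for i in range(len(seq) - 2))
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

cluster_a = ["TTTATTGTTA", "ATTTTAGTTT"]  # toy, T-heavy
cluster_b = ["GGCGCCGGCC", "CCGGGCGCCG"]  # toy, GC-heavy
pa, pb = triplet_profile(cluster_a), triplet_profile(cluster_b)

# Per-triplet frequency difference between the groups; the largest
# positive entries are the triplets cluster A favors over cluster B.
diff = {t: pa.get(t, 0.0) - pb.get(t, 0.0) for t in set(pa) | set(pb)}
top = sorted(diff, key=diff.get, reverse=True)[:3]
print(top)
```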
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Comparing each cluster with the reference sequence shows that for cluster two few bars exceed 0.2, while cluster one shows a greater difference. Cluster one favors ‘T’-containing codons relative to the reference, while cluster two shows a mild increase across most codons, with about one-third of them showing a small increase in frequency.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-03-23+at+7.28.01-PM.png" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Regardless of its origin, CUB regulates translational efficiency (going from mRNA to protein), since many translational resources are shared inside the cell; the sequences of a cell’s highly expressed genes therefore determine that cell’s CUB. In a viral infection two CUBs are in play: that of the host organism and its target tissue (for SARS-CoV-2, the respiratory system), and that of the virus.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Changes in the SARS-CoV-2 genome that affect CUB are not necessarily mutations that create a new functional viral protein. Mutations that shift CUB can theoretically enhance translation efficiency, meaning fewer virions (viral particles) are needed to infect and cause disease. Does that mean SARS-CoV-2 will eventually have a CUB equal to that of the respiratory system? Perhaps, but that would be detrimental to the virus. Most viruses already have a strong promoter, a piece of sequence that drives viral translation, and a strong promoter combined with high translational efficiency would exhaust the cell’s translational machinery. From the virus’s point of view, a non-optimal CUB is needed.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Clusters Origin
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Now, with a better understanding of the SARS-CoV-2 genome and how changes to it can affect the spread and adaptation of the virus, let’s find out the origin of the clusters. Four clusters and a set of outliers were found with the DBSCAN algorithm; they will be named A, B, C, and D.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Cluster A contains sequences mainly from the USA, followed by non-standard locations, the Netherlands, and other countries.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-03-23+at+7.28.07-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Cluster B contains sequences from a non-standard location; a quick examination of the description attribute of the FASTA sequences shows that they are all from PER, that is, Peru.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-03-23+at+7.28.11-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Cluster C contains a mixture of non-standard locations; nonetheless, the majority of its sequences are from the USA.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-03-23+at+7.28.15-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Cluster D contains only sequences from Australia; moreover, most of the Australian sequences fall in this cluster.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-03-23+at+7.28.21-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Around 20 sequences did not meet the criteria to be assigned to any cluster, though most of them lie close to a particular one.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-03-23+at+7.28.25-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Of those four clusters, cluster one is the same as cluster B and contains only sequences obtained from Peru, while cluster two is the same as cluster D and contains only sequences from Australian isolates. These two clusters were selected because of the difference in how the pandemic affected the two nations. Peru faced healthcare system saturation during the first COVID-19 wave, with a total of around 1.3 million cases. Australia, by contrast, opted for a strict lockdown and closed its borders to foreigners on the 20th of March; its total of around 29,000 cases means that roughly one in every four COVID-19 cases was sequenced for further analysis.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Although there is a noticeable socioeconomic difference between the two nations, both applied similar measures to alleviate economic and healthcare pressure. Both released emergency funds for businesses and workers, enforced social distancing, limited the number of participants in social gatherings, and used curfews to different extents. Perhaps due to cracks in the system, miscalculations, or other sociological problems, the two nations nonetheless ended up on opposite extremes of the COVID-19 pandemic.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Leaving aside how each nation managed the pandemic and looking only at total case counts, the difference is well over an order of magnitude. With more infected individuals, the probability that SARS-CoV-2 mutates increases; mutations that enhance host adaptation, such as a change in CUB, are therefore more likely to appear in a population with higher exposure to the pathogen.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Classification and K-mers
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           A model that can, to some extent, map a sequence to its geographical origin would be helpful for many applications. Nonetheless, the k-mer model is not exact: of the four clusters, two had a common origin while two were a mixture of origins. Hence, some scenarios are worth discussing.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h5&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h5&gt;&#xD;
  &lt;h5&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Geographical origin.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h5&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           It could be said that cluster C perhaps contains sequences from Mexico, a neighbor of the USA, and that cluster A is a set of misclassified sequences. However, while cluster C does contain sequences from Mexico, it also contains sequences from Chile and other locations, so a common geographical origin might not explain the clusters.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h5&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h5&gt;&#xD;
  &lt;h5&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Common outbreak.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h5&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Cluster A contains the majority of the sequences and also contains the reference sequence. One possibility is that the clusters surrounding cluster A are simply outliers from cluster A that propagated through different countries.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h5&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h5&gt;&#xD;
  &lt;h5&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Common Adaptation.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h5&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           One proposed explanation for CUB is selection theory: codon bias contributes to the efficiency of protein expression, and that contribution improves the fitness of the individual, leading to positive selection. In the case of SARS-CoV-2, small changes in the viral sequence might be driven toward replicating the host’s CUB, allowing the virus to replicate more efficiently in the host without any modification at the protein level.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h5&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h5&gt;&#xD;
  &lt;h5&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Mix of everything.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h5&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Given the high mutation rate of RNA viruses, someone infected with SARS-CoV-2 could travel to another part of the world and seed new SARS-CoV-2 mutants better adapted to that region, and that variant could then become dominant in that geographical location.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Surveillance, Outbreak characterization, and K-mers
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Even though the exact interpretation of the resulting clusters is open to discussion, one key result of the pseudo-CUB analysis is that some triplets appear at greater frequency in one cluster than in the other. More generally, a set of k-mers appears at different frequencies from one cluster to another. Thus, when the k-mer data are projected by PCA, the projected space represents the data as separated as possible, and the PCA projection of k-mer data can be used for epidemiological surveillance.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           If the k-mers of the SARS-CoV-2 viruses affecting a specific population have already been measured and characterized, then it becomes possible to detect new variants from local outbreaks, or the introduction of variants from other parts of the globe.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           For example, if only the samples from cluster two are projected by PCA, two clusters are obtained. From the 1-mer up to the 4-mer, two well-defined clusters can be observed, while the 5-mer exhibits around six defined clusters. It would be interesting to know whether those clusters represent the two SARS-CoV-2 waves that Australia has faced. Now, if some samples from random locations are added to the cluster-two data, with two to twenty new samples it is impossible to differentiate them from the cluster-two data.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-03-23+at+7.28.32-PM.png" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           But from 200 to 2,000 new samples, well-defined structures appear: with 200 new samples the 4-mer and the 5-mer can differentiate the original samples from the outliers, and with 2,000 samples the 3-mer, 4-mer, and 5-mer can all differentiate the original samples from the outliers.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-03-23+at+7.28.39-PM.png" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           With a couple of hundred new sequences, the direction of the pandemic in a given location can be tracked. Both the emergence of new strains and the consolidation of an existing one can be examined with the k-mer data.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Pros and Cons
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           A neglected fact about this data set is how unbalanced it is regarding the nations that report complete SARS-CoV-2 genomes (at least in this database). As most of the sequences are from the USA and Australia, the bias in the data could make it easier to differentiate Australia from the USA. However, two well-differentiated USA clusters were found, and the distance between clusters is large enough to make them recognizable. A biological meaning can also be found, as the k-mer data was able to differentiate between two very different pandemic scenarios, including one with limited samples available.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The clusters were obtained from the PCA projection of the k-mer data, which entails a loss of information: the explained variance drops from around 80% with the 1-mer data to around 27% with the 5-mer data. Even though an insight can be obtained from clustering that data, most of the information is lost in the transformation. A better sequence representation needs to be put in place to reach a better, more general conclusion.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Historically, correspondence analysis (CA) has been used to analyze CUB; even though a pseudo-CUB is analyzed here, CA was not used. Both PCA and CA are dimensionality reduction techniques that project the data onto axes given by the eigenvectors of a dispersion matrix: PCA uses the covariance matrix, while CA uses the weighted variance, or inertia. As PCA is one of the most widely used algorithms in data science, it was chosen for its popularity and simplicity.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Most of the context surrounding each k-mer is lost when only the frequency of each k-mer is measured. That context could be part of a regulatory region in the sequence or an overlapping ORF, to mention a few examples. Nonetheless, k-mer analysis can be useful for evaluating multiple reading frames, and is easier to solve and scale than multiple sequence alignment.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Now you have an example of how to use some basic data science tools to visualize and analyze SARS-CoV-2 sequences and obtain useful insights, and how to use those tools to propose a new data science application. The complete code for this post can be found on my GitHub by clicking 
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://github.com/wmac131/genome_sequencing" target="_blank"&gt;&#xD;
      
           here
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           , and the data set can be obtained from the NCBI COVID-19 by clicking 
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://www.ncbi.nlm.nih.gov/sars-cov-2/" target="_blank"&gt;&#xD;
      
           here
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           . Just remember that the number of SARS-CoV-2 sequences grows every day, so some of the results may vary. Please stay safe, and talk to you soon.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Tue, 23 Mar 2021 23:49:46 GMT</pubDate>
      <author>183:920405337 (William McNamara)</author>
      <guid>https://www.williammcnamara3.com/methods-for-the-genetic-classification-of-covid</guid>
      <g-custom:tags type="string">cluster,data analysis,bioinformatics,machine learning,reduction,classification,Python</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-03-23+at+7.28.39-PM.png">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot-2025-03-23-at-7.28.39-PM.png">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>Ordinary Differential Equations in Python</title>
      <link>https://www.williammcnamara3.com/ordinary-differential-equations-in-python</link>
      <description>a python method for modeling differential equations</description>
      <content:encoded>&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           A simple python script to solve, fit and analyze ordinary differential equation (ODE) models
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Used to explain planetary motion, disease spread, and reaction kinetics, to name a few examples, differential equations are among the most useful mathematical techniques for modeling natural phenomena. A key feature shared among differential equation models is the set of parameters that control the model’s behavior. Those parameters can be approximated by a regression technique, using measurements of the appropriate natural phenomenon. If the structure of the model is known (the differential equation has an analytic solution), the regression problem can be easier to solve. However, some differential equation models don’t have an analytic solution, so numerical methods must be used to obtain an approximate solution. The following describes a Python script to solve, fit, and analyze a simple ordinary differential equation (ODE) model.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Defining and solving the model
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           As radioactive decay was my introduction to ODE modeling, I’m going to use it as an example (as this is my first post, I think it kind of fits the theme). Radioactive decay is given by the ODE:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_DUA_5ijcw0lL_AuN3_odyg.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Where C is the amount of radioactive material and k is a positive, material-dependent constant. To solve that ODE, we need to create a function that calculates the right side of the equation, an array that contains the integration times, and the odeint function from scipy. Using an initial condition C0 and k=2, and solving the equation, we get a curve like this.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_5tGfMEByGrVtCGrNb6OnGQ.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
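&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           As a rough sketch of that step, the following solves the decay ODE with scipy's odeint. The initial amount C0 = 1.0 is an assumed value for illustration.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;pre&gt;&#xD;
```python
# Minimal sketch of solving dC/dt = -k*C with scipy's odeint; the
# initial amount C0 = 1.0 is an assumption for illustration.
import numpy as np
from scipy.integrate import odeint

def decay(C, t, k):
    # Right-hand side of the ODE: dC/dt = -k * C
    return -k * C

k = 2.0
C0 = 1.0
t = np.linspace(0, 5, 101)           # integration times
C = odeint(decay, C0, t, args=(k,)).ravel()
print(C[0], C[-1])                   # starts at C0, decays toward zero
```
  &lt;/pre&gt;&#xD;
&lt;/div&gt;&#xD;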
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Parameter estimation
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Once we know how to numerically solve an ODE, we can use the curve_fit function (also from scipy) to estimate k. First, to generate some data, we add a uniform random number to the obtained solution. That gives us a data set to work with. Then we just call the curve_fit function to estimate k; plotting the results, we get a curve as follows.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_-ST5lrCjIywblzPP_VQyiA.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
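&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           A minimal sketch of that fitting step, assuming the same decay model and an assumed C0 = 1.0: noisy data is generated from the solved curve, then k is recovered with scipy's curve_fit.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;pre&gt;&#xD;
```python
# Hedged sketch of the fitting step: generate noisy data from the
# solved decay curve, then recover k with scipy's curve_fit.
import numpy as np
from scipy.integrate import odeint
from scipy.optimize import curve_fit

def decay(C, t, k):
    return -k * C

def model(t, k):
    # Solve the ODE for a candidate k so curve_fit can compare it to data.
    return odeint(decay, 1.0, t, args=(k,)).ravel()

rng = np.random.default_rng(1)
t = np.linspace(0, 5, 101)
true_k = 2.0
data = model(t, true_k) + rng.uniform(-0.05, 0.05, size=t.size)

(k_est,), _ = curve_fit(model, t, data, p0=[1.0])
print(round(k_est, 2))  # close to the true k of 2.0
```
  &lt;/pre&gt;&#xD;
&lt;/div&gt;&#xD;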
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Ideally, we’ll be working with experimental data, however as we are working with simulated data, we can explore some other aspects of ODE modeling.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            ﻿
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Data size and parameter estimation error
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Data acquisition is usually constrained by several factors that can diminish the number of data points we are able to obtain. We can evaluate the impact of data acquisition by changing the number of data points we generate. Using a similar strategy as described above, we can observe that the absolute error of the estimation is about the same across 1000, 500, 100, and 50 data points.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_a0qGfrRjUNZ4aFFjeszU8A.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Residuals analysis
           &#xD;
      &lt;span&gt;&#xD;
        
            ﻿
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           We already observed that with around 50 data points the estimation error converges to a minimum value. To determine whether the model correctly describes the data, we can perform a residual analysis. A residual is defined as the difference between the data and the regression model, and as we are working with simulated data, we should be able to recover the same random noise that we added.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
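&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           A short sketch of that residual check, under the same assumed decay model: fit the noisy data, subtract the fitted curve, and inspect what is left over.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;pre&gt;&#xD;
```python
# Sketch of the residual check: with simulated data the residuals
# should recover the uniform noise that was added.
import numpy as np
from scipy.integrate import odeint
from scipy.optimize import curve_fit

def model(t, k):
    return odeint(lambda C, t: -k * C, 1.0, t).ravel()

rng = np.random.default_rng(2)
t = np.linspace(0, 5, 201)
noise = rng.uniform(-0.05, 0.05, size=t.size)
data = model(t, 2.0) + noise

(k_est,), _ = curve_fit(model, t, data, p0=[1.0])
residuals = data - model(t, k_est)

# Residuals should look like the injected noise: mean near zero,
# bounded by the noise amplitude.
print(round(float(residuals.mean()), 3))
```
  &lt;/pre&gt;&#xD;
&lt;/div&gt;&#xD;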
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           By their shape, residuals can also tell us if the model lacks some terms. Adding a linear or a periodic term to the simulated data, the residuals correctly reconstruct the provided term, allowing us to make changes to the model based on the residuals, or to adjust the data acquisition.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_6Nv23lQfYtMT2_IGypQ3tQ.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_V-I7sjGWAbKCRGKlZvCtgQ.webp" alt=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Concluding remarks
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           As described above, we now know how to solve and fit a simple ODE model, along with some analysis techniques that can be applied to any regression model. The complete code for this tutorial can be found on my GitHub by clicking 
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://github.com/wmac131/differential_equations" target="_blank"&gt;&#xD;
      
           here
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           . And in the next installment of this series, I will show you how to solve and fit an ODE system using Python, along with some other analysis tools.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Wed, 03 Feb 2021 18:53:41 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/ordinary-differential-equations-in-python</guid>
      <g-custom:tags type="string">data analysis,Python</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_-ST5lrCjIywblzPP_VQyiA.webp">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_-ST5lrCjIywblzPP_VQyiA.webp">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>Building Better Prediction Engines</title>
      <link>https://www.williammcnamara3.com/building-a-better-prediction-engine</link>
      <description>Sometimes it's better to build it yourself.</description>
      <content:encoded>&lt;h3&gt;&#xD;
  
         Sometimes, you just gotta build it yourself.
        &#xD;
&lt;/h3&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           I was working for a B2B growth sales organization with the familiar problem of having too many leads to work and not enough time. The challenge was to think of ways that the sales organization might better target their limited time toward those prospects most likely to convert into business.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The conceptual model is simple, and has been used to predict discrete events for centuries. A subject (in this case the prospect) has a given set of characteristics X that yields a probability Y of the event occurring (in this case a sale). This probability can be expressed most easily as a binary ("I think this event will happen"), categorically ("I am very confident this event will happen"), or as a probability itself ("I am 95% confident this event will happen"). Even children intuitively use this model in their heads when assessing whether or not to ask their parents for a cookie before dinner.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The difficulty is that eventually you go from your parents' kitchen to a multi-million-dollar company with operationally complex partnerships. What may previously have been a handful of considerations is now hundreds if not thousands of variables in X. Thank god for statistics!
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Iteration #1: Third-party software
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Like any good developer given a new task, I first look to see if anyone else has done this in a way I can
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      
           copy
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            meaningfully use. Fortunately for me, the company was already using an enterprise platform with predictive functionality. All I had to do was tell it what column to predict and presto it would use all the other data in that table to make the predictions!
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Okay. Is that...too easy?
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           It's not uncommon for applications such as these to hide the details of their methodology in the interest of preventing people from copying their IP. But the thing is, if you're going to have a secret methodology...it has to work. In this case the predictions output by the standard model were accurate only 45% of the time. This means the sales team would have better luck randomly guessing what will convert than if they used this model. This performance can likely be attributed to the fact that models need to be trained on the specific data model; otherwise they will make assumptions about what is or is not important that may not be correct. So I decided to build a custom model, given my closer understanding of the data, to see if custom could outperform standard.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Iteration #2: Machine Learning
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           I decided to go with a Python model because all we're really doing is pulling, analyzing, and pushing data with an API. Python is a great language for data analysis, and especially great if you're just getting into programming. My plan was to build a classifier, which is a type of algorithm for grouping a population into sub-populations (in this case will/will not convert), and applying a continuous probability to its prediction using the following form.
           &#xD;
      &lt;span&gt;&#xD;
        
            ﻿
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2023-08-01+at+2.01.02+PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            After loading and cleaning the data, which any data engineer knows is going to be 75% of the job, I divided the dataset into 80% of historical records I would use to train my model, and 20% of historical records I would use to validate the model's success. The reason the same data shouldn't be used for both is that it can lead to overfitting, meaning the model would be very good at predicting the sample but not data in general. For training my model I used a
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html" target="_blank"&gt;&#xD;
      
           Gradient Boosting Decision Tree
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            from the
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble" target="_blank"&gt;&#xD;
      
           scikit-learn ensemble module
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            for ML in Python. Gradient boosting is a method that combines weaker models together, and improves itself over stages.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/The-architecture-of-Gradient-Boosting-Decision-Tree.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
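&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The training setup can be sketched as follows. Since the company's lead data isn't public, make_classification stands in for the historical records; everything else follows the scikit-learn API described above.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;pre&gt;&#xD;
```python
# Hedged sketch of the training setup, using synthetic data in place
# of the company's lead data (which isn't public).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for historical lead records.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# 80% to train, 20% held out to validate, as in the post.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
acc = accuracy_score(y_val, clf.predict(X_val))
print(round(acc, 2))  # validation accuracy on the held-out 20%
```
  &lt;/pre&gt;&#xD;
&lt;/div&gt;&#xD;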
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The number of weak models used is controlled as a hyper-parameter, as are the size and depth of each model used. After fitting the model to our training data, we're then able to apply that model to the validation data. I started with the model's defaults and made minor adjustments to the hyper-parameters to optimize performance. It became apparent that higher learning rates for each weak model had the biggest impact on performance. Through manual adjustment I was able to increase the model's accuracy from 80% to 88%. But this method was slow, and I knew a faster way.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Iteration #3: Machine Learning w/ Grid Search Optimization
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Scikit-learn has another
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://scikit-learn.org/stable/modules/grid_search.html#grid-search" target="_blank"&gt;&#xD;
      
           great l
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           ibrary
          &#xD;
    &lt;/span&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            for model selection that compares different configurations across the hyper-parameter space and picks the one with the best performance. Automated tuning, what a concept! The tradeoff is that you don't want to give it too broad a hyper-parameter space to search through, or it can take a LONG time to run. In the interest of saving time I had it search in increments of 10%, up to +/- 30% in either direction from the default hyper-parameters.
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
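  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           A minimal sketch of that search with scikit-learn's GridSearchCV; the parameter grid here is illustrative, not the exact ranges used.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;pre&gt;&#xD;
```python
# Sketch of grid-search tuning with scikit-learn's GridSearchCV; the
# parameter ranges here are illustrative, not the post's exact grid.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Keep the hyper-parameter space small so the search stays fast.
grid = {
    "learning_rate": [0.07, 0.1, 0.13],
    "n_estimators": [70, 100, 130],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      grid, cv=3)
search.fit(X, y)
print(search.best_params_)  # the best-scoring configuration found
```
  &lt;/pre&gt;&#xD;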
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            When I applied the grid-search-optimized hyper-parameters to the model, I got a staggering result of 92% accuracy! Revisiting the original premise, this model is able to correctly predict whether a prospect will convert into business 92% of the time, compared to the 45% we started with using the third-party model. Still, I wanted to see where the remaining 8% was coming from so I could have an idea of how the model might be improved in the future. I saw that the 8% were mostly false negatives, which means the model is slightly pessimistic, underestimating the conversion potential of some prospects. This can be visualized with a
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html" target="_blank"&gt;&#xD;
      
           Receiver Operating Characteristic (ROC) Curve
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           . ROC curves typically feature the true positive rate (TPR) on the Y axis and the false positive rate (FPR) on the X axis. This means that the top left corner of the plot is the “ideal” point: an FPR of zero and a TPR of one. This is not very realistic, but it does mean that a larger area under the curve (AUC) is usually better. The “steepness” of ROC curves is also important, since it is ideal to maximize the TPR while minimizing the FPR.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/roc.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
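&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Computing the curve itself can be sketched with scikit-learn's roc_curve and auc, again on synthetic stand-in data rather than the real lead records.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;pre&gt;&#xD;
```python
# Sketch of computing the ROC curve and AUC for a fitted classifier,
# following scikit-learn's roc_curve/auc API (synthetic data again).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                            random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
scores = clf.predict_proba(X_val)[:, 1]   # probability of conversion

fpr, tpr, _ = roc_curve(y_val, scores)
area = auc(fpr, tpr)
print(round(area, 2))  # area under the curve; 1.0 would be ideal
```
  &lt;/pre&gt;&#xD;
&lt;/div&gt;&#xD;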
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Conclusion
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           In conclusion, I was happy to provide a model to help target the sales organization's limited time, and proud that my custom model performed twice as well as a standard third-party model. In the three months following the implementation of my classifier, there was a 15% increase in the organization's lead conversion metric, representing millions of dollars in potential new business. Still, there are some areas for improvement. False negatives potentially represent missed business if sales reps don't follow through, so I fully intend to revisit this at a later date and see what we can do about that last 8%.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Mon, 21 Dec 2020 04:16:28 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/building-a-better-prediction-engine</guid>
      <g-custom:tags type="string">data analysis,machine learning,classification,professional,Python</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/The-architecture-of-Gradient-Boosting-Decision-Tree.png">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/The-architecture-of-Gradient-Boosting-Decision-Tree.png">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>3 Methods for Modeling Protein Aggregation</title>
      <link>https://www.williammcnamara3.com/modeling-protein-aggregation</link>
      <description />
      <content:encoded>&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           How do clumps of bad proteins form? A few different ways...
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Continuing on my bioinformatics kick, I recently read a fascinating NIH paper about polymerization, which we can think of as cellular construction. Protein chains, which I have written about previously, serve two vital purposes in our cells. First, they create a cellular skeleton that gives the cell its shape. Second, they form networks of roads and highways that allow various cellular components to move around.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The cell carefully controls how these protein highways and supports are built using special helper molecules. Some of these helpers create branches in the protein chains. Others act like anchors, connecting these protein networks to the cell's outer boundary. But sometimes things can go wrong, proteins can start clumping together inappropriately. One of the most well-known examples of this happens in Alzheimer's disease, where proteins called beta-amyloid create harmful clumps in the brain that block normal cellular function.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Scientists have developed different ways to understand how these protein clumps form, which I'll investigate here:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           In the first and most basic mode, individual proteins join an existing clump, eventually reaching a physical limit.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_rmqqIdwDhyfsyRD3HswUGQ.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Under this model, we can see that the aggregate reaches equilibrium before the free proteins are exhausted. This reflects the slow rate at which aggregates grow and the fact that they cannot grow indefinitely over time, even at large monomer concentrations. But it doesn't explain how small aggregates form in the first place.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
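           As a rough sketch of this first model, here's a minimal Python simulation. The rate constant, time step, and concentrations are arbitrary values I picked for illustration, not numbers taken from the paper:

```python
def simple_growth(m0=1.0, a0=0.05, a_max=0.5, k=2.0, dt=0.001, steps=5000):
    """Euler simulation: monomers join one existing clump until it
    hits a physical size limit, leaving free monomers behind."""
    m, a = m0, a0
    for _ in range(steps):
        room = max(a_max - a, 0.0)        # remaining capacity of the clump
        flux = min(k * m * room * dt, m)  # never consume more monomer than exists
        m -= flux
        a += flux
    return m, a

m_left, a_final = simple_growth()
print(round(m_left, 3), round(a_final, 3))
```

           Running it shows the clump leveling off near its size limit while free monomer remains, the equilibrium-before-exhaustion behavior this model predicts.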
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           To that end, a second scenario, the autocatalytic model, was developed. In this model, aggregation happens as a chain reaction in which changed proteins can only join with other changed proteins to form clumps.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_U6SACISRte3-fFhpPJ14Rw.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The kinetics of this model are similar to the previous one: monomers are exhausted at a fast rate, but the number of proteins capable of forming aggregates also rises to the same level.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
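           The chain-reaction behavior can be sketched the same way, again with made-up constants. Because the rate scales with both the monomer pool and the aggregate itself, the curve is S-shaped:

```python
def autocatalytic(m0=1.0, a0=0.01, k=5.0, dt=0.001, steps=4000):
    """Euler simulation of the chain reaction: growth rate scales with
    both the monomer pool and the aggregate already formed."""
    m, a = m0, a0
    history = []
    for _ in range(steps):
        flux = k * m * a * dt  # aggregation accelerates as the clump grows
        m -= flux
        a += flux
        history.append(a)
    return history

curve = autocatalytic()
print(round(curve[0], 4), round(curve[-1], 4))
```

           The curve starts slowly, takes off once enough aggregate exists to seed the chain reaction, and ends with the monomer pool nearly exhausted.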
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           So here we have a third scenario, which involves a middle step. Before proteins can join the clump, they need to go through a change. This middle step actually slows down the clumping process, because there's a limit to how quickly proteins can go through the transformation.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
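           That intermediate step can be added to the sketch with a third state. The conversion and addition rates below are arbitrary choices for illustration; the point is only that conversion is much slower than addition:

```python
def two_step(m0=1.0, a0=0.01, k_conv=0.5, k_add=20.0, dt=0.001, steps=8000):
    """Euler simulation with an intermediate state: monomers must slowly
    convert before the fast addition step can attach them to the clump."""
    m, m_star, a = m0, 0.0, a0
    for _ in range(steps):
        conv = k_conv * m * dt                              # slow shape change
        add = min(k_add * m_star * a * dt, m_star + conv)   # fast addition, bounded
        m -= conv
        m_star += conv - add
        a += add
    return m, m_star, a

m_left, waiting, clump = two_step()
print(round(m_left, 3), round(waiting, 4), round(clump, 3))
```

           Because addition is much faster than conversion, converted protein never piles up; the slow conversion step sets the overall speed limit on clumping.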
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_pt9j3aYIZYVybvP1CpZ-HA.webp" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Understanding these different ways that proteins can assemble or clump together helps scientists develop treatments for diseases caused by protein clumping. It also helps us appreciate how cells normally keep these processes under control to build useful structures rather than harmful clumps.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The key difference between normal protein assembly (polymerization) and harmful clumping (aggregation) is organization. When proteins assemble normally, everything has its proper place. When proteins clump abnormally, there is chaos, though scientists are steadily discovering hidden patterns within it.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Fri, 07 Aug 2020 02:34:05 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/modeling-protein-aggregation</guid>
      <g-custom:tags type="string">data analysis,bioinformatics,Python</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_U6SACISRte3-fFhpPJ14Rw.webp">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_U6SACISRte3-fFhpPJ14Rw.webp">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>Visual Analytics to Understand COVID-19</title>
      <link>https://www.williammcnamara3.com/visual-analytics-to-understand-covid-19</link>
      <description>Visual analytics of covid protease structures</description>
      <content:encoded>&lt;div data-rss-type="text"&gt;&#xD;
  &lt;h3&gt;&#xD;
    &lt;span&gt;&#xD;
      
           How researchers are working toward a cure for the coronavirus
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h3&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           A few months ago, COVID-19 changed all of our lives by forcing much of the world into government-mandated lockdowns. Personally, the virus has impacted me and my family significantly, and since I can't leave my house, I have a lot of time on my hands, so I wanted to do what I could to understand the virus in my own way. Hopefully you find this interesting, and if you're here, you probably learn through visual analysis like me. So here we go!
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            When a virus like COVID-19 infects our cells, it's like a tiny hijacker taking over a factory. The virus sneaks its genetic instructions into our cells and forces them to make new virus parts; like the factory getting a new set of blueprints to make its products. These virus parts are initially made as long chains called polyproteins, which need to be cut into smaller, working proteins. COVID-19 has two special molecular scissors (called proteases) that do this cutting job. Scientists have captured detailed images of the main protease, giving it an identification number of
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="/"&gt;&#xD;
      
           6LU7
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            , you can read more about it
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://pdb101.rcsb.org/motm/242" target="_blank"&gt;&#xD;
      
           here
          &#xD;
    &lt;/a&gt;&#xD;
    &lt;span&gt;&#xD;
      
           . In this post, I'll show you how to create different ways to visualize these proteases using a 3D graphics program called Blender.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           First, we can write out the protease as a string of letters, where each letter represents a building block called an amino acid. While this might look like a jumbled alphabet soup, it's actually the complete instruction manual for building the proteases (the scissors).
           &#xD;
      &lt;span&gt;&#xD;
        
            ﻿
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
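           To give a taste of what that looks like in Python, here's a tiny sketch using a short fragment as a stand-in (the fragment is illustrative, not the full protease chain):

```python
# the 20 standard one-letter amino acid codes
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

# an illustrative fragment, standing in for the full sequence
fragment = "SGFRKMAFPSGKVEGCMVQV"

# sanity-check that every letter is a valid amino acid code
assert set(fragment).issubset(AMINO_ACIDS)
print(len(fragment), "residues")
```

           In a real workflow you would load the full sequence from the PDB entry's FASTA file rather than typing it in.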
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Next, we can turn this string of letters into a graph, where each dot represents an amino acid, and lines show how they're connected. This helps us see patterns that weren't visible in the string of letters – some dots have many connections, suggesting they might be particularly important parts of the protease.
           &#xD;
      &lt;span&gt;&#xD;
        
            ﻿
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
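           Here's one minimal way to sketch that graph idea in plain Python, using a dictionary of connection counts. The backbone links follow the chain order; the extra long-range edge is a hypothetical folding contact added purely for illustration:

```python
from collections import defaultdict

fragment = "SGFRKMAFPS"                                  # illustrative fragment
edges = [(i, i + 1) for i in range(len(fragment) - 1)]   # backbone links
edges.append((1, 8))                                     # hypothetical folding contact

degree = defaultdict(int)
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# the residue with the most connections is a candidate "important part"
hub = max(degree, key=degree.get)
print("hub residue:", fragment[hub], "with", degree[hub], "connections")
```

           A graph library like networkx would give you the same degree counts plus layout and drawing for free.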
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           We can also create a bar graph showing how often each type of amino acid appears in the protease. In doing so we're taking inventory of all the parts needed to build it. Some amino acids might be used frequently, while others appear rarely.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
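           The counting itself is nearly a one-liner with Python's standard library, and the resulting counts could be handed straight to a plotting library's bar function (the fragment below is illustrative, not the full chain):

```python
from collections import Counter

fragment = "SGFRKMAFPSGKVEGCMVQV"   # illustrative fragment
counts = Counter(fragment)

# a quick text-mode "bar graph": one mark per occurrence
for aa, n in counts.most_common():
    print(aa, "#" * n)
```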
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Perhaps most excitingly, we can show the actual 3D shape of the protease. This 3D model helps researchers design drugs that might fit into specific pockets of the molecule, finding the right path to lock up the protein and stop the virus.
           &#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-02-03+at+8.45.39-PM.png" alt=""/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Each of these visualizations tells us something different about the same protease:
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;ul&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        
            The letter sequence makes something invisible become visible
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        
            The graph reveals which parts might be especially important
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        
            The bar graph shows us which building blocks are used most often
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
    &lt;li&gt;&#xD;
      &lt;span&gt;&#xD;
        
            The 3D model helps us understand where potential drugs might attach
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/li&gt;&#xD;
  &lt;/ul&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           By looking at the same object in different ways, we can discover new patterns and better understand how these proteases work. This understanding is crucial for developing treatments for COVID-19 and similar diseases.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Mon, 01 Jun 2020 01:47:35 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/visual-analytics-to-understand-covid-19</guid>
      <g-custom:tags type="string">data analysis,bioinformatics,machine learning,Python</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-02-03+at+8.46.44-PM.png">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/Screenshot+2025-02-03+at+8.46.44-PM.png">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>12 Steps to Start a Career in Salesforce</title>
      <link>https://www.williammcnamara3.com/12-steps-to-start-a-career-in-salesforce6e9f2a89</link>
      <description>Learn what steps you can take to get started as a Salesforce professional.</description>
      <content:encoded>&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_L6s2SBnktFS6KaeBA4ZhxQ.webp" alt="" title=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    
          If you work for a company that sells something (or even one that doesn’t), you’ve probably heard of Salesforce. You may have also seen someone in your network posting about certifications and Salesforce-specific jobs, and you’ve decided you want to learn what that’s all about.
         &#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    
          Salesforce is a customer relationship management (CRM) platform, and companies of all types and sizes use it to transform their business. Market research firms expect the Salesforce economy to create 3.3 million new jobs and close to a trillion dollars in new business revenue by 2022. Because of this demand, recent salary surveys for 2020 put the earning potential for Salesforce Admins between $100–125k (with developers earning as much as $170k)! Skills in these income tiers are generally difficult or expensive to learn, but not so here! Salesforce is an easy, fun, and free skill to learn if you follow the right path.
         &#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_wr-Rt1Rl-v55ZEuzYc36gw.webp" alt="" title=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    
          So, if you’ve decided you’d like to transform your career and start setting yourself up for a career in Salesforce, you’re making a great choice! But there are so many resources out there and so many job titles to navigate that it can be hard to get an initial read on what the heck is going on in the Salesforce economy. So here are 12 steps that can help get anyone to a full-time career in Salesforce.
         &#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;h3&gt;&#xD;
  
         #1: Consider which track is right for you
        &#xD;
&lt;/h3&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    
          Salesforce is a giant ecosystem with a lot of career tracks to meet you where you’re at today. The most basic and popular of these career tracks is the Administrator track; there is even a saying amongst practitioners that the “Admin is king”. You can think of admins as generalists; they have wide enough subject knowledge to cover most Salesforce use cases (depending on the product). For this reason, nearly every Salesforce organization will have admins running the show.
         &#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    
          Then there are more specialized tracks, notable of which are the Developer Track and the Consultant Track. These specialized tracks will typically overlap quite significantly with the admin track, but go deeper on selected subject areas like industry products or APEX programming. Typically, these tracks have their own certifications and are tailored for individuals with experience on the platform in an admin role. If you’re interested in learning more, you can look at all the tracks
          &#xD;
    &lt;a href="https://trailhead.salesforce.com/career-path/" target="_top"&gt;&#xD;
      
           here
          &#xD;
    &lt;/a&gt;&#xD;
    
          .
         &#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;h3&gt;&#xD;
  
         #2: Listen to your end user
        &#xD;
&lt;/h3&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    
          Unless you’re looking to hop to a Salesforce role with another company, you’re probably going to have to make a case for how this could be an impactful change for the business. It’s a great thing if you have a boss who is open to new ideas and invested in your personal development; but if you really want to make the case, you need to come to the table with ideas. And the best people to get ideas from are all around you: the users.
         &#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    
          If you’re in a role that uses Salesforce, you probably know the ups and downs of how your team uses it. You probably know every annoying thing, who experiences it, and why it should be fixed. Now picture every team that uses Salesforce at your company, each with their own problems and ideas. This knowledge is gold to a Salesforce owner; so talk to the users. Get to know their pain. If you come to your manager with problems identified and a plan to deliver business impact by fixing them…it’s hard to say no.
         &#xD;
  &lt;/p&gt;&#xD;
  &lt;span&gt;&#xD;
    
          ﻿
         &#xD;
  &lt;/span&gt;&#xD;
  &lt;p&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;h3&gt;&#xD;
  
         #3: Talk to a practitioner
        &#xD;
&lt;/h3&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    
          If your company already has full-time Salesforce people, talk to them. Chances are they’ll be glad you did. There is so much to do in Salesforce, and so much impact to be delivered, that teams are typically under-resourced and more than happy to get reinforcements. If they don’t outright offer to teach you themselves, they’ll probably give you guidance on how to train yourself. Keep them updated on your progress, and ask them what else you can be doing and for honest feedback on your development.
         &#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    
          When you’re ready (not for another couple of steps!), you’ll be able to come back to them and show them what you’ve accomplished. Ask them if they can give you some easy, low-risk tasks, something that takes a little work off their own plate. They’ll probably be able to give you something real you can sink your teeth into. Repeat this with more and more advanced tasks and you’ll be getting real experience in a real organization, which will be invaluable to your future.
         &#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;h3&gt;&#xD;
  
         #4: Probably start with one product
        &#xD;
&lt;/h3&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    
          It’s easy to think of Salesforce as a single platform, and that’s because it is. But behind the curtain, Salesforce is a
          &#xD;
    &lt;a href="https://www.salesforce.com/products/"&gt;&#xD;
      
           suite of products
          &#xD;
    &lt;/a&gt;&#xD;
    
          with very different functions and use cases. Your company may use one or several of these products in its instance, and before you start any course of study you should figure out what those products are. Because wouldn’t it be awkward if you learn everything about a Salesforce product your company doesn’t even use?
         &#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_gsADAYr1qVWLyacC989HrA.webp" alt="" title=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    
          If your company doesn’t currently use Salesforce, you’re in luck because you have your choice of the lot! You’ll probably want to pick a product that could be useful to your company or a company you want to work for. The big ones are Sales Cloud, Service Cloud, and Marketing Cloud. Most Salesforce organizations license one or multiple of these three products. Sales Cloud is what put Salesforce on the map, with 83% of Salesforce practitioners reporting proficiency in the technology. This is probably the best place to start, particularly if you’ve selected an admin track.
         &#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;h3&gt;&#xD;
  
         #5: Get a free Developer Edition account
        &#xD;
&lt;/h3&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    
          You do not need access to a production Salesforce organization to learn Salesforce; you can deploy your own personalized instance for free! Salesforce offers free developer edition accounts that can be deployed in minutes and operate as a kind of sandbox, giving you risk-free creative freedom. No matter how badly you screw up, nobody will ever know! I highly recommend that the first thing you do is go
          &#xD;
    &lt;a href="https://developer.salesforce.com/signup"&gt;&#xD;
      
           register
          &#xD;
    &lt;/a&gt;&#xD;
    
          for one of these sandboxes so you can test your learnings in a real environment.
         &#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;h3&gt;&#xD;
  
         #6: Go do your Trailheads!
        &#xD;
&lt;/h3&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    
          The Salesforce ecosystem is massive. Even the most experienced Salesforce professional can learn something new and apply it to a problem their business might be facing. This means there are endless opportunities for growth and professional development on the Salesforce platform.
         &#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    
          Salesforce has invested millions in resources for training and certifying learners at all levels. The most popular platform for learning Salesforce skills is
          &#xD;
    &lt;a href="https://trailhead.salesforce.com/career-path/"&gt;&#xD;
      
           Trailhead
          &#xD;
    &lt;/a&gt;&#xD;
    
          , where learners (or “Trailblazers” as they are referred to in the ecosystem) can browse a massive library of easily explained walkthroughs on just about any Salesforce use case imaginable. Trailhead has also gamified professional development, offering badges and credentials for concepts mastered that you can add to your resume or Trailhead profile. Other e-learning platforms like
          &#xD;
    &lt;a href="https://www.udemy.com/"&gt;&#xD;
      
           Udemy
          &#xD;
    &lt;/a&gt;&#xD;
    
          and
          &#xD;
    &lt;a href="https://focusonforce.com/certification-courses/"&gt;&#xD;
      
           Focus on Force
          &#xD;
    &lt;/a&gt;&#xD;
    
          have courses you can take that are more specifically tailored to achieving a specific certification or Salesforce career.
         &#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    
          I really cannot emphasize this enough: do your Trailheads! If you follow none of these other steps, follow this one! Bookmark the website, set reminders on your phone, do whatever you gotta do; but for the sake of all that is useful, do your Trailheads!
         &#xD;
  &lt;/p&gt;&#xD;
  &lt;br/&gt;&#xD;
  &lt;p&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_5r0frDIhbOE1JbI9e_92tg.webp" alt="" title=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    
          I’m going to leave it there for now, but rest assured I’ll come back to stress this again later.
         &#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;h3&gt;&#xD;
  
         #7: Don’t worry about programming
        &#xD;
&lt;/h3&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    
          You may have heard a popular saying among Salesforce practitioners, “Clicks not code”. This mantra sums up the mission of Salesforce’s powerful and extensive suite of low-code tools: to enable organizations to deliver digital transformation without having to hire programmers. Salesforce’s updated user experience, Salesforce Lightning, even extends their drag-and-drop functionality to deploying and scaling custom applications, something previously reserved for coders.
         &#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    
          If you’re looking to start a career in Salesforce but don’t know how to code, have no fear because you do not need to have one iota of coding ability. Part of what makes the Salesforce ecosystem so empowering is anyone can learn it, regardless of skill set. In fact, most sophisticated Salesforce organizations don’t need to write code at all.
         &#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    
          All that said, developers aren’t going anywhere; and learning how to code can significantly extend your capabilities in Salesforce to hyper-specific enterprise problems your business may be facing. If you’re interested in learning SOQL or APEX for Salesforce, you can find plenty of
          &#xD;
    &lt;a href="https://trailhead.salesforce.com/en/content/learn/trails/force_com_dev_beginner"&gt;&#xD;
      
           lessons
          &#xD;
    &lt;/a&gt;&#xD;
    
          on Trailhead.
         &#xD;
  &lt;/p&gt;&#xD;
  &lt;span&gt;&#xD;
    
          ﻿
         &#xD;
  &lt;/span&gt;&#xD;
  &lt;p&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;h3&gt;&#xD;
  
         #8: Learn all things Lightning
        &#xD;
&lt;/h3&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    
          A few years back, Salesforce released a new user experience that extends the capabilities of their drag-and-drop features to application development and component-based page layouts. This makes learning how to code in Apex even less urgent now than it was previously. But the fact is many organizations still haven’t made their transition over from Salesforce Classic to Salesforce Lightning, but most of those same organizations are looking for people to help them do it. Learning Lightning will not only help grow your capabilities within the platform, but also improve your value to the Salesforce job market. I recommend the
          &#xD;
    &lt;a href="https://trailhead.salesforce.com/en/content/learn/trails/learn-admin-essentials"&gt;&#xD;
      
           Admin Essentials in Lightning
          &#xD;
    &lt;/a&gt;&#xD;
    
          Trail to get started with your Lightning education!
         &#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;h3&gt;&#xD;
  
         #9: Join a user group
        &#xD;
&lt;/h3&gt;&#xD;
&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_Qihoekf6mB50Jqdq8OwybQ.webp" alt="" title=""/&gt;&#xD;
  &lt;span&gt;&#xD;
  &lt;/span&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    
          Okay, this one is by no means essential, but joining a user group can be a great opportunity for professional networking (and fun too!). Cities across the world have user groups where Salesforce practitioners of specific or any skillsets can gather to learn from each other, network, or just enjoy each other’s company! There are user groups for specific products, roles, or even concepts. These user groups are oftentimes financed by Salesforce, and sometimes the issues and solutions identified in their sessions even make their way onto Salesforce’s product roadmap. You can see what’s in your city
          &#xD;
    &lt;a href="https://trailblazercommunitygroups.com/?utm_source=trailhead&amp;amp;utm_medium=web-homepage-hero&amp;amp;utm_campaign=community_groups&amp;amp;_ga=2.264156940.1214623620.1586188966-1637511159.1578892381&amp;amp;_gac=1.92325615.1583176479.CjwKCAiA-vLyBRBWEiwAzOkGVLla71tx3bYsN_OxX4MlvtlaKToOL5bWWpHDD0s6kbCfCQoMstHJ3hoCKmUQAvD_BwE"&gt;&#xD;
      
           here
          &#xD;
    &lt;/a&gt;&#xD;
    
          !
         &#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;h3&gt;&#xD;
  
         #10: Get Certified
        &#xD;
&lt;/h3&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    
          Once you’ve mastered what you wanted to learn in Trailhead, it may be time to start considering getting certified! Certifications are proctored exams taken to verify to the outside world that you know what you’re talking about. Some Salesforce jobs require certifications, some don’t; but in any case, it gives you an edge up on the competition when the hiring manager can be assured you’ve studied what they need.
         &#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    
          You can
          &#xD;
    &lt;a href="https://trailhead.salesforce.com/credentials/administratoroverview"&gt;&#xD;
      
           register for certification exams
          &#xD;
    &lt;/a&gt;&#xD;
    
          on Trailhead. There is a cost to register, so you’ll want to select which one is right for you and study hard. Some of the exams have brutal pass rates, but if you happen to not pass your first time around, you can retake the exam at a discount within a few weeks. If you’d like extra exam preparation akin to your SAT prep courses back in high school, there are
          &#xD;
    &lt;a href="https://trailhead.salesforce.com/en/academy"&gt;&#xD;
      
           in-person
          &#xD;
    &lt;/a&gt;&#xD;
    
          or
          &#xD;
    &lt;a href="https://focusonforce.com/certification-courses/"&gt;&#xD;
      
           online
          &#xD;
    &lt;/a&gt;&#xD;
    
          courses you can take that will train you for a specific certification exam.
         &#xD;
  &lt;/p&gt;&#xD;
  &lt;span&gt;&#xD;
    
          ﻿
         &#xD;
  &lt;/span&gt;&#xD;
  &lt;p&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;h3&gt;&#xD;
  
         #11: Be prepared to wear multiple hats
        &#xD;
&lt;/h3&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    
          Cool, you’ve got your certification and you’re ready to drop it on LinkedIn! Now what? Well, oftentimes the next step will be up to you, but it can’t hurt to develop some soft skills. Salesforce is a relatively new career category, and so most organizations don’t have traditional project manager/technical resource structures built out. Oftentimes you’ll need to be both the technical resource and your own project manager.
         &#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    
          This can be jarring for someone coming from a big company or straight out of school, who may be used to having defined and achievable tasks laid out for them by somebody else, or having another person check their work. For those considering a Salesforce role, a critical skill to learn is setting yourself up for success and building close relationships with coworkers to sanity-check ideas. But once you’ve mastered this, you’ll come to enjoy the agility and creative freedom, unburdened by extensive feedback apparatuses.
         &#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;h3&gt;&#xD;
  
         #12: Raise your hand
        &#xD;
&lt;/h3&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    
          Alright, you did it! You’re ready for a career in Salesforce. You have your certifications, you have the right attitude, and you have a plan to deliver impact for your organization. The next step can be the most difficult for some, but the next time somebody needs something done in Salesforce, raise your hand. Let others know this is something you have the desire and know-how to do. Nobody is going to intuitively know what you want from your career, so you need to self-advocate. Maybe your company has been looking for a person to own Salesforce but haven’t gotten around to hiring someone, or maybe there’s a position for you at another company in your network. There are hundreds of thousands of Salesforce positions out there, go shoot your shot!
         &#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Mon, 06 Apr 2020 00:00:00 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/12-steps-to-start-a-career-in-salesforce6e9f2a89</guid>
      <g-custom:tags type="string">salesforce,professional</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/1_L6s2SBnktFS6KaeBA4ZhxQ.webp">
        <media:description>thumbnail</media:description>
      </media:content>
    </item>
    <item>
      <title>8 Tips to Avoid Reckless Technical Debt</title>
      <link>https://www.williammcnamara3.com/8-tips-to-avoid-reckless-technical-debt</link>
      <description>Originally published on govloop.com</description>
      <content:encoded>&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/iStock-528568816.jpg"/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;span&gt;&#xD;
        
            Originally Published on govloop.com:
           &#xD;
      &lt;/span&gt;&#xD;
    &lt;/span&gt;&#xD;
    &lt;a href="https://www.govloop.com/community/blog/8-tips-avoid-reckless-technical-debt/" target="_blank"&gt;&#xD;
      
           Source
          &#xD;
    &lt;/a&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           In the digital age, IT experts have evolved from computer fixers into institutional power-players, oftentimes driving critical infrastructure projects. But in both the private and the public sectors, many factors should go into building a new digital system. One important consideration is avoiding decisions that will incur technical debt.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Technical debt is a concept that reflects the hidden cost of implementing a sub-optimal IT infrastructure. Basically, the shortcuts you take to cut time or cost today will create headaches for you tomorrow. Any time you make a sub-optimal technology decision, you’re incurring debt for the future projects you’ll need to fix it.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Now, before I get to how to avoid it, I want to point out something important. Debt in any form is not always a bad thing. Smart operators can make strategic decisions to incur a little technical debt to create value now, be that a faster-deployed project or money saved for another one.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           As anybody who has ever had a credit card knows, debt only becomes dangerous when you get reckless, when the interest you’re paying cripples your ability to make necessary changes. Similarly, technical debt can be prudent, or it can be reckless.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           So how can you avoid the reckless accumulation of technical debt? At the end of the day a competent manager should know what to do, but for those who are interested, here are a few tips to keep your project team on the right track.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           1. Define your project goals before you do anything else
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           This one should be a no-brainer. Starting development before any design is in place is like driving in a foreign country without a map, and chances are you’ll need to come back and rework things that weren’t designed correctly. If this happens and you identify refactoring that will be necessary, don’t delay it, because your coders will keep building on something they’ll need to fix later. Better yet, change the design and build it correctly the first time around.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Now obviously it’s impossible to create a perfect design up front. Business requirements can and usually will change up until the day the project is finished, and reworking will be necessary to comply with them. But knowing that not everything is in your control, be smart about planning the things that are. That way you’re not accumulating technical debt at twice the speed.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           2. Anticipate necessary integrations
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           While you’re in your planning stage, ask yourself what other areas of your infrastructure this project will affect. If integrations will be necessary (and in all likelihood they will be), figure out whether your design can accommodate them. By doing this early on, you save yourself the time of refactoring and reconfiguring when two systems won’t play nice. Be prepared.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           3. Take your time to do it right the first time
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Of course you are going to get pressure from your higher ups to deliver on a project as soon as you can. After all, time is money. But decisions you make to cut corners will cost you down the line.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           If you do just the minimum to get a project finished, it will be released with a substantial amount of technical debt, and the risks of any anticipated refactoring will increase dramatically, especially once it becomes integrated with the rest of your IT infrastructure. Save yourself the nightmare later; do it right now.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           4. Document everything you do
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Your third-grade math teacher taught you a valuable lesson: “show your work.” Make sure your developers and project managers document their work so that if and when any change needs to occur, there is a clear roadmap for it.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           5. Flexible is better
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Your IT needs are going to change; think about how much the landscape has changed in the last decade alone. Unless you want to keep scrapping and rebuilding a new system every couple of years, it’s in your interest to create a flexible, modular software design.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Modular software allows you to make changes to one component or functionality without having to change everything. Tightly-coupled components create a web of technical debt that makes even the smallest changes massive and expensive. Don’t do that to yourself.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           6. Try to avoid parallel development
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           If you have a large team, it can be tempting to segment it and have groups develop parallel branches that you’ll merge later. This is all well and good, but you will spend time and money later merging them into a single source base. Change developed in isolation accrues technical debt. It’s usually better in the long run to have your whole team on a single roadmap.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           7. Test suites are absolutely necessary
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           A test suite is a collection of test cases that verify your software does everything you need it to do; it is one of the cornerstones of quality assurance. To ensure compliance with business requirements, test cases create the conditions needed to prompt the behaviors your software is expected to exhibit. It’s a great way to catch screw-ups.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           The real world is NOT a test suite, and when you just throw your system out of the nest before it’s tested you risk a world of embarrassing malfunctions, system failures, and costly repairs.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;h4&gt;&#xD;
    &lt;span&gt;&#xD;
      
           8. Be watchful of IT contracting companies
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/h4&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           Now, I don’t want to knock IT contractors. Most of them build their business on providing you the best possible solution, so that you will continue to contract with them or give them good recommendations.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           But be aware that contractors can sometimes skip steps, intentionally or by mistake, incurring technical debt that creates more problems for you and more billable hours for them. The best way to avoid this is to do your research and get to know the clients your potential contractors have worked with. It’s a simple phone call that could save you millions of dollars.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      
           All of these steps are straightforward approaches that have helped project managers to deliver long term solutions that don’t break the bank. As I mentioned before, a little bit of technical debt can sometimes be strategic, but keep an eye on it to make sure it doesn’t get out of hand. And as always, plan ahead.
          &#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;span&gt;&#xD;
      &lt;br/&gt;&#xD;
    &lt;/span&gt;&#xD;
  &lt;/p&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Tue, 08 Aug 2017 20:48:32 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/8-tips-to-avoid-reckless-technical-debt</guid>
      <g-custom:tags type="string">data management,professional</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/iStock-528568816.jpg">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/iStock-528568816.jpg">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>Don’t Fear Artificial Intelligence, Plan For It</title>
      <link>https://www.williammcnamara3.com/dont-fear-artificial-intelligence-plan-for-it</link>
      <description>Originally published on govloop.com</description>
      <content:encoded>&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/iStock-643956998-e1542750195909.jpg"/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;div&gt;&#xD;
    
          Originally published on govloop.com:
          &#xD;
    &lt;a href="https://www.govloop.com/community/blog/dont-fear-artificial-intelligence-plan/" target="_blank"&gt;&#xD;
      &lt;font&gt;&#xD;
        
            Source
           &#xD;
      &lt;/font&gt;&#xD;
    &lt;/a&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  
         Earlier this week, Facebook once again splashed across news headlines with reports that they had shut down one of their artificial intelligence (AI) programs after two computers created a linguistic shorthand to communicate with each other.
         &#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          The AIs, named Bob and Alice, were attempting to imitate human speech, but found it easier to create a machine language of their own. The developers working on the project could not understand what the machines were saying, and consequently cancelled it. That’s it. There was no fear that they had created a superintelligent, humanity-destroying machine; they just couldn’t understand what it was saying, and that’s not useful.
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          What I find interesting though is how quickly bloggers and internet commenters wanted to jump on the doomsday bandwagon. The truth is no superhuman AI is currently in existence, but that’s not to say it won’t come about this century. The key is not to think of it as an existential threat, but rather an event we can plan for and build around.
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          Why do we need it?
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          For most of our history we’ve invented technologies to replace our muscles: moving heavy objects, easing transportation, those kinds of things. It’s only really in the last century that we’ve started inventing things to replace our brains: calculators, record-keeping software, even predictive models for things like trading and population growth. But things change, and sometimes even the technology we invent can’t keep up.
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          The purpose of AI will be machines that can learn and adapt to changing circumstances. They’ll be flexible, self-learning, and intuitive. And yeah, they’re going to be better than us at some of these things, probably a lot better. But when you think about it, that’s literally always been the case.
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          Phonebooks used to be essential, now we have Google. When the refrigerator was invented, it really did a number on the market for ice houses. And I’m sure there were some pretty angry town criers when the printing press started working. But we changed, new technology brought new jobs, lifted us out of the dark ages, and made us more connected than ever.
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          What we can learn from history, though, is the importance of doing things safely. We live in a very connected world, and in some ways that makes our information and privacy more secure, but in others more vulnerable. As progress in AI continues to make headlines, we should focus on developing safe, purpose-driven, and controllable technology.
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          What does “safe” AI look like?
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          Computers have always been better than us at holding and analyzing vast amounts of information, but what they’ve never been better than us at is knowing what to do with it. As computers reach the point where they start to understand the value and usefulness of a piece of information, it’s important we program into them what values should be applied to what processes.
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          Computers already watch us; gathering information from our actions is what they’ve been doing all along. The key for the future will be having them watch our behaviors and how we make decisions, because that will teach them what we value and how we expect them to decide. It’s like teaching a child who can access an infinite amount of information.
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          “AI services” is a phrase I’ve never heard before, but one I’m certain will become a household term in my lifetime, because these machines will need people to direct them, to point to problems and say “solve that,” or to teach them what it means to do good so the machines can point themselves. Plenty of jobs will be created by AI; we really don’t need to be afraid of that.
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          So instead of being afraid of computers, let’s be excited for what they’ll bring and embrace the disciplines that brighten our lives. The arts, design, writing, teaching, social work, sports; I don’t see those jobs going anywhere anytime soon.
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          Don’t be afraid of AI; embrace it. Because whether you fear it or not, chances are we will see early intelligent machines in our lifetimes, and we can make that a doomsday scenario, or we can prepare to meet the future the way we always have.
         &#xD;
  &lt;/div&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Thu, 03 Aug 2017 20:44:22 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/dont-fear-artificial-intelligence-plan-for-it</guid>
      <g-custom:tags type="string">professional</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/iStock-643956998-e1542750195909.jpg">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/iStock-643956998-e1542750195909.jpg">
        <media:description>main image</media:description>
      </media:content>
    </item>
    <item>
      <title>The Country Taking CSI to The Next Level</title>
      <link>https://www.williammcnamara3.com/the-country-taking-csi-to-the-next-level</link>
      <description>Originally published on govloop.com</description>
      <content:encoded>&lt;div&gt;&#xD;
  &lt;img src="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/iStock-482691137-e1501013028901.jpg"/&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div data-rss-type="text"&gt;&#xD;
  &lt;div&gt;&#xD;
    
          Originally Published on govloop.com:
          &#xD;
    &lt;a href="https://www.govloop.com/community/blog/country-taking-csi-next-level/" target="_blank"&gt;&#xD;
      &lt;font&gt;&#xD;
        
            Source
           &#xD;
      &lt;/font&gt;&#xD;
    &lt;/a&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  
         Imagine you’re a police detective arriving on the scene of a suspected meth lab. There’s equipment everywhere and you have limited time to decide what’s most important. All of a sudden, an arrow appears over a bottle on a table that says “Bag this please.” These kinds of prompts, previously restricted to video games and futuristic movies, may be making their way to reality; that is, if one country can prove that they work.
         &#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          A pilot program recently implemented in the Netherlands leverages augmented reality (AR) to send visual guidance from remote experts to crime scene investigators in the field; the prompts just pop up in their goggles. Think of it like Pokemon Go, except instead of catching creatures, investigators are identifying and collecting evidence to catch a criminal.
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          The system is one of several technologies created by the AR development company TWNKLS, in collaboration with the Dutch Forensic Institute and the Delft University of Technology. The technology offers the following advantages:
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;ul&gt;&#xD;
      &lt;li&gt;&#xD;
        
            More data can be gathered and applied to investigative decisions
           &#xD;
      &lt;/li&gt;&#xD;
      &lt;li&gt;&#xD;
        
            The unified framework allows for quicker, well-focused actions
           &#xD;
      &lt;/li&gt;&#xD;
      &lt;li&gt;&#xD;
        
            Step-by-step tracking allows for accurate documentation and decreased margin of error
           &#xD;
      &lt;/li&gt;&#xD;
      &lt;li&gt;&#xD;
        
            Easy communication allows for higher quality collaboration between distributed individuals and teams
           &#xD;
      &lt;/li&gt;&#xD;
      &lt;li&gt;&#xD;
        
            Applying diverse backgrounds and expertise to an investigation fosters greater common understanding, ultimately improving the quality of the response.
           &#xD;
      &lt;/li&gt;&#xD;
    &lt;/ul&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          Principal researcher Dragoş Datcu says the full version could be available as early as this summer. And although there’s no evidence that any American law enforcement agencies plan to follow suit, AR technology is arriving at a critical time in public sector growth for the Netherlands.
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          The new system comes in response to diminishing resources to address increasingly complex crimes. AR enables subject matter experts to feed critical guidance to detectives across a large area in a short amount of time, allowing them to be actively involved in more investigations than ever before.
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          “We’ve tried the system and it really adds a lot of value to many different areas of policing,” said innovation adviser Nick Koeman from the National Police of the Netherlands. He went on to mention however that the technology is not in use for making arrests because it could be too distracting to operators.
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          Cool, but what does this mean for the United States?
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          The key question is whether such a technology could be implemented in the United States. Historically, law enforcement agencies have in fact been on the front lines of funding and testing new technology; radio transceivers, biometrics, even body cameras all made their way around the police circuit before becoming commonplace.
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          And while we shouldn’t expect to see FBI agents running around in AR goggles anytime soon, similar pilot programs could well begin with well-funded local police forces like the NYPD or the San Francisco Police Department, both of which have already received several offers to test similar virtual systems.
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          A benefit to consider when we talk about visual technology in our police forces is the opportunity to create greater accountability in one of the public sector’s most important functions. Liberal democracies have long struggled to minimize or eliminate bias in their systems of criminal justice, and AR could allow for the complete reconstruction and visualization of a crime scene for investigators and juries, a feat until now only attempted with a lot of paperwork and physical reenactment.
         &#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    &lt;br/&gt;&#xD;
  &lt;/div&gt;&#xD;
  &lt;div&gt;&#xD;
    
          Finding the right balance of AR and human investigation in the criminal justice system will be a difficult task. Law enforcement can be innovative, but it can also be change-averse, preferring to trust procedures that have worked for decades. But we can’t abandon innovation just because it sounds strange. If you ask me, we should follow the Dutch example. Who knows, it might even (and probably will) become the new normal.
         &#xD;
  &lt;/div&gt;&#xD;
&lt;/div&gt;</content:encoded>
      <pubDate>Tue, 25 Jul 2017 20:41:39 GMT</pubDate>
      <guid>https://www.williammcnamara3.com/the-country-taking-csi-to-the-next-level</guid>
      <g-custom:tags type="string">machine learning,professional</g-custom:tags>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/iStock-482691137-e1501013028901.jpg">
        <media:description>thumbnail</media:description>
      </media:content>
      <media:content medium="image" url="https://cdn.website-editor.net/s/2c3cabbbcc7d480b9ee1af1eae48b462/dms3rep/multi/iStock-482691137-e1501013028901.jpg">
        <media:description>main image</media:description>
      </media:content>
    </item>
  </channel>
</rss>
