Hours to Minutes: Scaling User Clustering with Topological Manifold Learning

William McNamara • September 19, 2025

How I made a critical segmentation algorithm 9x faster with a new approach to clustering

If there is one topic I enjoy more than any other, it's dimensionality reduction algorithms (which I have written about extensively). One of the most useful applications of dimensionality reduction is user segmentation, where machine learning can take a multitude of inputs and produce better similarity clusters than any single variable, however important that variable seems on its own, once it's stripped of its deeper context.


A couple of years ago, I built a user clustering algorithm that leverages t-distributed stochastic neighbor embedding (t-SNE) to predict how likely customers are to churn or expand their business. What I didn’t expect though was to run into scaling issues so quickly.


The Problem Statement


t-SNE has to consider every possible relationship between customers, and that thoroughness turns into a weakness. Imagine you're looking at a city, and you ask every person how similar they feel to every other person, one pair at a time, millions of comparisons in all. That's t-SNE: every time you add a new customer, the volume of calculations grows with the square of the customer count rather than linearly.


When I launched the program, it took a couple of hours to run through our customer base, which meant running it once a day, a compromise I had come to live with. But every time we added more customers to our analysis, t-SNE's processing time didn't just increase - it exploded.


I needed to figure out how to achieve higher-performance clustering, ideally without sacrificing accuracy or run frequency.


UMAP Approach


With all of the buzz around the training of foundational models at companies like OpenAI and Anthropic, I've been familiarizing myself with Topological Manifold Learning, which I've found to be a more intuitive way to think about optimizing algorithms than the traditional approaches I was taught in school. In this context, topological manifolds simply give greater consideration to the natural structures emerging in data. Returning to the city analogy: instead of comparing every resident to every other resident, you ask how similar each person feels to their immediate neighbors, to their neighborhood in general, and to the broader city community. That saves you and your algorithm a lot of time while preserving the local and global structures that already exist.


Enter UMAP (Uniform Manifold Approximation and Projection) modeling, which represents a fundamental shift in approaching dimensionality reduction for user clustering. Instead of comparing every user to every other user like t-SNE does, UMAP builds a network graph where each user connects primarily to their nearest neighbors, then uses cross-entropy optimization to find a low-dimensional representation that preserves both these local neighborhoods and the global structure of how different user segments relate to each other.
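Here's roughly what that shift looks like in code - a minimal sketch assuming the umap-learn and scikit-learn packages, with a placeholder feature matrix and illustrative parameters rather than my production settings:

```python
# Minimal sketch of running UMAP on a user-feature matrix.
# The feature matrix below is a random placeholder; swap in real user data.
import numpy as np
import umap
from sklearn.preprocessing import StandardScaler

# user_features: one row per user, one column per behavioral/demographic feature
user_features = np.random.rand(10_000, 40)
X = StandardScaler().fit_transform(user_features)

reducer = umap.UMAP(
    n_neighbors=15,   # size of each user's local neighborhood in the graph
    min_dist=0.1,     # how tightly points are allowed to pack in the embedding
    n_components=2,   # 2-D output for visualization and downstream clustering
    random_state=42,
)
embedding = reducer.fit_transform(X)  # shape: (n_users, 2)
```

The n_neighbors parameter is the local-versus-global trade-off from the city analogy: small values emphasize each user's immediate neighbors, larger values pull in more of the surrounding "neighborhood."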


This approach delivers dramatic computational advantages. In theory it reduces processing time while producing more stable and consistent results for the business. So I decided to test this out!



Measuring Performance


Evaluating the comparative performance of UMAP requires a multi-dimensional approach that goes beyond simple speed comparisons to ensure the migration delivers genuine business value. Processing time should definitely improve, but I also need to validate that faster clustering doesn't come at the cost of insight quality or stability, a trade-off that could harm business decision making no matter how fast the runs become.


I'll measure three critical dimensions:

  • Processing Time
  • Clustering Quality
  • Business Impact


This comprehensive approach involves running both algorithms on identical datasets with optimized parameters, conducting statistical significance testing to ensure observed differences aren't due to randomness, and validating that technical improvements translate to measurable business outcomes like better personalization performance or more actionable user segments.


Assessing Processing Time


I’ve built an evaluation framework that tests both algorithms across progressively larger datasets from 1,000 to 100,000 users using identical hardware configurations and multiple runs to account for statistical variation. I chose to measure not just the core clustering time, but the complete end-to-end pipeline including data loading, preprocessing, and post-processing phases, since the clustering algorithm represents only one component of the total user segmentation workflow. 
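The harness itself is straightforward. Below is a stripped-down sketch of the timing loop (illustrative only: it uses a synthetic matrix and times just the reduction step, whereas the full framework also covers loading, preprocessing, and post-processing):

```python
# Stripped-down timing harness: run each reducer several times per dataset size
# and keep the per-run durations for later statistical comparison.
import time
import numpy as np
from sklearn.manifold import TSNE
import umap

def time_runs(make_reducer, X, n_runs=5):
    """Fit a fresh reducer n_runs times and return each run's duration in seconds."""
    durations = []
    for _ in range(n_runs):
        reducer = make_reducer()
        start = time.perf_counter()
        reducer.fit_transform(X)
        durations.append(time.perf_counter() - start)
    return np.array(durations)

results = {}
for n_users in (1_000, 10_000, 100_000):
    X = np.random.rand(n_users, 40)  # stand-in for the preprocessed user matrix
    results[n_users] = {
        "tsne": time_runs(lambda: TSNE(n_components=2, random_state=42), X),
        "umap": time_runs(lambda: umap.UMAP(n_components=2, random_state=42), X),
    }
```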


The testing protocol includes statistical significance analysis using paired t-tests and confidence intervals to ensure observed speedups are genuine rather than measurement noise, while also documenting memory usage patterns and identifying the scalability limits where each algorithm begins to fail. And most importantly, I validated results under realistic production conditions with simulated background system load, since benchmark performance in isolation can differ significantly from real-world deployment scenarios. This comprehensive measurement approach allowed me to quantify not just the dramatic time improvements, which came out to roughly 75% runtime savings, but also to understand the practical implications for infrastructure costs, pipeline frequency, and data scientist productivity.
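For the significance check, a paired test on per-run timings is enough. Here's a minimal sketch reusing the results dictionary from the timing harness above; the dataset size is arbitrary:

```python
# Paired significance test on the timing runs: each pair shares the same dataset
# and hardware conditions, so a paired t-test on per-run durations is appropriate.
from scipy import stats

tsne_times = results[10_000]["tsne"]
umap_times = results[10_000]["umap"]

t_stat, p_value = stats.ttest_rel(tsne_times, umap_times)

# 95% confidence interval on the mean per-run time saved by UMAP.
diff = tsne_times - umap_times
ci = stats.t.interval(0.95, len(diff) - 1, loc=diff.mean(), scale=stats.sem(diff))

print(f"t = {t_stat:.2f}, p = {p_value:.4g}, mean saving 95% CI = {ci}")
```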


Assessing Clustering Quality


Assessing embedding quality is a more complex challenge than assessing speed. Unlike performance metrics, where faster execution times are an unambiguous improvement, embedding quality involves conflicting objectives that must be balanced across different analytical priorities.


The evaluation framework I have built addresses this complexity with six dimensions of assessment:


  • Local Structure Preservation: This metric ensures that users who exhibit similar patterns, engagement levels, or demographic characteristics stay grouped together during dimensionality reduction.
  • Global Structure Preservation: Understanding how different user segments relate to each other is crucial for strategic decisions. If high-spend users and budget users are naturally distant in behavior space, this should be reflected in the visualization.
  • Intrinsic Clustering Quality: This directly measures how actionable your user segments will be. Better clustering quality means clearer segment boundaries and more distinct user personas.
  • Ground Truth Validation: This provides the most direct assessment of whether the embedding preserves the user behavioral patterns you expect to find, ensuring insights derived from visualization accurately reflect genuine segmentation patterns.
  • Density Preservation: High-value customers might be sparse and unique, while mainstream users cluster densely. Preserving these patterns is crucial for understanding market structure.
  • Topological Structure Preservation: User behavior often follows continuous pathways - freemium to premium, casual to power user, young adult to family-oriented. Topology preservation ensures these natural progression routes are visible in the embedding.


These six metrics collectively balance the different mathematical objectives of dimensionality reduction algorithms. Running the comparison, I would expect t-SNE to be optimized for cluster separation and UMAP to be optimized for the more structural metrics, which often provide more actionable business insights. (A sketch of how two of these metrics can be computed is below.)
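Here's a minimal sketch of two of those checks, assuming X is the high-dimensional user matrix and embedding is the 2-D output from earlier; the neighbor count and sample size are illustrative:

```python
# Two structural checks: local preservation via trustworthiness, and global
# preservation via rank correlation of pairwise distances before and after reduction.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sklearn.manifold import trustworthiness

# Local structure: do each user's high-dimensional neighbors survive the projection?
# Scores range from 0 to 1, where 1 means neighborhoods are perfectly preserved.
local_score = trustworthiness(X, embedding, n_neighbors=15)

# Global structure: do long-range distances between users keep their relative ordering?
# Sample a subset so the pairwise-distance computation stays cheap.
idx = np.random.choice(len(X), size=min(2000, len(X)), replace=False)
global_score, _ = spearmanr(pdist(X[idx]), pdist(embedding[idx]))

print(f"local structure: {local_score:.3f}, global structure: {global_score:.3f}")
```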


The assessment revealed UMAP as the superior algorithm in 75% of test runs. UMAP consistently excelled at maintaining local neighborhood relationships (showing 2-5x better preservation), density patterns (4-13x superior), and topological connectivity, while both algorithms achieved perfect ground truth validation scores.


Although t-SNE maintains an advantage in pure clustering quality, UMAP's dominance in structural preservation metrics proves more valuable for real-world user clustering scenarios, and I would expect that gap to only widen at scale.


Business Impact


Alright, why are we doing any of this? The benefit of cluster analysis is being able to segment data points into distinct clusters so that they can receive different business treatment. In my case the algorithm identifies 5 distinct clusters describing different kinds of users. This is gold to a product team, which can then customize the user experience to address the needs of each specific segment. The detailed cluster analysis below shows a random sample of 100 users fitting very effectively into five groups.
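The assignment step itself is simple once the embedding exists. A sketch, assuming the embedding and X arrays from earlier, with the segment count and summary columns chosen purely for illustration:

```python
# Turn the 2-D embedding into five segments and profile each one so the product
# team can see what distinguishes it. Cluster count and column names are illustrative.
import pandas as pd
from sklearn.cluster import KMeans

labels = KMeans(n_clusters=5, random_state=42, n_init=10).fit_predict(embedding)

# Attach segment labels back to the original (pre-reduction) features for profiling.
features = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
segment_profile = features.assign(segment=labels).groupby("segment").mean()
segment_sizes = pd.Series(labels).value_counts().sort_index()

print(segment_sizes)
print(segment_profile.round(2))
```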


While scatter plots effectively show cluster assignments, density overlap analysis reveals the probabilistic boundaries and confidence regions that standard visualizations miss.


By applying Gaussian kernel density estimation to each cluster and plotting the resulting contours, I can see where the clusters truly begin and end, rather than just showing point locations. The insight here is in the contour intersections. Non-overlapping contours indicate clean separation with high classification confidence, while an overlapping region, such as the one between Clusters 1 and 5, suggests a potential misclassification zone where data points could reasonably belong to either (or multiple) clusters. This is particularly valuable for real-world applications where understanding uncertainty is crucial. There is also insight in the contour shapes. I'm generally seeing circular patterns, which suggest well-defined and homogeneous groups. If the shapes were more irregular, it might indicate subclusters or the need for parameter adjustment.
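Here's roughly how those contours are produced - a sketch using scipy's gaussian_kde on the embedding and labels arrays from earlier, with grid resolution and styling chosen for illustration:

```python
# Density-overlap view: fit a 2-D Gaussian KDE per cluster and draw its contours
# on top of the scatter, so boundary and overlap regions become visible.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

fig, ax = plt.subplots(figsize=(8, 6))
xx, yy = np.meshgrid(
    np.linspace(embedding[:, 0].min(), embedding[:, 0].max(), 200),
    np.linspace(embedding[:, 1].min(), embedding[:, 1].max(), 200),
)
grid = np.vstack([xx.ravel(), yy.ravel()])

for cluster_id in np.unique(labels):
    points = embedding[labels == cluster_id]
    kde = gaussian_kde(points.T)              # fit a 2-D KDE to this cluster's points
    density = kde(grid).reshape(xx.shape)
    ax.contour(xx, yy, density, levels=5)     # overlapping contours flag ambiguous zones
    ax.scatter(points[:, 0], points[:, 1], s=8, alpha=0.4)

ax.set_title("Cluster density contours (Gaussian KDE)")
plt.show()
```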

Conclusion

Cluster analysis is a useful tool for analyzing user behavior, but at scale it can become computationally expensive. Switching from t-SNE to UMAP introduced topological manifold learning, which better preserves the structures already present in the data and saves the algorithm a lot of work. This has allowed my program to run 9x faster while preserving, and in some cases enhancing, the embedding quality of the segment assignments. UMAP may not be superior in every use case though, so I recommend following an evaluation methodology like the one I've described if you're intending to do this in your business. That said, I'm a UMAP convert!
