Hours to Minutes: Scaling User Clustering with Topological Manifold Learning
How I made a critical segmentation algorithm 9x faster with a new approach to clustering
If there is one topic I enjoy more than any other, it’s dimensionality reduction algorithms (which I have written about extensively). One of the most useful applications of dimensionality reduction is user segmentation, where machine learning can take a multitude of inputs and produce richer clusters of similarity than any single variable, however important that variable seems when stripped of its deeper context.
A couple of years ago, I built a user clustering algorithm that leverages t-distributed stochastic neighbor embedding (t-SNE) to predict how likely customers are to churn or expand their business. What I didn’t expect though was to run into scaling issues so quickly.
The Problem Statement
t-SNE has to consider every possible relationship between customers, and this thoroughness becomes a weakness. Imagine asking every person in a city how similar they feel to every other person, one pairing at a time, millions of comparisons in all. That’s t-SNE: every time you add a new customer, the volume of calculations grows quadratically.
When I launched the program, it took a couple of hours to run through our customer base, which meant running it only once a day, a constraint I had learned to live with. But every time we added more customers to the analysis, t-SNE's processing time didn't just creep up; it ballooned.
I needed to figure out how to achieve higher-performance clustering, ideally without sacrificing accuracy or run frequency.
UMAP Approach
With all of the buzz around the training of foundation models at companies like OpenAI and Anthropic, I’ve been familiarizing myself with topological manifold learning, which I’ve found to be a more intuitive way to think about optimizing algorithms than the traditional approaches I was taught in school. In this context, topological manifolds simply give greater consideration to the natural structures emerging in data. Returning to the city analogy: instead of comparing every resident of the city to every other resident, you ask how similar each person feels to their immediate neighbors, their neighborhood in general, and the community of the broader city. That saves you and your algorithm a lot of time while preserving the global and local structures that already exist.
Enter UMAP (Uniform Manifold Approximation and Projection) modeling, which represents a fundamental shift in approaching dimensionality reduction for user clustering. Instead of comparing every user to every other user like t-SNE does, UMAP builds a network graph where each user connects primarily to their nearest neighbors, then uses cross-entropy optimization to find a low-dimensional representation that preserves both these local neighborhoods and the global structure of how different user segments relate to each other.
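To make the contrast concrete, here is a minimal sketch of swapping t-SNE for UMAP on the same feature matrix. The random data and parameter values are illustrative stand-ins, not my tuned production settings.

```python
# Minimal sketch: t-SNE vs. UMAP on the same (n_users, n_features) matrix.
# The random matrix below stands in for scaled user feature vectors.
import numpy as np
from sklearn.manifold import TSNE
import umap  # pip install umap-learn

rng = np.random.default_rng(42)
X = rng.normal(size=(5_000, 20))  # illustrative stand-in for real user features

# t-SNE: models pairwise similarities between users
tsne_embedding = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

# UMAP: builds a k-nearest-neighbor graph, then optimizes a low-dimensional
# layout preserving both local neighborhoods and global structure
umap_embedding = umap.UMAP(
    n_components=2,
    n_neighbors=15,   # size of each user's "neighborhood"
    min_dist=0.1,     # how tightly points pack within clusters
    random_state=42,
).fit_transform(X)
```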

This approach promises dramatic computational advantages: in theory, it reduces processing time while producing more stable and consistent results for the business. So I decided to put it to the test!
Measuring Performance
Evaluating the comparative performance of UMAP requires a multi-dimensional approach that goes beyond simple speed comparisons to ensure the migration delivers genuine business value. Processing time should certainly improve, but I also need to validate that faster clustering doesn't come at the cost of insight quality or stability, since a faster algorithm that produces worse segments could harm business decision making.
I’ll measure three critical dimensions:
- Processing Time
- Clustering Quality
- Business Impact
This comprehensive approach involves running both algorithms on identical datasets with optimized parameters, conducting statistical significance testing to ensure observed differences aren't due to randomness, and validating that technical improvements translate to measurable business outcomes like better personalization performance or more actionable user segments.
Assessing Processing Time
I’ve built an evaluation framework that tests both algorithms across progressively larger datasets from 1,000 to 100,000 users using identical hardware configurations and multiple runs to account for statistical variation. I chose to measure not just the core clustering time, but the complete end-to-end pipeline including data loading, preprocessing, and post-processing phases, since the clustering algorithm represents only one component of the total user segmentation workflow.
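A rough sketch of that timing harness is below. It assumes a `load_users(n)` helper (hypothetical here) that returns the preprocessed feature matrix for n users; each configuration is run several times so means and variances can be compared.

```python
# Sketch of the timing harness: repeated end-to-end runs per dataset size.
import time
import numpy as np
from sklearn.manifold import TSNE
import umap

def time_algorithm(fit_transform, X, n_runs=5):
    """Return wall-clock times (seconds) for repeated runs on the same data."""
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fit_transform(X)
        times.append(time.perf_counter() - start)
    return np.array(times)

results = {}
for n in [1_000, 5_000, 10_000, 50_000, 100_000]:
    X = load_users(n)  # hypothetical loader for the preprocessed pipeline input
    results[n] = {
        "tsne": time_algorithm(lambda d: TSNE(n_components=2).fit_transform(d), X),
        "umap": time_algorithm(lambda d: umap.UMAP(n_components=2).fit_transform(d), X),
    }
```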

The testing protocol includes statistical significance analysis using paired t-tests and confidence intervals to ensure observed speedups are genuine rather than measurement noise, while also documenting memory usage patterns and identifying the scalability limits where each algorithm begins to fail. And most importantly, I validated results under realistic production conditions with simulated background system load, since benchmark performance in isolation can differ significantly from real-world deployment scenarios. This comprehensive measurement approach allowed me to quantify not just the dramatic time improvements, which measured at 75% runtime savings, but also to understand the practical implications for infrastructure costs, pipeline frequency, and data scientist productivity.
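For reference, here is a sketch of the significance check, assuming the per-run timings collected by the harness above: a paired t-test on matched runs, plus a simple confidence interval on the mean time saved.

```python
# Paired t-test and 95% CI on per-run time savings (example: 10,000 users).
import numpy as np
from scipy import stats

tsne_times = results[10_000]["tsne"]
umap_times = results[10_000]["umap"]

t_stat, p_value = stats.ttest_rel(tsne_times, umap_times)

diffs = tsne_times - umap_times  # seconds saved per matched run
ci = stats.t.interval(0.95, len(diffs) - 1, loc=diffs.mean(), scale=stats.sem(diffs))
print(f"p = {p_value:.4f}, 95% CI on time saved per run: {ci}")
```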
Assessing Clustering Quality
Assessing embedding quality is a more complex challenge in dimensionality reduction evaluation: unlike speed, where faster execution is an unambiguous improvement, embedding quality involves conflicting objectives that must be balanced across different analytical priorities.
The evaluation framework I have built addresses this complexity with six dimensions of assessment:
- Local Structure Preservation: This metric ensures that users who exhibit similar patterns, engagement levels, or demographic characteristics stay grouped together during dimensionality reduction.
- Global Structure Preservation: Understanding how different user segments relate to each other is crucial for strategic decisions. If high-spend users and budget users are naturally distant in behavior space, this should be reflected in the visualization.
- Intrinsic Clustering Quality: This directly measures how actionable your user segments will be. Better clustering quality means clearer segment boundaries and more distinct user personas.
- Ground Truth Validation: This provides the most direct assessment of whether the embedding preserves the user behavioral patterns you expect to find, ensuring insights derived from visualization accurately reflect genuine segmentation patterns.
- Density Preservation: High-value customers might be sparse and unique, while mainstream users cluster densely. Preserving these patterns is crucial for understanding market structure.
- Topological Structure Preservation: User behavior often follows continuous pathways - freemium to premium, casual to power user, young adult to family-oriented. Topology preservation ensures these natural progression routes are visible in the embedding.
These six metrics collectively balance the competing mathematical objectives of dimensionality reduction algorithms. Going in, I expected t-SNE to win on cluster separation and UMAP to win on the more structural metrics, which often provide more actionable business insights.
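Two of the six dimensions can be approximated with off-the-shelf scikit-learn metrics, shown in the sketch below: trustworthiness for local structure preservation and silhouette score for intrinsic clustering quality. The remaining dimensions (global structure, density, topology) need custom measures I won't reproduce here. This assumes `X`, `umap_embedding`, and `tsne_embedding` from the first sketch.

```python
# Sketch of two of the six quality checks using scikit-learn built-ins.
from sklearn.manifold import trustworthiness
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def quality_report(X_high, X_low, n_clusters=5):
    # Local structure: how well high-dimensional neighbors stay neighbors
    local = trustworthiness(X_high, X_low, n_neighbors=15)
    # Intrinsic clustering quality: separation of clusters in the embedding
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit_predict(X_low)
    intrinsic = silhouette_score(X_low, labels)
    return {"local_preservation": local, "clustering_quality": intrinsic}

print("UMAP:", quality_report(X, umap_embedding))
print("t-SNE:", quality_report(X, tsne_embedding))
```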

The assessment revealed UMAP as the superior algorithm in 75% of test runs. UMAP consistently excelled at maintaining local neighborhood relationships (2-5x better preservation), density patterns (4-13x better), and topological connectivity, while both algorithms achieved perfect ground truth validation scores.
Although t-SNE maintains an advantage in pure clustering quality, UMAP's dominance in the structural preservation metrics proves more valuable for real-world user clustering scenarios, and I would expect that gap to only widen at scale.
Business Impact
Alright, why are we doing any of this? The benefit of cluster analysis is the ability to segment data points into distinct clusters so that each can receive different business treatment. In my case, the algorithm identifies five distinct clusters describing different kinds of users. This is gold for a product team, which can then customize the user experience to address the needs of each specific segment. See below the detailed cluster analysis, where a random sample of 100 users fits very effectively into five groups.
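A sketch of how such a plot can be produced from the embedding follows; KMeans with k=5 stands in here for whichever clustering step a production pipeline actually uses, and `umap_embedding` comes from the first sketch.

```python
# Sketch: assign five segments in UMAP space and plot a 100-user sample.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

sample = umap_embedding[:100]  # sample of 100 users, for readability
labels = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(sample)

plt.scatter(sample[:, 0], sample[:, 1], c=labels, cmap="tab10", s=40)
plt.title("User segments in UMAP space (n=100)")
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")
plt.show()
```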

While scatter plots effectively show cluster assignments, density overlap analysis reveals the probabilistic boundaries and confidence regions that standard visualizations miss.
By applying Gaussian kernel density estimation to each cluster and plotting the resulting contours, I can see where the clusters truly begin and end, rather than just seeing point locations. The insight here is in the contour intersections. Non-overlapping contours indicate clean separation with high classification confidence, while an overlapping region, such as the one between Clusters 1 and 5, suggests a potential misclassification zone where data points could reasonably belong to either (or multiple) clusters. This is particularly valuable for real-world applications where understanding uncertainty is crucial. There is also insight in the contour shapes: I'm generally seeing circular patterns, which suggest well-defined, homogeneous groups. More irregular shapes might indicate subclusters or the need for parameter adjustment.
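Here is a sketch of that density-overlap analysis: fit a Gaussian KDE per cluster and draw its contours over the scatter. It assumes the `sample` and `labels` arrays from the previous snippet.

```python
# Sketch: per-cluster Gaussian KDE contours over the segment scatter plot.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

xs, ys = sample[:, 0], sample[:, 1]
grid_x, grid_y = np.mgrid[xs.min():xs.max():100j, ys.min():ys.max():100j]
grid = np.vstack([grid_x.ravel(), grid_y.ravel()])

fig, ax = plt.subplots()
ax.scatter(xs, ys, c=labels, cmap="tab10", s=25)
for k in np.unique(labels):
    pts = sample[labels == k]
    kde = gaussian_kde(pts.T)  # KDE expects shape (dims, n_points)
    density = kde(grid).reshape(grid_x.shape)
    ax.contour(grid_x, grid_y, density, levels=3, alpha=0.6)
ax.set_title("Cluster density contours (Gaussian KDE)")
plt.show()
```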

Conclusion
Cluster analysis is a useful tool for analyzing user behavior, but at scale it can become computationally expensive. Switching from t-SNE to UMAP introduced a topological manifold approach that better preserves existing structure and saves the algorithm a great deal of work. This has allowed my program to run 9x faster while preserving, and in some cases enhancing, the embedding quality of the segment assignments. UMAP may not be superior in every use case, though, so I recommend following an evaluation methodology like the one I've described before making the switch in your own business. That said, I'm a UMAP convert!
