batches of batch: visualizing RC interests

Tue Jan 07

in which many words are used to explain a small, ml-based data visualization

Introduction

Screenshot of my cluster from the final visualization. My batch-mates are pseudonymized.

In this post, I’ll explain how I used some fundamental machine learning techniques to visualize the interests of my (anonymized) batch-mates at the Recurse Center (RC), at the beginning of our Winter 1, 2024 (W1’24) batch. This project demonstrates:

My goal for this post is to provide an accessible explainer for this project, whether you are new to machine learning or just curious about analyzing text data!

Disclaimer: This was a quick, fun exercise that I did to warm up my coding muscles at the beginning of batch. As such, I didn’t spend any additional time on improving final outputs here. Aside from the general pitfalls of attempting to group individuals by algorithm, expect the results to be kind of wonky and not tuned for accuracy!

For more on RC, check out the first few paragraphs that I wrote here.

The What

“batches of batch” is a lightly interactive 2D/3D visualization that attempts to show how my batch-mates’ interests relate to one another, based on the content of their welcome posts and week 1 introductions. The full source code is available on GitHub: iconix/rc-batchviz.

This viz was inspired by the “Advice and Introductions” call on the first day of W1’24, where RC faculty advised us to take notes because:

It’s very easy to forget who said the thing you were curious about once everybody’s gone

This sparked a fun idea: what if I could create a durable “map” of everyone’s interests based on what they shared that day? This could theoretically help us all find potential collaborators and encourage connections.

As a spoiler, here’s what the live visualization looks like, in 2D (easier to maneuver on bigger screens):

It shows batch-mates clustered by color, and hovering over points reveals each batch-mate’s pseudonym, cluster assignment, and topical keywords for the cluster. You can also zoom and pan the visualization to better see the dense clusters (see the controls at the top right). You can render it in 3D, too, by toggling a flag on the generating script (--components 3).

The How

At a very high level:

I manually scraped some text data, then pushed it through an embedding model + a dimensionality reducer + a clusterer + a topic modeler.

“Neat, but huh?!” was basically how my batch-mate Karen replied when I gave this description in my daily RC check-in post. Fair enough!

Step 1. Gather text

some text data ➤ my notes from the “Advice and Introductions” call + W1’24 welcome topic posts

This bit is straightforward enough. I copy-pasted all the welcome posts from our messaging platform, Zulip, into a Google Sheet, added my own notes from the call, and exported it all to a CSV.
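In code, that loading step might look something like the following - a minimal sketch where the file name and column names ("name", "text") are made up for illustration, not the project's actual schema:

```python
# A minimal sketch of the data-loading step. The file name and column
# names ("name", "text") are assumptions, not the project's real schema.
import pandas as pd

df = pd.read_csv("batch_intros.csv")      # exported from the Google Sheet
texts = df["text"].fillna("").tolist()    # one blob of text per batch-mate
names = df["name"].tolist()
print(f"Loaded {len(texts)} batch-mates")
```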

Step 2. Turn text into numbers (embeddings)

pushed it through an embedding model ➤ converted the text into lists of numbers that capture features (relationships and patterns) from the original data

Here’s where the ML ‘magic’ starts! Modern machine learning models can translate text (discrete, symbolic, context-dependent) into a continuous representation that other models, and computers in general, know how to process: that is, a ton of numbers (also known as vectors). The numbers aren’t random, either: they are tuned to encode similarity and relationships between different texts. An embedding model is what performs this encoding.

Here’s a simplified visualization of how embeddings behave in 2D:

A scatter plot of “cat”, “dog”, and “airplane” in embedding space

In this example, instead of trying to understand “cat” as a word, the model understands it as a point in multidimensional mathematical space. Words with similar meanings or closer relationships appear closer together - so “cat” and “dog” are closer together than “cat” and “airplane” are.

An interesting effect of this: since these are numbers with meaning, you can essentially do math on words, like measuring cosine similarity between them. Pretty neat!
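To make that concrete, here is a toy example with made-up three-dimensional "embeddings" (real ones have hundreds of dimensions, but the math is identical):

```python
# Toy cosine similarity between made-up 3D "embeddings".
import numpy as np

cat = np.array([0.9, 0.8, 0.1])
dog = np.array([0.85, 0.75, 0.2])
airplane = np.array([0.1, 0.2, 0.95])

def cosine_similarity(a, b):
    # 1.0 means pointing in the same direction; near 0.0 means unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(cat, dog))       # high: close in "meaning"
print(cosine_similarity(cat, airplane))  # lower: less related
```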

The embedding model learns these relationships by seeing lots and lots of examples. Nowadays, text-based models are taught on essentially as many books and as much of the internet as possible (for better or worse). The teaching happens by having the model predict some objective (like predicting which words might appear nearby - aka within the context of - the word “cat”), then slightly adjust its internal numbers (“weights”) to do better on the next round of predictions. Doing better at this objective nudges the numerical representations of co-occurring text to be more and more similar.

It is pretty amazing that models trained from scratch start by making predictions totally at random and then refine from there. If you’re interested in diving deeper into the details, this prediction-and-adjustment cycle is known as gradient descent.
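For a taste of the idea (and definitely not how a real embedding model is trained), here is a toy gradient descent loop that minimizes a single-weight function:

```python
# Toy gradient descent: minimize f(w) = (w - 3)^2, starting from a bad guess.
# Real models run this loop over millions of weights, but the idea is the same.
learning_rate = 0.1
w = 10.0                            # initial (bad) guess

for step in range(50):
    gradient = 2 * (w - 3)          # derivative of (w - 3)^2 with respect to w
    w -= learning_rate * gradient   # nudge the weight to do slightly better

print(round(w, 3))                  # converges toward 3.0, the minimum
```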

Anywho, I didn’t have to teach these models from scratch here. Instead I used a pretrained, open-source neural network called Nomic Embed (nomic-embed-text-v1.5, to be exact). This model was chosen because it is available on the Hugging Face model repository, which makes usage simple, and it has strong performance in academic benchmarks (outperforming OpenAI’s paid embedding model, thanks to how it was trained). It can also handle longer text sequences with its 8,192 token context window, although that wasn’t necessary for this project.

The Nomic Embed model converts each person’s associated text data into a 768-dimensional vector - meaning each person (and, ideally for our purposes, their interests) is represented by 768 numbers!
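For the curious, here is roughly what the embedding step looks like via the sentence-transformers library. This is a sketch rather than my exact code, and the "clustering:" task prefix follows Nomic's recommended usage - treat the details as assumptions if you adapt it:

```python
# Sketch of embedding text with nomic-embed-text-v1.5 (not the exact project code).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

# Nomic recommends a task prefix; "clustering:" is assumed appropriate here.
texts = [
    "clustering: I love generative art and creative coding",
    "clustering: interested in compilers and programming languages",
]
embeddings = model.encode(texts)
print(embeddings.shape)  # (2, 768) - one 768-dimensional vector per text
```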

Step 3. Simplify the numbers (dimensionality reduction)

+ a dimensionality reducer ➤ an algorithm that takes high-dimensional data and creates a meaningful low-dimensional representation

So now we have lists of numbers imbued with semantic meaning. These lists of K numbers represent the text they encode in K-dimensional space.

K is usually a fairly big number - like a multiple of 128. For nomic-embed-text-v1.5, K=768. So, how to make sense of this 768D web of words? One approach is to “flatten” this web into a simpler map, in a lower visual dimension (2D or 3D), which preserves the most important connections. Think of it like making a visual map of the Earth - you lose some information, but the important relationships (like which countries are near each other) stay mostly intact. This technique is called dimensionality reduction.

UMAP, or Uniform Manifold Approximation and Projection, is the particular algorithm I used in this project for dimensionality reduction. UMAP is good at preserving both local and global structure in the data, which should allow the reduction to generalize better. It also tends to work well with the high-dimensional embeddings produced by modern language models.

Like other machine learning approaches, dimensionality reduction involves learning patterns from data. UMAP uses techniques similar to neural networks (e.g., gradient descent!), optimizing a mathematical objective that balances preserving local and global structure.
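In code, the reduction step looks something like the following with the umap-learn library (the parameter values are illustrative, not necessarily what my script uses):

```python
# Sketch of dimensionality reduction with umap-learn (illustrative parameters).
import numpy as np
import umap

embeddings = np.random.rand(40, 768)  # stand-in for the real 768-d embeddings

reducer = umap.UMAP(
    n_components=2,    # 2 for the 2D map; 3 for the 3D version
    n_neighbors=10,    # balances local vs. global structure
    min_dist=0.1,      # how tightly points may pack together
    random_state=42,
)
coords = reducer.fit_transform(embeddings)
print(coords.shape)    # (40, 2) - one (x, y) point per batch-mate
```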

Here’s another simplified visualization, this time for dimensionality reduction:

simple diagram showing 3D points being mapped to 2D while preserving relationships, with some information loss

Note that some relationships between points are distorted in the projection due to information loss. For example, the tight cluster in 3D gets somewhat dispersed in 2D.

You may wonder: what do the dimensions mean after UMAP is applied? Unlike other methods like Principal Component Analysis (PCA), where dimensions can sometimes be interpreted, UMAP dimensions don’t have inherent meaning. Instead, they represent abstract combinations of features that UMAP found useful for preserving the important relationships in the high-dimensional space. What matters is the relative positions and distances between points, not the specific coordinate values.

Step 4. Find groups (clustering)

+ a clusterer ➤ an algorithm to identify natural groupings in the simplified number space

Once we have our 2D or 3D representation, we can find natural groups in the data with a clustering algorithm called HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise).

HDBSCAN works by:

  1. Finding areas where points are densely packed together
  2. Identifying these dense regions as clusters
  3. Assigning points to their most likely cluster

HDBSCAN was chosen over simpler clustering algorithms like k-means because it doesn’t require specifying the number of clusters in advance (which we don’t know!), can identify noise points or “outliers” that don’t belong to any cluster, and can find clusters of varying densities and shapes.

Clustering is a form of unsupervised machine learning, where the algorithm learns to identify patterns and structure in data without being given explicit examples of what constitutes a “correct” grouping. HDBSCAN learns to recognize dense regions in the data that likely represent meaningful groups.
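In code, the clustering step is similarly compact with the hdbscan library - again, a sketch with illustrative parameters rather than my exact settings:

```python
# Sketch of clustering the reduced points with hdbscan (illustrative parameters).
import numpy as np
import hdbscan

coords = np.random.rand(40, 2)  # stand-in for the UMAP output

# No need to pick the number of clusters up front, unlike k-means.
clusterer = hdbscan.HDBSCAN(min_cluster_size=3)
labels = clusterer.fit_predict(coords)

print(set(labels))  # cluster ids; -1 marks "noise" points that fit no cluster
```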

Step 5. Understand group themes (topic modeling)

+ a topic modeler ➤ an algorithm to extract key themes and terms that characterize a group of documents

In this step, text makes a comeback!

For each cluster, we want to understand what common themes (ideally ones that align with interests) bind it together. We use a technique called topic modeling to automatically extract key themes for each cluster from the text data assigned to it.

As an exercise and for comparison with automated methods, here is how I manually interpreted the clusters by interest:

Topic modeling is another form of unsupervised machine learning that learns to discover abstract topics that occur in a collection of documents. The algorithm identifies which words tend to appear together frequently and uses this information to infer underlying themes or topics.

The results I liked best came from a modified TF-IDF method, despite it being considered old school. The script has a flag (--topic-method) that can be switched to run more modern methods, like KeyBERT.
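For a flavor of the TF-IDF approach, here is one simple way to pull per-cluster keywords with scikit-learn. My "modified" version differs in the details, so treat this as an illustration of the technique rather than the actual implementation:

```python
# Sketch: per-cluster keywords via TF-IDF over concatenated cluster documents.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "I build web servers and personal sites",
    "creative coding and generative art",
    "web scraping and data pipelines",
]
labels = [0, 1, 0]  # cluster assignment per document (e.g., from HDBSCAN)

# Glue all documents in a cluster into one big "cluster document".
cluster_docs = {
    c: " ".join(t for t, l in zip(texts, labels) if l == c)
    for c in set(labels)
}

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(cluster_docs.values())
terms = np.array(vectorizer.get_feature_names_out())

for cluster, row in zip(cluster_docs.keys(), tfidf.toarray()):
    top_terms = terms[row.argsort()[::-1][:5]]  # 5 highest-scoring terms
    print(cluster, list(top_terms))
```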

I also compared the automated topic labeling methods available in my script against my own interpretation of the clusters:

So how did topic modeling do? Overall, it often “misses the forest for the trees.” Human interpretation remains valuable for synthesizing coherent, meaningful cluster labels that capture the essence of the grouped content. It just doesn’t scale as well!

Step 6. Visualize groups

The final step uses Plotly to create an interactive visualization where:

The interactivity makes it easier to see the clusters, which can get a bit dense, and to explore the relationships between different clusters and individual data points.
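As a sketch, the Plotly side of things boils down to something like this (the dataframe columns and hover fields are stand-ins, not the project's exact schema):

```python
# Sketch of the interactive scatter plot with Plotly Express (made-up data).
import pandas as pd
import plotly.express as px

df = pd.DataFrame({
    "x": [0.1, 0.2, 2.3, 2.4],              # UMAP coordinates
    "y": [1.0, 1.1, 0.2, 0.3],
    "pseudonym": ["Ada", "Grace", "Alan", "Edsger"],
    "cluster": ["0: web, personal", "0: web, personal",
                "1: systems, compilers", "1: systems, compilers"],
})

fig = px.scatter(
    df, x="x", y="y",
    color="cluster",                         # one color per cluster
    hover_data=["pseudonym", "cluster"],     # revealed on hover
)
fig.show()  # or fig.write_html("batchviz.html") for a shareable page
```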

Quick asides: The only data point without a pseudonym is my own: Nadja Rhodes. I also had lots of dialogue with Claude in order to generate the Plotly code.

All together now

Here’s a flow chart that puts the full pipeline together with a bit more detail. Data flows through the different steps, and each step performs a specific task, as described in the steps above.

full machine learning pipeline visualized

The first full pipeline run, from CSV, takes about a minute and a half on my laptop (no GPU) via the Python script.

The Results

So how representative of interests is this viz, really? Across the board, not terribly but also not spectacularly. The clusters seem fairly coherent after a brief inspection, but the topic labels meant to help with interpretation are lackluster.

That said, as seen in the first screenshot of this post, my particular cluster - with topics web, personal, server, building, creative - feels pretty spot-on (“works on my machine data” *closes laptop*).

Here are some other interesting patterns and data artifacts that I found:

Future improvements could include:

Notes on implementation

This project used the following Python packages:

Here is a direct link to the final visualization. The full implementation is available on GitHub: iconix/rc-batchviz. I don’t share the underlying data because it came from internal communications.

That’s all, folks! Thanks for reading.

Kudos

Special thanks to:

Resources

In machine learning nowadays, things change so fast that finding the most up-to-date resources, pretrained models, and libraries can be half the battle. The following resources helped me out a lot in this respect:

Tags: recurse, projects, ai-ml, dataviz