In this blog post, we will continue exploring the Unsplash Lite dataset with 25k photos. We will organize the photos into clusters using different algorithms, and evaluate their performance.

Since we are dealing with clustering high-dimensional embeddings generated by Transformers, a threshold-based approach or density-based methods like DBSCAN are more practical than graph-based clustering algorithms. Therefore, this post will focus on these two types of algorithms.

By the end of this post, we will have determined the best algorithm for this dataset.

Let’s go!

Preparing materials

retrieve images for offline use

We first download all the Unsplash Lite photos for offline use.

import os
import zipfile

from tqdm import tqdm
from sentence_transformers import util

img_folder = 'unsplash/photos/'
if not os.path.exists(img_folder) or len(os.listdir(img_folder)) == 0:
    os.makedirs(img_folder, exist_ok=True)
  
    photo_filename = 'unsplash-25k-photos.zip'
    if not os.path.exists(photo_filename):   #Download dataset if does not exist
        util.http_get('http://sbert.net/datasets/'+photo_filename, photo_filename)
  
    #Extract all images
    with zipfile.ZipFile(photo_filename, 'r') as zf:
        for member in tqdm(zf.infolist(), desc='Extracting'):
            zf.extract(member, img_folder)

retrieve precomputed embeddings from SentenceTransformer

We can use an embedding model to generate a numerical representation of each image. A good vision transformer will be able to capture the semantic meaning of most of the image elements in the embedding.
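For reference, computing the embeddings ourselves would look roughly like the sketch below. This is only an illustration (we don't run it in this post); it assumes the clip-ViT-B-32 checkpoint referenced later in this post and the img_folder prepared above.

import os

from PIL import Image
from sentence_transformers import SentenceTransformer

# Sketch: encode a small sample of photos with the CLIP image encoder.
model = SentenceTransformer('clip-ViT-B-32')

sample_files = sorted(os.listdir(img_folder))[:32]   # just a handful for illustration
images = [Image.open(os.path.join(img_folder, f)) for f in sample_files]

sample_emb = model.encode(images, batch_size=32, show_progress_bar=True)
print(sample_emb.shape)   # expected (32, 512) for clip-ViT-B-32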

However, to reduce compute resource usage and time, we will use the precomputed image embeddings provided by the SentenceTransformer documentation instead.

import requests

url = "http://sbert.net/datasets/unsplash-25k-photos-embeddings.pkl"
response = requests.get(url)

with open("unsplash-25k-photos-embeddings.pkl","wb") as file:
    file.write(response.content)

We then deserialize the file to get both the image names and embeddings.

import pickle
emb_filename="unsplash-25k-photos-embeddings.pkl" 
with open(emb_filename, 'rb') as fIn: 
    img_names, img_emb = pickle.load(fIn)
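As a quick sanity check (assuming, as the rest of the code in this post does, that img_emb deserializes to a NumPy array):

print("Number of image names:", len(img_names))
print("Embedding matrix shape:", img_emb.shape)   # roughly 25k rows, 512 dimensions for CLIP ViT-B-32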

Threshold-based clustering

Now we are ready for action!

Our first clustering will be done by setting a similarity threshold. In the following snippet, we adapt the community detection utility function from SentenceTransformer to perform the clustering.

This method computes pairwise cosine similarities between embeddings and uses a fixed similarity threshold parameter to decide if one embedding is “close” enough to another to belong in the same community. It further requires that a candidate community meet a minimum size (min_community_size) before it’s considered valid. This is similar to DBSCAN’s min_samples but applied after a similarity evaluation. Essentially, every point is evaluated based on whether its neighbors exceed a given cosine similarity value.

The initial check cos_scores.topk(k=min_community_size) is a coarse filter meant to quickly reject embeddings that clearly don’t have enough high-similarity neighbors, thus avoiding unnecessary computation later. Only after this check is passed will the code resort to subsequent checks (within both the fast and slow paths) that are more fine-grained.

Fast Path vs. Slow Path:

  • Fast Path: If the smallest of these top candidates (top_val_large[-1]) is less than the threshold, then it’s clear that not all of the potentially similar embeddings are above the cutoff. We can leverage the sorted order to quickly halt when values drop below the threshold within the first init_max_size entries. This avoids scanning the entire similarity row, which boosts performance. In this scenario, the community size will be no greater than init_max_size (and likely less) because we stop the moment the condition fails.
  • Slow Path: If the fast path indicates that even the smallest value in the top init_max_size is above the threshold, it suggests that there might be even more embeddings beyond those init_max_size entries that are above the threshold. It then scans the entire similarity row for this embedding. This can capture more indices, creating a larger community.

As we can see, the parameter init_max_size is used as a performance optimization rather than a hard cap on the community size. A smaller value might lead to frequent fallback (slow path), while a large value might be unnecessarily heavy if most communities are small. Adjusting the threshold also impacts the density of the communities detected.

The implementation also explicitly removes overlaps so that each embedding is only assigned to a single community.

from sentence_transformers import util

def community_detection(embeddings, threshold, min_community_size=10, init_max_size=1000):
    """
    Finds in the embeddings all communities, i.e. embeddings that are closer than threshold.

    Returns only communities that are larger than min_community_size. 
    The communities are returned in decreasing order. 
    The first element in each list is the central point in the community.
    """

    # Compute cosine similarity scores
    cos_scores = util.cos_sim(embeddings, embeddings)

    # Quick filter to remove unqualified embeddings 
    # If an embedding's top k most similar neighbors have a similarity below the threshold,
    # it signals that this embedding doesn’t have enough strong connections to qualify as a community center. 
    # This early filtering avoids wasting time on embeddings that cannot possibly form a large enough cluster.
    top_k_values, _ = cos_scores.topk(k=min_community_size, largest=True)

    # Filter for rows >= min_threshold
    extracted_communities = []
    for i in range(len(top_k_values)):
        # Initial screening
        # For each embedding, checks if the lowest similarity [-1] among its top-k scores 
        # meets or exceeds the given threshold. 
        # If it does, this embedding is a candidate for being the center of a community.
        if top_k_values[i][-1] >= threshold:
            new_cluster = []

            # the candidate passes the threshold, 
            # gathering all its neighbors that meet the threshold using a two-pronged strategy 

            # Gather init_max_size similar entries 
            top_val_large, top_idx_large = cos_scores[i].topk(k=init_max_size, largest=True)
            top_idx_large = top_idx_large.tolist()
            top_val_large = top_val_large.tolist()

            # FAST path: check last element of init_max_size items, if this smallest value < threshold, 
            # it means the list of top values contains a drop under the threshold somewhere.
            # check each and break when the boundary (drop) in the list is reached
            if top_val_large[-1] < threshold:
                for idx, val in zip(top_idx_large, top_val_large):
                    if val < threshold:
                        break

                    new_cluster.append(idx)
            else:
                # SLOW PATH: even the smallest of the top-selected init_max_size values is >= threshold.
                # There could well be more embeddings (beyond the initial init_max_size) w/ similarity scores >= threshold.
                # Thus we cannot rely solely on the pre-filtered list top_val_large with init_max_size items.
                # We need to check the entire original row in cos_scores for embedding i,
                # which ensures that no eligible embedding is missed.
                # Iterate over all entries (slow)
                for idx, val in enumerate(cos_scores[i].tolist()):
                    if val >= threshold:
                        new_cluster.append(idx)

            extracted_communities.append(new_cluster)

    # when the communities are returned, 
    # the ones containing more embeddings (and potentially representing more significant clusters) are prioritized.
    extracted_communities = sorted(extracted_communities, key=lambda x: len(x), reverse=True)

    # remove overlapping communities (the same embedding should not be a member of more than one community)
    unique_communities = []
    extracted_ids = set()

    for community in extracted_communities:
        add_cluster = True
        for idx in community:
            if idx in extracted_ids:
                add_cluster = False
                break

        if add_cluster:
            unique_communities.append(community)
            for idx in community:
                extracted_ids.add(idx)

    return unique_communities 

Now we’re ready to run the function. Recall that the threshold parameter determines whether an image is deemed close enough to be a neighbor. Smaller thresholds relax this criterion and generate bigger clusters with more varied images in them. min_community_size governs the minimum size of a cluster.

clusters = community_detection(img_emb, threshold=0.9, min_community_size=10)
print("Number of clusters found:", len(clusters))

which gives Number of clusters found: 147

Now let’s examine the biggest clusters by retrieving their member photos.

from IPython.display import display, HTML
import os

# output the first 10 (largest) clusters
for cluster in clusters[0:10]:
    print("\n\nCluster size:", len(cluster))
  
    html_string = """
    <div style="display: flex; flex-direction: row; gap: 10px; align-items: flex-start;">
    """

    for idx in cluster[0:5]:
        img_path = os.path.join(img_folder, img_names[idx])
        # Setting max-width to 200px and height to auto ensures that each image is scaled by width while preserving its natural aspect ratio.
        html_string += f"<img src='{img_path}' style='max-width: 200px; height: auto;'/>"

    html_string += "</div>"

    display(HTML(html_string))

threshold

noise analysis

For threshold-based community detection, it’s common to see a lot of points labeled as noise. We can compute the proportion of noise points and examine their average cosine similarity to their nearest (most similar) cluster. If many noise points have high similarity to one of the clusters (i.e. above or near the threshold), that suggests the threshold is too high and we are discarding semantically related points. Conversely, very low similarity values would indicate that these points are truly outliers.

In the following code, we collect cluster centers (using the first index from each cluster), and identify any embedding that is not assigned to any cluster. For each of these noise embeddings, we compute its cosine similarity to each cluster center and take the max (nearest cluster). We then average these max values.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

def threshold_noise(clusters):
  
    all_indices = set(range(len(img_emb)))

    # Collect all indices that are assigned to a community.
    assigned_indices = set()

    # `clusters` is the list of clusters output by community_detection (each cluster is a list of indices)
    for community in clusters:
        assigned_indices.update(community)

    # Noise points are those indices not assigned to any community.
    noise_indices = list(all_indices - assigned_indices)

    # --- Compute cluster centers ---
    # Use the first element of each community as the representative center.
    cluster_centers = []
    for community in clusters:
        cluster_centers.append(img_emb[community[0]])
    cluster_centers = np.vstack(cluster_centers)  # shape: (n_clusters, embedding_dim)

    # --- For noise analysis ---
    # Extract noise embeddings
    noise_embeddings = img_emb[noise_indices]

    # Compute the cosine similarity between each noise embedding and each cluster center
    similarity_matrix = cosine_similarity(noise_embeddings, cluster_centers)
    # For each noise point, take the maximal similarity
    max_similarities = np.max(similarity_matrix, axis=1)
    # Compute the average similarity
    avg_noise_similarity = np.mean(max_similarities)

    print("Threshold-Based Community Detection:")
    print(f"Total embeddings: {len(img_emb)}")
    print("Number of communities:", len(clusters))
    print("Number of noise points:", len(noise_indices))
    print("Average cosine similarity of noise points to their nearest cluster center:", avg_noise_similarity)

    # Visualize the distribution of max similarities:
    plt.hist(max_similarities, bins=30, alpha=0.7, edgecolor='black')
    plt.title("Distribution of Maximum Cosine Similarities (Noise Points to Nearest Cluster Center)")
    plt.xlabel("Cosine Similarity")
    plt.ylabel("Count")
    plt.show()

threshold_noise(clusters)

which gives

Total embeddings: 24996
Number of communities: 147
Number of noise points: 19372
Average cosine similarity of noise points to their nearest cluster center: 0.82987434

We can also visualize the full distribution using a histogram to better understand the spread.

threshold_noise

We can see that in order to create homogeneous and tightly-knit clusters with a very high threshold, a lot of images that are in fact quite similar to a cluster (~0.83 on average) are thrown out as noise.

When we relax the threshold to 0.8 and rerun community_detection, we get fewer clusters and fewer points thrown away, and the noise points are naturally a bit more dissimilar to their nearest cluster than before.

Number of communities: 82
Number of noise points: 13752
Average cosine similarity of noise points to their nearest cluster center: 0.7519327

Pros and cons of threshold clustering

Threshold clustering is simple to implement as it directly leverages cosine similarity scores to determine membership. However, pairwise cosine similarity is compute-intensive, so scalability is a real concern: the cost (and memory) of the full similarity matrix grows quadratically with the number of embeddings, creating a bottleneck for very large datasets.
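To make the scalability concern concrete, here is a rough back-of-the-envelope sketch (assuming the similarities are stored as a dense float32 matrix) of how the memory footprint grows with the number of embeddings:

import numpy as np

# Memory needed for a dense n x n float32 similarity matrix.
for n in [25_000, 100_000, 1_000_000]:
    gib = n * n * np.dtype(np.float32).itemsize / 1024**3
    print(f"n = {n:>9,}: ~{gib:,.1f} GiB")

At the 25k-photo scale this is still manageable (a few GiB), but it quickly becomes prohibitive as the collection grows.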

UMAP plot with all image embeddings

It is also helpful to visualize the clustering of our data. To do so, let’s again use the UMAP dimension reduction algorithm to reduce our image embeddings from our original dimension of 512 (determined by CLIP) to 2.

import umap

reducer = umap.UMAP(n_components=2, random_state=42)
embeddings_2d = reducer.fit_transform(img_emb)

The reduced embeddings are then fed to Bokeh to plot them in an interactive chart. As a bonus, we will also implement hovering so that when we hover over a dot of an embedding, the respective image and cluster number will be shown.

 
import os

import numpy as np
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.transform import factor_cmap
import glasbey

# Render Bokeh plots inline in the notebook
output_notebook()
 
# Create a cluster label array for all embeddings - initialize with -1 (for noise or unclustered)
num_embeddings = img_emb.shape[0]
cluster_labels = -1 * np.ones(num_embeddings, dtype=int)

# For each detected community (cluster), assign a unique integer label.
for label, community in enumerate(clusters):
    for idx in community:
        cluster_labels[idx] = label

# ----------------- Prepare Data for Bokeh -----------------
data = {
    "x": embeddings_2d[:, 0],
    "y": embeddings_2d[:, 1],
    # Build a file path (or URL) for each image:
    "img_url": [os.path.join(img_folder, name) for name in img_names],
    "cluster": cluster_labels.astype(str)  # Convert to string for categorical mapping
}

source = ColumnDataSource(data=data)

# Get the unique (non-noise) cluster labels as strings.
unique_labels = sorted(list(set(cluster_labels[cluster_labels != -1].astype(str))) 
                         if np.any(cluster_labels != -1) else ["-1"])
num_clusters = len(unique_labels)

# Generate Glasbey Palette for categorical data.
# Here, we generate exactly n_clusters distinct colors.
palette = glasbey.create_palette(palette_size=num_clusters)

# Create the Hover Tool 
hover = HoverTool(tooltips="""
    <div>
        <div>
           <img src="@img_url" alt="" style="width:150px; border: 2px solid #ddd;" />
        </div>
        <div>
           <span style="font-size: 12px; font-weight: bold;">@img_url</span>
        </div>
        <div>
           <span style="font-size: 10px;">Cluster: @cluster</span>
        </div>
    </div>
""")

# Create the Bokeh Plot 
p = figure(title="UMAP Projection with Threshold Clustering using Glasbey ",
           tools=[hover, "pan,wheel_zoom,reset"],
           width=800, height=600)

p.circle('x', 'y', source=source, size=5, alpha=0.6,
         color=factor_cmap('cluster', palette=palette, factors=unique_labels))

show(p)

thresholdPlot

We can then hover over a region to see what the concentration is about. Zooming in to a specific area, we can confirm that the nearby embeddings are indeed very similar images.

thresholdPlot2

Viewing the plot inline in a Jupyter notebook might feel cramped. We can output the plot in an HTML file to explore it in a bigger viewport. Then we can freely pan, zoom and hover in a browser.

from bokeh.plotting import output_file, show
from bokeh.layouts import layout

# Instead of output_notebook(), use output_file to write to an HTML file.
output_file("umap_plot.html")

# Wrap the figure in a layout with sizing_mode "stretch_both" so it expands to fill the browser window.
layout_plot = layout([p], sizing_mode="stretch_both")

# Display the plot in browser
show(layout_plot)

Evaluating the cluster quality

defining metrics

Clustering algorithms are fundamentally unsupervised learning methods. Therefore, evaluating their performance is challenging because the “correctness” of the clustering isn’t provided beforehand. Unlike synthetic data generated with sklearn’s make_blobs function, a real-world dataset for clustering usually does not come with ground-truth labels to compare against. This rules out metrics such as ARI, NMI, Homogeneity, Completeness, and V-measure. Instead, we have to evaluate our clustering in terms of how well it separates the data.

Since we are using a threshold algorithm based on cosine similarity, we will choose metrics that respect that distance notion. Such metrics include silhouette score (with cosine distance), Calinski-Harabasz, Davies-Bouldin, and direct comparisons of intra- and inter-cluster cosine similarities.

  • Silhouette Score: This is often a preferred overall measure of cluster quality. It measures for each point how similar it is to points within its cluster compared to points in other clusters. A higher average silhouette score (close to 1) indicates better intra-cluster cohesion and inter-cluster separation.
  • Calinski-Harabasz Index: This is the ratio of between-cluster dispersion to within-cluster dispersion. Higher values generally indicate better-defined clusters, although CH tends to favor results with more clusters.
  • Davies-Bouldin Index: This index computes the average “similarity” between each cluster and its most similar one, where similarity is defined as a ratio of intra-cluster to inter-cluster distances. Lower values mean more compact and distinct clusters.
  • Intra vs. Inter-Cluster Cosine Similarity: A high intra-cluster similarity (similarity between points within the same cluster) and a low inter-cluster similarity (similarity between points from different clusters) indicate robust clusters. A sketch for computing this pair follows this list.
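The first three metrics come directly from scikit-learn (see the next section). The intra- vs. inter-cluster comparison is easy to compute ourselves; below is a minimal sketch, assuming the cluster_labels array built in the UMAP plotting section above (with -1 marking noise). Note that it materializes the full pairwise similarity matrix, so it is only practical at this dataset size.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def intra_inter_similarity(embeddings, labels):
    """Mean pairwise cosine similarity within clusters vs. across clusters (noise label -1 is ignored)."""
    sims = cosine_similarity(embeddings)
    clustered = labels != -1
    same = (labels[:, None] == labels[None, :]) & clustered[:, None] & clustered[None, :]
    np.fill_diagonal(same, False)                      # drop self-similarity
    diff = (labels[:, None] != labels[None, :]) & clustered[:, None] & clustered[None, :]
    return sims[same].mean(), sims[diff].mean()

# Example usage:
# intra, inter = intra_inter_similarity(img_emb, cluster_labels)
# print(f"intra-cluster: {intra:.4f}, inter-cluster: {inter:.4f}")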

evaluation code

We will measure silhouette, Calinski-Harabasz and Davies-Bouldin scores using the scikit-learn library.

Note: When using silhouette_score with a precomputed distance matrix based on cosine similarity, ensure that the distance is properly defined as (1 - (cosine similarity)).

import numpy as np
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
from sklearn.metrics.pairwise import cosine_similarity

# Compute cosine similarity
cos_sim = cosine_similarity(img_emb)

# Convert similarity to distance (1 - cosine similarity)
distance_matrix = 1 - cos_sim

# Clip any slightly negative values due to floating point precision issues
distance_matrix = np.clip(distance_matrix, 0, None)

# Now compute the metrics  
ch_score = calinski_harabasz_score(img_emb, cluster_labels)
db_score = davies_bouldin_score(img_emb, cluster_labels)
sil_score = silhouette_score(distance_matrix, cluster_labels, metric="precomputed")
print("Silhouette Score:", sil_score)
print("Calinski-Harabasz Score:", ch_score)
print("Davies-Bouldin Score:", db_score)

which outputs

Silhouette Score: -0.15987909
Calinski-Harabasz Score: 41.73438865857112
Davies-Bouldin Score: 2.226699601194434

Let’s break down what these score values mean:

  1. Silhouette Score: -0.15987909
    • The silhouette score ranges from –1 to 1. A negative value means that, on average, points are closer to a cluster other than the one to which they’ve been assigned. –0.16 suggests that many points might be mis-assigned or that clusters overlap significantly.
  2. Calinski-Harabasz Score: 41.73438865857112
    • Even though a higher CH score indicates well-separated clusters, its absolute values depend strongly on the data’s scale, dimensionality, and structure; 41.73 is relatively low compared to what we might expect for clearly separated clusters in a dataset of this size.
  3. Davies-Bouldin Score: 2.226699601194434
    • Lower values are better, with values closer to 0 indicating distinct clusters. A DB score of about 2.23 tends to indicate that the clusters are not very well separated.

Each metric, although computed on different scales, indicates a consistent trend:

  • The negative silhouette score indicates that many points seem closer to clusters other than their own (i.e., the overall clustering structure is weak or overlapping).
  • The moderate-to-low CH score suggests that the separation between clusters is not very pronounced relative to the cluster compactness.
  • The DB index value above 2 also implies that, on average, clusters are not clearly distinct (lower values would be ideal).

We can thus conclude that the clustering from threshold-based community detection isn’t capturing well-separated or clearly defined groups. This might be due to the threshold being set too high at 0.9.

Parameter tuning

Choosing the right similarity threshold is critical as the quality of resulting clusters is very sensitive to this value.

Moreover, the dual-path mechanism (fast vs. slow path) is designed to balance performance and completeness, but it also adds complexity in tuning parameters like init_max_size.

We can experiment with multiple thresholds and use the metrics above to see which setting maximizes intra-cluster similarity and minimizes inter-cluster similarity. The following table shows the results of rerunning the community_detection function with different threshold values and generating the three metrics for each run.

Threshold | # Clusters | Silhouette Score | Calinski–Harabasz Score | Davies–Bouldin Score
0.8       | 82         | +0.012           | 92.01                   | 9.10
0.85      | 154        | –0.2368          | 32.48                   | 2.10
0.9       | 147        | –0.1599          | 41.73                   | 2.23

  • Threshold 0.8 (Lenient Criteria): Fewer and larger clusters are formed. The high CH score indicates strong global separation, but the high DB score shows that clusters are internally heterogeneous. The near-zero silhouette suggests ambiguous membership (points are almost equally close to other clusters).
  • Threshold 0.85 (Intermediate Criteria): A very high number of small, tight clusters is created. The low DB score indicates good internal compactness, but the dramatically lower CH score and negative silhouette (–0.2368) suggest that clusters are too fragmented and many points are mis-assigned relative to the natural grouping of the data.
  • Threshold 0.9 (Strict Criteria): The clusters remain tight (as seen by the low DB score) and slightly fewer are produced compared to 0.85, with a modest improvement in CH and silhouette scores. However, the overall silhouette remains negative, indicating that many points are still closer to clusters other than their own, though not as severely as at 0.85.
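The sweep behind this table can be scripted. Below is a minimal sketch of the loop (a hypothetical helper that reuses community_detection, the distance_matrix computed in the evaluation section above, and the scikit-learn metrics imported there):

def sweep_thresholds(embeddings, thresholds, min_community_size=10):
    for t in thresholds:
        comms = community_detection(embeddings, threshold=t, min_community_size=min_community_size)

        # Build a label array; unassigned points keep the noise label -1.
        labels = -1 * np.ones(len(embeddings), dtype=int)
        for label, community in enumerate(comms):
            labels[community] = label

        sil = silhouette_score(distance_matrix, labels, metric="precomputed")
        ch = calinski_harabasz_score(embeddings, labels)
        db = davies_bouldin_score(embeddings, labels)
        print(f"threshold={t:.2f}: {len(comms)} clusters, sil={sil:.4f}, CH={ch:.2f}, DB={db:.2f}")

sweep_thresholds(img_emb, [0.8, 0.85, 0.9])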

Since the results are not very satisfactory, let’s explore alternative clustering methods like DBSCAN and HDBSCAN, which adjust to local densities. Let’s see if they might perform better.

Density-based clustering

Density-based clustering methods locate regions of high density that are separated from one another by regions of low density.

DBSCAN

DBSCAN is a robust general-purpose clustering algorithm that is excellent at discovering arbitrary-shaped clusters. It groups points that have a sufficiently large “dense” neighborhood. Instead of an absolute similarity cutoff per connection, it requires that a point has at least a certain number of neighbors (min_samples) within a given distance (eps).

This StatQuest video explains clearly the intuition of the DBSCAN algorithm.

DBSCAN requires two parameters: an epsilon (ε - eps) distance threshold to decide whether one point is in the neighborhood of another, and a minimum number of points (min_samples) to form a dense cluster. Any point with enough neighbors (within ε) becomes a core point, and clusters are formed by connecting these dense regions. Points that do not belong to any dense region are treated as noise (labeled -1).

Note that if our similarity measure is cosine similarity, we will transform it into a distance measure (e.g., distance = 1 − cosine similarity) for the eps parameter. For example, if our desired similarity threshold is 0.9, we can set eps = 1 - 0.9 = 0.1. This tells DBSCAN that points within a cosine distance of 0.1 (i.e. with similarity at least 0.9) are considered neighbors.

import numpy as np
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.1, min_samples=10, metric='cosine')
db_cluster_labels = dbscan.fit_predict(img_emb)

We can then check how many clusters are generated with len(set(db_cluster_labels)) - 1, where the -1 accounts for the noise label. We get 26 clusters back.
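To inspect how the points are distributed across these clusters, a quick tally with collections.Counter works well:

from collections import Counter

cluster_sizes = Counter(db_cluster_labels)
n_noise = cluster_sizes.pop(-1, 0)                     # -1 is DBSCAN's noise label
print("Clusters:", len(cluster_sizes), "| noise points:", n_noise)
print("Largest clusters:", cluster_sizes.most_common(5))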

We can visualize the largest clusters as follows:

dbscan

noise analysis

We will run noise analysis on our clusters to see the distribution of noise points. DBSCAN labels “noise” points with -1. We can perform a similar analysis by first computing a centroid (the mean of all embeddings) for each detected cluster. Similar to threshold noise analysis, we calculate each noise point’s cosine similarity to all the computed centroids and pick the highest similarity. The average of these maximum similarities provides insight into whether many noise points lie near an existing cluster. A high value indicates that eps or min_samples might be too restrictive.

# --- Compute cluster centroids for DBSCAN clusters ---
unique_labels = np.unique(db_cluster_labels)
cluster_centers_db = []
for label in unique_labels:
    if label == -1:
        continue
    # Indices of points in this cluster
    indices = np.where(db_cluster_labels == label)[0]
    # Compute the centroid by taking the mean vector
    centroid = np.mean(img_emb[indices], axis=0)
    cluster_centers_db.append(centroid)
cluster_centers_db = np.vstack(cluster_centers_db)

# Noise points have label -1
noise_indices_db = np.where(db_cluster_labels == -1)[0]
noise_embeddings_db = img_emb[noise_indices_db]

# Compute cosine similarity between each noise embedding and each cluster centroid
similarity_matrix_db = cosine_similarity(noise_embeddings_db, cluster_centers_db)
# For each noise point, take the maximal similarity
max_similarities_db = np.max(similarity_matrix_db, axis=1)
# Compute the average similarity
avg_noise_similarity_db = np.mean(max_similarities_db)

print("DBSCAN Clustering:")
print(f"Total points: {len(img_emb)}")
print("Number of clusters:", len(cluster_centers_db))
print("Number of noise points:", len(noise_indices_db))
print("Average cosine similarity of noise points to their nearest cluster centroid:", avg_noise_similarity_db)

# Visualize the distribution of max similarities:
plt.hist(max_similarities_db, bins=30, alpha=0.7, edgecolor='black')
plt.title("Distribution of Cosine Similarities\n(Noise points to nearest DBSCAN cluster centroid)")
plt.xlabel("Cosine Similarity")
plt.ylabel("Count")
plt.show()

which gives the following. We can see that fewer points are thrown away as noise, and the average similarity is also lower than with the threshold algorithm.

Number of clusters: 26
Number of noise points: 10777
Average cosine similarity of noise points to their nearest cluster centroid: 0.78325516

dbscan_noise

comparing DBSCAN with threshold clustering

Notice that the threshold method finds 147 clusters, with the biggest clusters having 482, 401 and 360 images. In comparison, DBSCAN generates only 26 clusters, with the biggest cluster containing 12,818 images (mostly nature images, but not particularly similar to one another) and the 2nd cluster containing 416 images. We can visualize the first and last 3 images in the biggest clusters from both algorithms.

threshold_firstlast First and last 3 images in the biggest cluster formed by threshold algorithm

dbscan_firstlast First and last 3 images in the biggest cluster formed by DBSCAN algorithm

Recall that for threshold clustering, the first element in each list is the central point in the community. We can see that even the farthest images from this center are semantically very similar to the center (snowy mountains shrouded in mist). In contrast, the images in DBSCAN’s biggest cluster are only loosely united by a theme: nature. Since both use the same cosine-similarity threshold and the same minimum sample size, why is there such a big discrepancy? And why does the first DBSCAN cluster have such an overwhelming number of images?

connectivity

Threshold clustering tends to create localized clusters where every member is directly similar to a central point (via the pairwise similarity threshold check), which leads to multiple smaller, relatively homogeneous clusters. We can gain the intuition from the screenshot above: even the last image in the cluster (the least similar one) passes the threshold of having a cosine similarity of 0.9 or above with the first image.

DBSCAN, on the other hand, uses chain connectivity to detect connected regions (dense areas) of points: if point A is close enough to point B, and point B is close enough to point C, then even if A and C are not mutually similar, they will be chained into the same cluster. A point does not need to be directly similar to every other point in the cluster as long as there exists a “path” of neighboring points where each link respects eps. With a prevalent theme (a wide variety of nature images) in the Unsplash Lite dataset, the slight variations among the photos connect them to one another in a continuous chain: one image barely meets the threshold with its neighbor, that neighbor connects to another, and so on. This transitive connectivity forms larger, sometimes more heterogeneous clusters, which is why DBSCAN produced a cluster with 12,818 images. Thus we observe that the 1st photo in the DBSCAN cluster (an aerial view of a sunny beach) is quite different from flowers or waterfalls, even though they are all nature-themed. It isn’t that all these images are extremely similar pairwise; rather, they are linked together through a series of moderate similarity relationships.
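A tiny toy example (unrelated to the actual image embeddings) makes the chaining behavior concrete: three unit vectors where each adjacent pair is above a 0.9 similarity cutoff, but the two endpoints are not.

import numpy as np
from sklearn.cluster import DBSCAN

# Unit vectors at 0, 25 and 50 degrees: adjacent pairs are ~0.91 similar,
# but the two endpoints are only ~0.64 similar.
angles = np.deg2rad([0, 25, 50])
points = np.stack([np.cos(angles), np.sin(angles)], axis=1)
print(np.round(points @ points.T, 3))   # pairwise cosine similarities

# With eps = 0.1 (i.e. similarity >= 0.9), DBSCAN chains A-B-C into one cluster
# even though A and C are well below the cutoff. A threshold community seeded
# at A would only include A and B.
labels = DBSCAN(eps=0.1, min_samples=2, metric='cosine').fit_predict(points)
print(labels)   # [0 0 0] -> a single chained cluster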

centroid

In threshold-based approach, we explicitly designate a “central point” for each cluster. Specifically, when the algorithm starts gathering points from a given embedding’s row in the similarity matrix, that embedding serves as the “seed” or central point for that cluster.

DBSCAN does not compute a single, overarching centroid for each cluster. Instead, it defines core points: points that have at least the minimum number of neighbors within eps. In DBSCAN, there are three types of points:

  • Core Points: These are the points that lie in regions of high density.
  • Border Points: These are points that are close enough to a core point to be included in a cluster, even though they themselves might not have enough close neighbors.
  • Outliers: Points that do not belong to any cluster.

Although DBSCAN identifies core points, it does not select a single core point as the “center.” Multiple core points exist in a dense region, and the algorithm simply labels all points belonging to the same connected dense region as part of one cluster.
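scikit-learn exposes the core points of a fitted DBSCAN model through core_sample_indices_, so we can count the three point types directly. A small sketch, reusing the dbscan object and db_cluster_labels from the run above:

import numpy as np

core_mask = np.zeros(len(db_cluster_labels), dtype=bool)
core_mask[dbscan.core_sample_indices_] = True           # core points found by DBSCAN

n_core = int(core_mask.sum())
n_noise = int(np.sum(db_cluster_labels == -1))
n_border = len(db_cluster_labels) - n_core - n_noise    # in a cluster, but not core
print(f"Core points: {n_core}, border points: {n_border}, noise points: {n_noise}")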

number of clusters

The threshold method examines similarity on a more localized, one-to-one basis. It prevents loosely connected points from merging into one enormous group, ensuring clusters are internally very similar.

In contrast, DBSCAN’s clustering is global in the sense that once a chain of density-connected points is initiated, it doesn’t care if some of the points in the chain are only marginally similar to each other, as long as they meet the connectivity condition. This global propagation is why we end up with only 26 clusters overall, with one massive, less cohesive cluster (12,818 images) encompassing a wide range of images (predominantly nature images) that are connected through these density chains.

what does it mean for our image embeddings

Images processed by models like CLIP ViT-B-32 tend to form dense regions in the embedding space when there is a common theme (such as many nature photos). The threshold method produces a balanced distribution of cluster sizes, each tight around a central, very similar core, giving us semantically homogeneous groups even if they are smaller in size. DBSCAN can sometimes produce overly large clusters due to chaining if the density across the space doesn’t have clear boundaries.

evaluating the DBSCAN clusters

Running the same evaluation metrics

ch_score = calinski_harabasz_score(img_emb, db_cluster_labels)
db_score = davies_bouldin_score(img_emb, db_cluster_labels)
sil_score = silhouette_score(distance_matrix, db_cluster_labels, metric="precomputed")
print("Silhouette Score:", sil_score)
print("Calinski-Harabasz Score:", ch_score)
print("Davies-Bouldin Score:", db_score)

gives us the following result.

Silhouette Score: -0.08740608
Calinski-Harabasz Score: 69.0999049076129
Davies-Bouldin Score: 2.184908976650931

Parameter tuning

The choice of eps and min_samples greatly determines the clusters found. A too-small eps might fragment clusters, while a too-large eps might merge distinct clusters.

To quickly loop through different eps values and see how they impact the scores, we can implement a grid search as below:

import pandas as pd

def compute_metrics(embeddings, labels, score_metric='cosine'):
    unique_clusters = np.unique(labels)
    n_samples = embeddings.shape[0]
    if (len(unique_clusters) <= 1) or (len(unique_clusters) == n_samples):
        return np.nan, np.nan, np.nan
    try:
        sil = silhouette_score(embeddings, labels, metric=score_metric)
    except Exception:
        sil = np.nan
    try:
        ch = calinski_harabasz_score(embeddings, labels)
    except Exception:
        ch = np.nan
    try:
        db = davies_bouldin_score(embeddings, labels)
    except Exception:
        db = np.nan
    return sil, ch, db

def grid_search_eps(embeddings, eps, min_samples=10, metric='cosine'):
    results = []
    for e in eps:
        dbscan = DBSCAN(eps=e, min_samples=min_samples, metric=metric)
        db_cluster_labels = dbscan.fit_predict(embeddings)
        sil, ch, db = compute_metrics(embeddings, db_cluster_labels, score_metric=metric)
        results.append({
            'eps': e,
            'n_clusters': len(np.unique(db_cluster_labels)),
            'silhouette': sil,
            'CH': ch,
            'DB': db
        })
        print(f"eps: {e:.2f} -> {len(np.unique(db_cluster_labels))} clusters, Silhouette: {sil:.4f}, CH: {ch:.2f}, DB: {db:.2f}")
    return pd.DataFrame(results)

eps = np.linspace(0.1, 0.3, num=21)
print("=== GRID SEARCH RESULTS ===")
grid_results = grid_search_eps(img_emb, eps, min_samples=10, metric='cosine')
print(grid_results)

This automates the process and gives us the following results.

=== GRID SEARCH RESULTS ===
eps: 0.10 -> 27 clusters, Silhouette: -0.0874, CH: 69.10, DB: 2.18
eps: 0.11 -> 27 clusters, Silhouette: -0.0536, CH: 59.99, DB: 2.26
eps: 0.12 -> 22 clusters, Silhouette: -0.0643, CH: 41.12, DB: 2.34
eps: 0.13 -> 12 clusters, Silhouette: -0.0414, CH: 53.69, DB: 2.68
eps: 0.14 -> 7 clusters, Silhouette: -0.0061, CH: 64.67, DB: 3.08
eps: 0.15 -> 5 clusters, Silhouette: 0.1071, CH: 84.96, DB: 3.53
eps: 0.16 -> 3 clusters, Silhouette: 0.1680, CH: 143.37, DB: 4.56
eps: 0.17 -> 3 clusters, Silhouette: 0.1844, CH: 129.72, DB: 4.32
eps: 0.18 -> 3 clusters, Silhouette: 0.1973, CH: 117.91, DB: 4.14
eps: 0.19 -> 3 clusters, Silhouette: 0.2105, CH: 104.77, DB: 3.97
eps: 0.20 -> 2 clusters, Silhouette: 0.2376, CH: 143.06, DB: 4.72
eps: 0.21 -> 2 clusters, Silhouette: 0.2533, CH: 122.00, DB: 4.47
eps: 0.22 -> 2 clusters, Silhouette: 0.2674, CH: 99.57, DB: 4.30
eps: 0.23 -> 2 clusters, Silhouette: 0.2797, CH: 85.23, DB: 4.13
eps: 0.24 -> 2 clusters, Silhouette: 0.2875, CH: 71.11, DB: 4.03
eps: 0.25 -> 2 clusters, Silhouette: 0.2976, CH: 58.36, DB: 3.89
eps: 0.26 -> 2 clusters, Silhouette: 0.3082, CH: 47.87, DB: 3.78
eps: 0.27 -> 2 clusters, Silhouette: 0.3190, CH: 38.07, DB: 3.64
eps: 0.28 -> 2 clusters, Silhouette: 0.3249, CH: 30.76, DB: 3.55
eps: 0.29 -> 2 clusters, Silhouette: 0.3362, CH: 23.39, DB: 3.42
eps: 0.30 -> 2 clusters, Silhouette: 0.3473, CH: 18.19, DB: 3.28

We can see that around eps = 0.2 the metrics begin to balance out among Silhouette, CH and DB. Specifically, the Silhouette score increases steadily from a negative value at eps = 0.10 to a maximum of 0.3473 at eps = 0.30, suggesting that as eps increases, the clusters become more cohesive and better separated. The DB score rises and then drops across the eps values: it climbs from 2.18 at eps = 0.10 to a peak of 4.72 at eps = 0.20, then falls back to 3.28 at eps = 0.30, the lowest value among the higher eps settings. Similarly, the CH score has a nonlinear response: it peaks around eps = 0.16 (143.37) and eps = 0.20 (143.06), then steadily decreases to 18.19 at eps = 0.30. Since CH tends to reward more clusters, when the algorithm merges many points into just 2 clusters at higher eps values, CH naturally drops.

Therefore, if we want to maximize overall cohesion and separation (per the highest silhouette score and the lowest Davies-Bouldin index), and if we are ok with just 2 clusters, eps = 0.30 is the top candidate.

We can also rerun the grid search and change min_samples to 20.

  • min_samples = 10
    • As eps increases from 0.10 to 0.30, the silhouette score steadily increases from a negative (poor clustering) to a maximum of 0.3473 at eps = 0.30.
    • The number of clusters drops from 27 down to 2 clusters at higher eps values (starting at eps ≈ 0.20).
    • The Calinski-Harabasz (CH) index, which tends to favor more clusters, is high at intermediate eps values (e.g., 143.37 at eps = 0.16) but falls drastically as eps increases (CH = 18.19 at eps = 0.30).
    • The Davies–Bouldin (DB) index, where lower is better, worsens (increases) with eps up to around 0.16 (DB = 4.56) but then slightly improves to 3.28 at eps = 0.30.
  • min_samples = 20
    • A similar trend is observed where the silhouette improves with increased eps, reaching 0.3467 at eps = 0.30.
    • With min_samples = 20, the early eps values (0.10–0.14) yield a higher number of clusters (16–7 clusters), but then a sharp drop occurs: eps = 0.15 already yields just 2 clusters.
    • DB and CH follow trends similar to the min_samples = 10 group, with slightly different absolute values.
  • Silhouette Score:
    • For both min_samples groups, the silhouette steadily increases and peaks at eps = 0.30 (≈0.347) which suggests that points in clusters are on average well separated.
  • Calinski-Harabasz (CH) Index:
    • Higher values generally imply better separation, but when the number of clusters is very low (e.g., 2 clusters), CH can drop even if the within-cluster cohesion is good.
    • Thus, a drop in CH at eps = 0.30 isn’t necessarily a problem if the silhouette is maximized.
  • Davies–Bouldin (DB) Index:
    • At eps = 0.30, DB is 3.28 (min_samples = 10) or 3.3 (min_samples = 20), which are acceptable relative to higher DB values at intermediate eps.
  • Number of Clusters:
    • The grid search shows a trade-off: with lower eps values we get more clusters but with negative or low silhouette scores.
    • With higher eps (eps >= 0.20), we end up with only 2 clusters.

Conclusion

Based on the results, we should pick the following values for our parameters: eps = 0.30 and min_samples = 10, yielding 2 clusters with Silhouette: 0.3473, CH: 18.19 and DB: 3.28.

Alternatively, if we want finer segmentation with 3 clusters, we can pick eps between 0.16 and 0.19 with min_samples = 10, at the cost of a lower silhouette (≈0.168–0.2105). This configuration shows a moderate silhouette improvement over the very low eps values while retaining an extra cluster.
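Adopting either setting is then just a matter of refitting DBSCAN with the chosen parameters, e.g. for the two-cluster candidate:

best_dbscan = DBSCAN(eps=0.30, min_samples=10, metric='cosine')
best_db_labels = best_dbscan.fit_predict(img_emb)

n_clusters = len(set(best_db_labels)) - (1 if -1 in best_db_labels else 0)
print("Clusters (excluding noise):", n_clusters)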

HDBSCAN

Now let’s try HDBSCAN, which extends DBSCAN to find clusters of varying densities. The H in HDBSCAN stands for hierarchical: it adds a hierarchy-building step that merges clusters that would otherwise be too small.

HDBSCAN first builds a hierarchy of clusters based on density differences, then condenses the hierarchy by evaluating cluster stability, and finally extracts meaningful clusters while labelling noise. When working with embeddings, we often encounter regions in the vector space that have very different densities. HDBSCAN excels here because it doesn’t force a one-size-fits-all distance threshold. Instead, it lets clusters form naturally where the data is dense, while keeping outliers separate.

It also removes the need for a fixed global similarity/distance threshold like eps, instead using a hierarchical approach to let the data “speak for itself.” This often leads to more balanced and nuanced clustering results.

The hyperparameters are

  • metric: different distance metrics to qualify a neighbor.
  • min_cluster_size: Controls the smallest size for a cluster.
  • min_samples: Helps determine core points (points that have enough neighbors). Lower min_samples can help detect smaller or tighter clusters.

import hdbscan
clusterer = hdbscan.HDBSCAN(
    metric='euclidean',
    min_cluster_size=10,   
    min_samples=3,  
    core_dist_n_jobs=-1    # Use all available CPU cores
)
clusterer.fit(img_emb)

print("Found clusters:", set(clusterer.labels_))

With this simple run, we get Found clusters: {0, 1, 2, 3, -1}, indicating that 4 clusters are found (plus the -1 noise label).

We can visualize the largest clusters as follows:

hdbscan

We can also visualize the first and last 3 images in the biggest cluster. Again we see that the images are not very cohesive, but span quite a broad range of semantic meaning.

hdbscan_firstlast First and last 3 images in the biggest cluster formed by HDBSCAN algorithm

noise analysis

Rerunning the same noise analytics snippet from DBSCAN gives us this result for our HDBSCAN clusters

HDBSCAN Clustering:
Number of clusters: 4
Number of noise points: 2149
Average cosine similarity of noise points to their nearest cluster centroid: 0.7218417

As there are even fewer (and bigger) clusters than with DBSCAN, the number of noise points is naturally further reduced.

Parameter tuning

Let’s implement a grid search again for HDBSCAN, looping through various values for the metric, min_cluster_size, min_samples and cluster_selection_method parameters to see which combination gives the best performance.

Specifically, our evaluation suite spans two similarity metrics (cosine and euclidean), three values for min_cluster_size (10, 20, and 30), three values for min_samples (3, 5, and 10), and two cluster selection methods (“eom” and “leaf”).

We will reuse the compute_metrics defined in the previous section to calculate the metrics.

import itertools

def grid_search_hdbscan(embeddings, param_grid, D_cosine=None, X_cosine=None):
    """
    Grid search over HDBSCAN parameters while reusing a precomputed cosine distance
    matrix (D_cosine) and normalized embeddings (X_cosine) if metric == 'cosine'.
    If using 'euclidean', the original embeddings are used.
    """
    results = []
    keys = list(param_grid.keys())
    for values in itertools.product(*param_grid.values()):
        params = dict(zip(keys, values))
        metric = params['metric']
  
        if metric == 'cosine':
            # Use precomputed cosine distances and normalized embeddings for scoring.
            X_input = D_cosine          # precomputed distance matrix
            run_metric = 'precomputed'  # tell HDBSCAN the distances are precomputed
            score_metric = 'cosine'
            X_for_scoring = X_cosine      # normalized data for silhouette, etc.
        else:
            X_input = embeddings
            run_metric = metric
            score_metric = 'euclidean'
            X_for_scoring = embeddings

        try:
            clusterer = hdbscan.HDBSCAN(metric=run_metric,
                                        min_cluster_size=params['min_cluster_size'],
                                        min_samples=params['min_samples'],
                                        cluster_selection_method=params['cluster_selection_method'],
                                        core_dist_n_jobs=-1)
            clusterer.fit(X_input)
            labels = clusterer.labels_
            sil, ch, db = compute_metrics(X_for_scoring, labels, score_metric=score_metric)
            n_clusters = len(np.unique(labels))
            result = {
                'metric': metric,
                'min_cluster_size': params['min_cluster_size'],
                'min_samples': params['min_samples'],
                'cluster_selection_method': params['cluster_selection_method'],
                'n_clusters': n_clusters,
                'silhouette': sil,
                'CH': ch,
                'DB': db,
            }
            results.append(result)
            print(f"metric: {metric}, min_cluster_size: {params['min_cluster_size']}, "
                  f"min_samples: {params['min_samples']}, method: {params['cluster_selection_method']}, "
                  f"clusters: {n_clusters}, sil: {sil:.3f}, CH: {ch:.1f}, DB: {db:.3f}")
        except Exception as e:
            print(f"Error for parameters {params}: {e}")
            results.append({
                'metric': metric,
                'min_cluster_size': params['min_cluster_size'],
                'min_samples': params['min_samples'],
                'cluster_selection_method': params['cluster_selection_method'],
                'n_clusters': np.nan,
                'silhouette': np.nan,
                'CH': np.nan,
                'DB': np.nan,
            })
    return pd.DataFrame(results)

# --- Create Global Precomputed Data for Cosine Metric ---
# For cosine, normalize the embeddings.
X_cosine = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
D_cosine = (1 - cosine_similarity(X_cosine)).astype(np.float64)

# --- Define the parameter grid ---
param_grid = {
    'metric': ['euclidean', 'cosine'],
    'min_cluster_size': [10, 20, 30],
    'min_samples': [3, 5, 10],
    'cluster_selection_method': ['eom', 'leaf']
}

# --- Run the grid search ---
print("=== Grid Search for HDBSCAN Parameters ===")
results_df = grid_search_hdbscan(img_emb, param_grid, D_cosine=D_cosine, X_cosine=X_cosine)

From our run, we notice the following:

  • Metric Choice:
    • Cosine clearly outperforms euclidean here, as the best silhouette scores with the cosine metric are around 0.182–0.185 versus 0.03–0.09 with euclidean.
  • Cluster Selection Method:
    • The eom method produces a consistent and interpretable number of clusters (3 clusters) with positive silhouette scores.
    • The leaf method yields an excessive number of clusters and negative silhouette scores.
  • Parameter Values (min_cluster_size and min_samples):
    • For the cosine metric with eom:
      • min_cluster_size = 10 or 20 combined with min_samples = 3, 5, or 10 all yield 3 clusters with silhouette scores between 0.182 and 0.185.
      • For min_cluster_size = 30, only the configuration with min_samples = 3 produces acceptable performance (3 clusters, 0.185 silhouette); the other settings degrade the quality.

Conclusion

Based on the results, we should pick the following:

  • metric: cosine
  • cluster_selection_method: eom
  • min_cluster_size: 10 (or 20, as the performance is similar)
  • min_samples: 3

This combination gives us 3 clusters with the following scores:

silhouette ≈ 0.185
CH ≈ 112.1
DB ≈ 4.719

It offers the best balance between a high silhouette score (indicating well-separated and cohesive clusters) and a manageable number of clusters. In contrast, while the euclidean metric and the leaf method were explored, they yield lower silhouette scores and a more fragmented cluster structure.
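To adopt this configuration, we can refit HDBSCAN once more with the winning parameters, reusing the precomputed cosine-distance matrix D_cosine from the grid search above (a sketch):

best_hdb = hdbscan.HDBSCAN(metric='precomputed',
                           min_cluster_size=10,
                           min_samples=3,
                           cluster_selection_method='eom',
                           core_dist_n_jobs=-1)
best_hdb_labels = best_hdb.fit_predict(D_cosine)

print("Found clusters:", set(best_hdb_labels))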