In this blog post, I am going to show you how to perform image search against the Unsplash Lite dataset with 25k photos, using both text and image queries. To better comprehend our search results, I will then visualize the query and its matches using UMAP dimension reduction and plot it with Bokeh.

By the end of this post, we will have generated a plot like the one below, showing both our query image and the photos in the Unsplash Lite dataset that are closest to it in semantic meaning. In this plot, our query image of Tokyo Tower is matched to Tokyo Skytree, Himeji Castle, the Eiffel Tower, a Japanese city at night, Tokyo street scenes, an aerial city view, and so on, in order of proximity. The closer an image is to the red box, the better the match. We will also examine the semantic understanding that the CLIP visual transformer brings to these results.

[Figure: UMAP plot of the Tokyo Tower query image (red box) and its closest matches in the Unsplash Lite dataset]

Let’s go!

Preparing materials

retrieve all metadata of the Unsplash Lite dataset

First, we need to download the Lite dataset. Note that it does not contain the actual images, only metadata such as the photo ID, the image URL, the photographer's name, an AI-generated description, geolocation, EXIF data, keywords used for the image, colors, etc.
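For completeness, here is a minimal download sketch, assuming the public download URL below is still the current one for the Lite archive:

import io
import zipfile

import requests

# Download the Lite archive and extract the tab-separated data files into the current directory
resp = requests.get("https://unsplash.com/data/lite/latest", timeout=300)
resp.raise_for_status()
zipfile.ZipFile(io.BytesIO(resp.content)).extractall(".")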

After downloading and extracting the files, we can load the data into dataframes.

import glob
import os

import numpy as np
import pandas as pd

path = '.'   # directory containing the extracted dataset files
documents = ['photos', 'keywords', 'collections', 'conversions', 'colors']
datasets = {}

for doc in documents:
    # Each document type may be split across several tab-separated files
    files = glob.glob(os.path.join(path, doc + ".csv000*"))
    print(doc, files)
    subsets = []
    for filename in files:
        df = pd.read_csv(filename, sep='\t', header=0)
        subsets.append(df)

    if subsets:
        datasets[doc] = pd.concat(subsets, axis=0, ignore_index=True)
    else:
        # No files found for this document type: fall back to an empty dataframe
        datasets[doc] = pd.DataFrame()

We can then explore the dataset with datasets['photos'].head(), which outputs the following:

[Figure: first few rows of the photos dataframe]

retrieve precomputed embeddings from SentenceTransformer

Since we now have the URL of each photo in the dataframe, we could use an embedding model to generate a numerical representation of each image. A good visual transformer captures the semantic meaning of most of the image's elements in its embedding.
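For reference, computing these embeddings ourselves would look roughly like the sketch below; the embed_photo helper is just for illustration and is not used in the rest of the post:

from io import BytesIO

import requests
from PIL import Image
from sentence_transformers import SentenceTransformer

clip_model = SentenceTransformer("clip-ViT-B-32")

def embed_photo(url):
    # Download one photo and encode it with CLIP's vision encoder
    # into a 512-dimensional vector.
    img = Image.open(BytesIO(requests.get(url, timeout=30).content))
    return clip_model.encode(img)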

However, to save compute and time, we will instead use the precomputed image embeddings provided in the SentenceTransformers documentation.

import requests

url = "http://sbert.net/datasets/unsplash-25k-photos-embeddings.pkl"
response = requests.get(url)
response.raise_for_status()   # fail early if the download did not succeed

with open("unsplash-25k-photos-embeddings.pkl", "wb") as file:
    file.write(response.content)

We then deserialize the file to get both the image names and embeddings.

import pickle
emb_filename="unsplash-25k-photos-embeddings.pkl" 
with open(emb_filename, 'rb') as fIn: 
    img_names, img_emb = pickle.load(fIn)
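A quick peek at what we just loaded (the exact count is an assumption; roughly 25k names and 512-dimensional CLIP vectors are expected):

print(len(img_names))   # number of image file names such as 'xyz.jpg' (about 25k)
print(img_emb.shape)    # (number of images, 512) -- CLIP ViT-B/32 produces 512-dim vectors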

Save the embeddings to a Chroma vector database

Next, we will use Chroma vector database to store the embeddings, together with their respective metadata.

I have been using Chroma for 1.5 years and have found it lightweight and fast. I also like its local implementation, which uses SQLite to store its data, making it very portable. Its API is straightforward and can be incorporated into LangChain or other frameworks if you intend to extend its functionality.

Note that since the precomputed embeddings were generated with OpenAI's visual transformer model clip-ViT-B-32, we will instantiate our Chroma collection with the same embedding model. This step is crucial for querying the collection later.

When we supply a query, Chroma first embeds it with the collection's embedding function, then uses a similarity function to compare the query embedding with the embeddings already stored in the collection. We therefore need to ensure that the query and the stored embeddings are produced by the same model for the comparison to make sense.

create a new Chroma collection

import chromadb
from chromadb.utils import embedding_functions

emb_fn = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="clip-ViT-B-32")

client = chromadb.PersistentClient(path='.')
collection = client.create_collection(name="unsplash25k", embedding_function=emb_fn)

extract embedding and metadata from prepared materials

We will retrieve these columns from the dataset photos dataframe:

  • photo_id
  • photo_image_url (a URL pointing to the actual image)
  • ai_description (use as image captions for plotting)

and, as a reminder, we will use the following from the deserialized pickle file:

  • img_names (image names are in the format xyz.jpg; we will strip the .jpg extension and match the result against photo_id above)
  • img_emb: precomputed embeddings

Now we will merge both so that each embedding is associated with its unique metadata.

# Build a mapping from photo_id to metadata with renamed keys ("url" and "description")
metadata_map = {}
for _, row in datasets['photos'].iterrows():
    # Ensure photo_id is a string; it should match img_names without the extension
    photo_id = str(row["photo_id"])
    # Fall back to an empty string if a metadata field is missing or NaN
    url = row["photo_image_url"] if pd.notna(row["photo_image_url"]) else ""
    description = row["ai_description"] if pd.notna(row["ai_description"]) else ""
    metadata_map[photo_id] = {"url": url, "description": description}

# Prepare lists to hold the ids, embeddings, and metadata in the right order
ids = []
embeddings = []
metadatas = []

for name, emb in zip(img_names, img_emb):
    # Remove the file extension (.jpg) to retrieve the photo_id
    photo_id = name.rsplit(".", 1)[0]
    ids.append(photo_id)
    embeddings.append(emb)
  
    # Retrieve metadata using the photo_id; assign defaults if it doesn't exist
    meta = metadata_map.get(photo_id, {"url": "", "description": ""})
    # Ensure neither 'url' nor 'description' is None
    url = meta.get("url") if meta.get("url") is not None else ""
    description = meta.get("description") if meta.get("description") is not None else ""
    metadatas.append({"url": url, "description": description})

add all to a Chroma collection

Chroma only supports a maximum batch of 5461 items in each update. Since we have 25k records to add, we need to do that in batches.

MAX_BATCH_SIZE = 5461
total_items = len(ids)

for i in range(0, total_items, MAX_BATCH_SIZE):
    batch_ids = ids[i:i + MAX_BATCH_SIZE]
    # Slice the list built above so that ids, embeddings and metadata stay aligned
    batch_embeddings = embeddings[i:i + MAX_BATCH_SIZE]
    batch_metadata = metadatas[i:i + MAX_BATCH_SIZE]

    collection.add(
        ids=batch_ids,
        embeddings=batch_embeddings,
        metadatas=batch_metadata
    )
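As a quick sanity check of the earlier point that the query and the stored vectors must go through the same embedding model, we can embed a query ourselves with emb_fn and compare against letting Chroma embed it via query_texts. Both paths call the same function, so the neighbours should be identical; this is a minimal sketch assuming the collection has just been populated as above.

manual_emb = emb_fn(["beautiful mountain lake"])          # embed the text ourselves
res_a = collection.query(query_embeddings=manual_emb, n_results=3)
res_b = collection.query(query_texts=["beautiful mountain lake"], n_results=3)
assert res_a["ids"] == res_b["ids"]                       # same neighbours either way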

Search saved embeddings with text

Now we are ready for action!

We can start with a very simple search to see what we get. As discussed previously, Chroma automatically embeds the query with the embedding model we specified when creating the collection, then performs a semantic search for the number of similar images we request.

query_text = "beautiful mountain lake"
collection.query(query_texts=[query_text], n_results=5)

which gives

{'ids': [['fKdkTVEYMiQ',
   'DXcIb5pmMEg',
   'pwkHJXr01bQ',
   'y5Tk8f7TBqw',
   'mg-k04n58xY']],
 'embeddings': None,
 'documents': [[None, None, None, None, None]],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[{'description': 'aerial photo of snow covered mountain near body of water',
    'url': 'https://images.unsplash.com/photo-1545423269-690becec20d6'},
   {'description': 'green pine trees near snow covered mountain during daytime',
    'url': 'https://images.unsplash.com/photo-1581263117539-6873a143df45'},
   {'url': 'https://images.unsplash.com/photo-1563713665854-e72327bf780e',
    'description': 'body of water near island'},
   {'url': 'https://images.unsplash.com/photo-1575905290477-6313d6493abd',
    'description': 'building near trees and body of water during day'},
   {'url': 'https://images.unsplash.com/photo-1577403922630-ee5218668ae5',
    'description': 'body of water'}]],
 'distances': [[136.6397247314453,
   136.76376342773438,
   138.41220092773438,
   138.96884155273438,
   139.31295776367188]]}

This is not very helpful at all.

Now let’s use the URLs in the metadata to retrieve the images online, and display their respective descriptions above each image. We are going to use Matplotlib to format the result properly.

from PIL import Image
from io import BytesIO
import matplotlib.pyplot as plt
import textwrap

query_text = "beautiful mountain lake"  
results = collection.query(query_texts=[query_text], n_results=5, include=["metadatas"])

# Extract metadata
matched_metadata = results["metadatas"][0]

# Resize dimensions
max_width = 300   

# Fetch and process images
images = []
titles = []

for metadata in matched_metadata:
    image_url = metadata["url"]
    title = metadata["description"]
  
    if image_url:
        response = requests.get(image_url)
        img = Image.open(BytesIO(response.content))

        # Resize image while maintaining aspect ratio
        aspect_ratio = img.height / img.width
        new_height = int(max_width * aspect_ratio)
        img = img.resize((max_width, new_height))

        images.append(img)

        # Wrap text for better readability (adjust width if needed)
        wrapped_title = "\n".join(textwrap.wrap(title, width=30))  
        titles.append(wrapped_title)

# Display images in a single row with wrapped text
fig, axes = plt.subplots(1, len(images), figsize=(len(images) * 3, 4))

for ax, img, title in zip(axes, images, titles):
    ax.imshow(img)
    ax.set_title(title, fontsize=10, wrap=True)  # Ensure proper wrapping
    ax.axis("off")

plt.tight_layout()
plt.show()

[Figure: the five matched photos in a row, each with its description above]

Now our search results are rendered in a much easier-to-understand format.

UMAP plot with all image embeddings

It is also helpful to visualize the distribution of our data so that we can do more analysis or clustering later. To do so, let’s use the UMAP dimension-reduction algorithm to reduce our image embeddings from their original 512 dimensions (determined by CLIP) down to 2.

Dimension reduction has many uses beyond aiding data visualization and interpretation, e.g., feature extraction, reducing resource usage, and mitigating data sparsity. Because the structure of the embedding space is highly non-linear, a linear method like PCA would not capture it well; instead we can pick either t-SNE or UMAP. I pick UMAP because t-SNE gets slow on larger datasets. UMAP also preserves meaningful distances and densities better, and with a fixed random seed it produces more reproducible results across runs.

Dimension reduction is often the first step in exploratory data analysis, and the global relationships preserved by UMAP carry over better to downstream tasks such as clustering, classification and topic modelling.
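If we wanted to nudge that trade-off, the two UMAP parameters that matter most are n_neighbors and min_dist; the values below are illustrative only and are not the ones we use later:

import umap

# Larger n_neighbors weighs global structure more heavily;
# smaller min_dist packs locally similar points closer together.
reducer = umap.UMAP(n_components=2, n_neighbors=30, min_dist=0.1,
                    metric="cosine", random_state=42)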

The reduced embeddings are then fed to Bokeh to plot them in an interactive chart.

As a bonus, we will also implement hovering so that when we hover over a dot of an embedding, the respective image and description will be shown.

import numpy as np
import umap
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import ColumnDataSource, HoverTool

# Ensure Bokeh output appears in the notebook
output_notebook()

# Retrieve embeddings and metadatas 
results = collection.get(include=["embeddings", "metadatas"])
embeddings = np.array(results["embeddings"])
metadata_list = results["metadatas"]

# Extract "url" and "description" from the metadata. 
urls = [meta.get("url", "") for meta in metadata_list]
descriptions = [meta.get("description", "") for meta in metadata_list]

# Create a UMAP reducer for 2D embedding
reducer = umap.UMAP(n_components=2, random_state=42)
embedding_2d = reducer.fit_transform(embeddings)

# Prepare a Bokeh ColumnDataSource
source = ColumnDataSource(data={
    "x": embedding_2d[:, 0],
    "y": embedding_2d[:, 1],
    "url": urls,
    "desc": descriptions,
})

# Create a Bokeh figure
p = figure(width=800, height=600, title="UMAP of 25k Unsplash-Lite image embeddings ", tools="pan,wheel_zoom,box_zoom,reset")

# Configure the hover tool to display an image (via its URL) and the description
hover = HoverTool(tooltips="""
    <div>
        <div>
            <img src="@url" alt="Image" style="width:150px;"/>
        </div>
        <div>
            <span style="font-size: 12px;">@desc</span>
        </div>
    </div>
""")
p.add_tools(hover)

# Plot each embedding as a circle
p.circle('x', 'y', size=10, source=source, fill_alpha=0.6)

# Display the plot
show(p)

[Figure: interactive Bokeh scatter plot of the UMAP-reduced image embeddings]

We can then hover over a region to see what the concentration is about.

Zooming in to a specific area, we can confirm that the nearby embeddings are indeed very similar images.

[Figure: three zoomed-in views of one region, showing visually similar photos clustered together]

Viewing the plot inline in a Jupyter notebook might feel cramped. We can output the plot to an HTML file to explore it in a bigger viewport, where we can freely pan, zoom and hover in the browser.

from bokeh.plotting import output_file, show
from bokeh.layouts import layout

# Instead of output_notebook(), use output_file to write to an HTML file.
output_file("umap_plot.html")

# Wrap the figure in a layout with sizing_mode "stretch_both" so it expands to fill the browser window.
layout_plot = layout([p], sizing_mode="stretch_both")

# Display the plot in browser
show(layout_plot)

Multimodal search and visualization

Now let’s take this further. Let’s search using either text or image, and visualize our search results in a plot.

We will also display our query prominently in the plot so we can get a feel for its relative distance (similarity) to the matches.

Embedding functions for text/image query

We will first define embedding functions for our query, which can be text or an image. Since we need the query embedding to display the query in the plot, we cannot rely on Chroma's automatic embedding; instead, we will embed the query manually to get the vector ourselves. We will again be using CLIP, which maps images and text to the same vector space.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('clip-ViT-B-32')

def get_text_embedding(text):
    # Encode the text string
    query_embedding  = model.encode(text)
    return query_embedding

def get_image_embedding(path):
    # Takes both online and local images.
    # If the query string starts with "http", it downloads the image.
    # Otherwise, it treats the query as a local file path.
    # Returns a NumPy array embedding.

    # Check if query is an online URL.
    if path.startswith("http"):
        response = requests.get(path)
        if response.status_code != 200:
            raise Exception(f"Failed to download image from URL: {path}")
        image = Image.open(BytesIO(response.content))
    else:
        image = Image.open(path)
  
    # Get the image embedding as a NumPy array.
    image_embedding = model.encode(image, convert_to_tensor=False)
    return image_embedding
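As a quick illustration that text and images really do land in the same vector space, we can compare a text embedding against the embedding of the first match from our earlier "beautiful mountain lake" search (any photo URL would do):

import numpy as np

txt_emb = get_text_embedding("snow covered mountain near a body of water")
img_query_emb = get_image_embedding("https://images.unsplash.com/photo-1545423269-690becec20d6")

# Cosine similarity between the two embeddings; a matching text/image pair should score relatively high
cos = float(np.dot(txt_emb, img_query_emb) / (np.linalg.norm(txt_emb) * np.linalg.norm(img_query_emb)))
print(f"cosine similarity: {cos:.3f}")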

Next, we will define a helper function to convert a local image file to a data URL suitable for display in Bokeh.

import imghdr   # note: deprecated since Python 3.11 and removed in Python 3.13;
                # Pillow's Image.open(...).format is one alternative for type detection
import os
from base64 import b64encode

def local_image_to_data_url(path):
    if not os.path.exists(path):
        raise FileNotFoundError(f"Local file '{path}' not found.")
    with open(path, "rb") as f:
        data = f.read()
    img_type = imghdr.what(None, h=data)
    data_url = "data:image/{};base64,".format(img_type) + b64encode(data).decode('utf-8')
    return data_url
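For example, converting the local query image we will use later (tokyotower.jpg) yields a data URL that Bokeh can render directly:

data_url = local_image_to_data_url("tokyotower.jpg")
print(data_url[:30])   # something like 'data:image/jpeg;base64,/9j/4AA'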

Query vector database and dimension reduction

We will then define functions to query our existing Chroma collection, extract matching embeddings and metadata.

We then use UMAP to reduce the original embeddings from 512 dimensions to 2. Finally, we will prepare the result as input data for Bokeh.

def query_umap_data(collection, query, query_type="text"):
    """
    Embed query, retrieves the 10 most similar images (with embeddings and metadata),
    computes a UMAP projection, and prepares Bokeh data sources.
    Returns:
      query_source : a ColumnDataSource for the query point
      image_source : a ColumnDataSource for the similar images
      (query_x, query_y) coordinates for the query point (for use in plotting)
    """ 
  
    # Determine query embedding and corresponding query parameters.
    if query_type == "text":
        query_embedding = get_text_embedding(query)
        query_params = {"query_texts": [query]}
    elif query_type == "image":
        # For images, if query starts with "http", treat it as an online image.
        if query.startswith("http"):
            query_image_url = query
        else:
            query_image_url = local_image_to_data_url(query)
        query_embedding = get_image_embedding(query)
        query_params = {"query_embeddings": [query_embedding]}
    else:
        raise ValueError("query_type must be either 'text' or 'image'")
  
    # Query the collection for the 10 most similar images.
    results = collection.query(
        **query_params,
        n_results=10,
        include=["embeddings", "metadatas"]
    )
  
    # Extract the image embeddings and metadata.
    image_embeddings = np.array(results["embeddings"][0])
    metadata_list = results["metadatas"][0]
  
    # Extract similar image URLs and descriptions.
    sim_urls = [meta.get("url", "") for meta in metadata_list]
    descriptions = [meta.get("description", "") for meta in metadata_list]
    wrapped_descriptions = [textwrap.fill(text, width=20) for text in descriptions]
  
    # Combine the query embedding with the retrieved image embeddings
    # and compute a UMAP projection in 2D. With only 11 points, UMAP will
    # clamp its default n_neighbors down to the dataset size and emit a warning,
    # which is fine for a quick visual layout.
    all_embeddings = np.vstack([query_embedding, image_embeddings])
    reducer = umap.UMAP(n_components=2, random_state=42)
    umap_proj = reducer.fit_transform(all_embeddings)
  
    # The first coordinate is for the query and the rest for similar images.
    query_x, query_y = umap_proj[0, 0], umap_proj[0, 1]
    image_x = umap_proj[1:, 0]
    image_y = umap_proj[1:, 1]
  
    # Prepare Bokeh data sources.
    if query_type == "text":
        query_source = ColumnDataSource(data={
            "x": [query_x],
            "y": [query_y],
            "query": [query]
        })
    else:  # For image queries, include the image URL for display.
        query_source = ColumnDataSource(data={
            "x": [query_x],
            "y": [query_y],
            "url": [query_image_url]
        })
  
    image_source = ColumnDataSource(data={
        "x": image_x,
        "y": image_y,
        "url": sim_urls,
        "wrapped_desc": wrapped_descriptions,
    })
  
    return query_source, image_source, query_x, query_y

Plot both query and search results with Bokeh

Next, we will use the Bokeh data sources and query coordinates to create a figure that visualizes the query relative to the similar images. The design is as follows:

- For text queries: the query is depicted as a red dot with its text label.
- For image queries: the query image is displayed with a red border, with the red label "Query Image" above it.
- Similar images are shown as thumbnails with word-wrapped descriptions above them.

from bokeh.models import Label, LabelSet   # annotation classes used below

def plot_umap_data(query_source, image_source, query_x, query_y, query_type="text"):

    p = figure(title="UMAP Visualization: Query vs. Similar Images",
               width=800, height=600, tools="pan,zoom_in,zoom_out,reset")
  
    # Plot similar images as thumbnails.
    p.image_url(url='url', x='x', y='y', w=0.4, h=0.4, anchor="center", source=image_source)
  
    # Add labels for each similar image's description.
    labels = LabelSet(
        x='x', y='y', text='wrapped_desc',
        x_offset=0, y_offset=35,  # position above each thumbnail
        source=image_source,
        text_font_size="8pt",
        text_align="center",
        text_baseline="bottom",
        text_color="black",
        background_fill_color="white",
        background_fill_alpha=0.7
    )
    p.add_layout(labels)
  
    # Display the query point.
    if query_type == "text":
        # For text queries: display a red dot with a label.
        p.circle('x', 'y', size=15, color="red", source=query_source)
        query_label = LabelSet(
            x='x', y='y', text='query',
            x_offset=0, y_offset=20,
            source=query_source,
            text_font_size="9pt",
            text_color="red",
            text_align="center",
            text_baseline="bottom",
            background_fill_color="white",
            background_fill_alpha=0.7
        )
        p.add_layout(query_label)
    else:
        # For image queries: display the query image and add a red border.
        p.image_url(url='url', x='x', y='y', w=0.4, h=0.4, anchor="center", source=query_source)
        p.rect(x=query_x, y=query_y, width=0.4, height=0.4,
               fill_alpha=0, line_color="red", line_width=2)

        # Add a static Label for "Query Image" above the red box
        query_image_label = Label(
            x=query_x, y=query_y,               # Base coordinates of the query point
            x_offset=0, y_offset=40,            # Offset to position the text above the image
            text="Query Image",                 # The label text
            text_font_size="10pt",
            text_color="red",
            text_align="center",
            text_baseline="bottom",
            background_fill_color="white",
            background_fill_alpha=0.7
        )
        p.add_layout(query_image_label)

    return p

We will wrap up everything with a helper function to handle all the above steps.

def display(collection, query, query_type="text"):
    """
    High-level function that accepts a query (text string or image path/URL) and its type,
    then performs the query, computes the UMAP projection, and displays the interactive Bokeh plot.
    """
    query_source, image_source, query_x, query_y = query_umap_data(collection, query, query_type)
    p = plot_umap_data(query_source, image_source, query_x, query_y, query_type)
    show(p)

Finally: Query time!

Now we can simply call the display function with any kind of query. We just need to supply a ChromaDB collection, our query, and whether it is a text string or an image path/URL.

For example, doing a text search with the phrase beautiful mountain lake as in display(collection, "beautiful mountain lake", query_type="text") will show

[Figure: UMAP plot of the text query "beautiful mountain lake" and its ten closest matches]

Doing a search with an online image with display(collection, "https://wallpaperaccess.com/full/2733882.jpg", query_type="image") will show

[Figure: UMAP plot of the online query image and its ten closest matches]

We can also query with a local image of Tokyo Tower at night, e.g. display(collection, "tokyotower.jpg", query_type="image"), which shows

[Figure: UMAP plot of the Tokyo Tower query image and its ten closest matches]

Analyze the semantic understanding of the CLIP visual transformer

Since the dataset is small, with only 25k images, it most likely contains no photo of Tokyo Tower itself. However, a semantic search over embeddings produced by the CLIP visual transformer is still very powerful: the results all share some elements with the Tokyo Tower query.

Let’s examine the results, moving counter-clockwise:

  • On the right is a cherry-blossom-framed view of Tokyo Skytree
  • Above it is Himeji Castle surrounded by cherry blossoms
  • On the upper left, we see the Eiffel Tower at night
  • On the left is a dusk/dawn view of buildings. Notice that in both this image and the Eiffel Tower one, the sky takes up about half of the frame, similar to our Tokyo Tower photo
  • At the bottom is a night photo of a nightlife district in Shizuoka City, Japan, followed by a photo taken in Ginza, Tokyo (indicated by the GSix shopping arcade and the Ginza 5-chome street sign on the left).
  • On the lower right is another scene in Tokyo (confirmed by the lat-long data in the dataset metadata, though mislabeled by Unsplash's AI as NYC).
  • Farther away, we see photos of aerial city night scenes.

This clearly shows that the embedding model we use (CLIP ViT) is able to extract semantic properties of the query such as Tokyo, Japan, tower (or tall structures), city, street, night, buildings, aerial view, and even image composition. Moreover, since the plot coordinates of each image are dictated by the (reduced) embeddings, we observe that images appearing close to our query in the plot are indeed semantically similar.

We can also perform text search in multiple languages by encoding our query text with the clip-ViT-B-32-multilingual-v1 text embedding model, which is aligned with clip-ViT-B-32 and maps 50+ languages to the same vector space. We continue to use the same clip-ViT-B-32 image encoder for both query images and stored images.
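A sketch of what such a multilingual query could look like (the Japanese query string is just an example, meaning "beautiful mountain lake"):

from sentence_transformers import SentenceTransformer

# Multilingual text encoder aligned with the clip-ViT-B-32 image encoder
multilingual_text_model = SentenceTransformer("sentence-transformers/clip-ViT-B-32-multilingual-v1")
query_embedding = multilingual_text_model.encode("美しい山の湖")
results = collection.query(query_embeddings=[query_embedding.tolist()],
                           n_results=5, include=["metadatas"])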

Conclusion

In this blog post, I have shown how to do multimodal search with ChromaDB. Specifically, we used dimension reduction to examine the distribution of an image dataset, and applied the same technique to visualize search results so that we get a better idea of the semantic and spatial relationship between a query and its matches.