This is #2 of the Semantic Search with PostgreSQL series. In #1, I set up PostgreSQL and the pgvector extension. This time, let’s build on top of that and run different kinds of queries against PostgreSQL with pgvector.
By the end of this post, we will be able to search for photos using an arbitrary location, natural language, and an image. We will also be able to combine everything to pull only photos shot on a Nikon within 50 km of New York City that contain modern architecture. Along the way, we’ll compare the features PostgreSQL and Chroma offer as vector engines.
Why PostgreSQL?
In the previous experiment, we loaded the Unsplash 25k image embeddings with their respective metadata into Chroma and performed semantic search. Chroma is very fast at embedding similarity calculations and handles light-duty metadata search with basic filtering.
However, in scenarios where the data is more complex (e.g., e-commerce with photos, product info, users, and transactions), a real relational database that can join tables serves much better. The same goes for my existing project, where I need semantic (photo and text), relational (joining various tables) AND geospatial search: a more versatile platform such as PostgreSQL with extensions fits the bill perfectly.
In our experiment, the Unsplash Lite dataset contains several files, each linked by the photo_id key. Imagine this trip planning scenario:
- I have a photo of a landmark (e.g., Tokyo Tower)
- I want to find the most similar photos to it
- The targets must be shot around Tokyo (or, if we are adventurous, we can target Shanghai to see if there’s anything there that looks like Tokyo Tower)
- Once I have the targets, I want to check out the photo collections they belong to, to expand my exploration. For example, if I get the Oriental Pearl Tower in Shanghai as a target, it might have been added to a collection of other modern Shanghai skyscrapers. Now I can visit more places on my trip!
This scenario would be very hard to pull off in Chroma, especially #3 (spatial search) and #4 (the classic SQL join), and that’s where PostgreSQL comes in.
Obtain Unsplash dataset
Get all metadata
note: this is a rehash of what we did in the previous post. I am reposting it here for clarity.
First, we need to download the Unsplash Lite dataset with metadata such as photo ID, image URL, photographer name, AI-generated description, geolocation, EXIF data, keywords used for the image, colors, etc.
After downloading the file, we load the data into a dataframe.
import numpy as np
import pandas as pd
import glob

path = '.'
documents = ['photos', 'keywords', 'collections', 'conversions', 'colors']
datasets = {}

for doc in documents:
    print(doc)
    files = glob.glob(doc + ".csv000*")
    print(files)
    subsets = []
    for filename in files:
        print(filename)
        df = pd.read_csv(filename, sep='\t', header=0)
        subsets.append(df)
    if subsets:
        datasets[doc] = pd.concat(subsets, axis=0, ignore_index=True)
    else:
        # Handle the empty case
        datasets[doc] = pd.DataFrame()
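To sanity-check the load, we can print the shape of each dataframe; the exact numbers depend on the dataset version you downloaded.

for doc, frame in datasets.items():
    print(doc, frame.shape)   # rows x columns for each metadata file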
Get precomputed image embeddings
note: again this is a rehash of the previous post.
For the 25k Unsplash images, we will use the precomputed image embeddings provided in the SentenceTransformer documentation.
import requests

url = "http://sbert.net/datasets/unsplash-25k-photos-embeddings.pkl"
response = requests.get(url)
with open("unsplash-25k-photos-embeddings.pkl", "wb") as file:
    file.write(response.content)
We then deserialize the file to get both the image names and embeddings.
import pickle

emb_filename = "unsplash-25k-photos-embeddings.pkl"
with open(emb_filename, 'rb') as fIn:
    img_names, img_emb = pickle.load(fIn)
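As an optional sanity check, the number of file names should match the number of embedding rows, and CLIP ViT-B/32 produces 512-dimensional vectors.

print(len(img_names))   # number of photo file names
print(img_emb.shape)    # (number of images, 512)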
Get source images
We’ll also save the 25k Unsplash images for offline use, per the SentenceTransformer documentation.
import os
import zipfile
from tqdm import tqdm
from sentence_transformers import util

img_folder = 'photos/'
if not os.path.exists(img_folder) or len(os.listdir(img_folder)) == 0:
    os.makedirs(img_folder, exist_ok=True)

    photo_filename = 'unsplash-25k-photos.zip'
    if not os.path.exists(photo_filename):  # Download dataset if it does not exist
        util.http_get('http://sbert.net/datasets/' + photo_filename, photo_filename)

    # Extract all images
    with zipfile.ZipFile(photo_filename, 'r') as zf:
        for member in tqdm(zf.infolist(), desc='Extracting'):
            zf.extract(member, img_folder)
Ingest to Postgres
Next, we will use PostgreSQL to store the embeddings, together with their respective metadata.
Prepare data
Since we want to link up the embeddings to their respective metadata, we need to find a common value between them.
- In the embedding pickle file, img_names entries are in the format xyz.jpg
- In the datasets['photos'] dataframe, the photo_id column stores each image name without the .jpg suffix

Thus we will:

- Strip the .jpg suffix from the pickle’s img_names
- Rename that column to photo_id
- Merge it with the metadata dataframe on its photo_id column
First, we strip the suffix:
photo_ids = [name.rsplit('.', 1)[0] for name in img_names]
We then create a DataFrame with one row per image and an embedding column holding the vectors we will push into pgvector later. We convert each embedding to a plain Python list of floats so that psycopg2 can serialize it correctly.
df = pd.DataFrame({
    'photo_id': photo_ids,
    'embedding': [emb.tolist() for emb in img_emb]  # plain Python lists, which psycopg2 can adapt
})
We then merge the embeddings with the metadata on the photo_id key.
metadata = datasets['photos']
df = df.merge(metadata, on='photo_id', validate='one_to_one')
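Since merge() performs an inner join by default, it is worth confirming that no embeddings were silently dropped for lack of metadata; a minimal check:

# If these differ, some embeddings had no matching metadata row
print(len(photo_ids), len(df))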
Later, we will use photo_id
as the primary key in our Postgres table DDL.
Create a new database
Create a dedicated database:
sudo -u postgres createdb imagesdb
Enable the vector extension:
sudo -u postgres psql -d imagesdb -c "CREATE EXTENSION IF NOT EXISTS vector;"
Connect to database from Python
Install psycopg2 (the PostgreSQL adapter for Python):
pip install psycopg2-binary
Connect to the newly created imagesdb database:

import os
import psycopg2

conn = psycopg2.connect(
    host="127.0.0.1",
    port=5432,
    dbname="imagesdb",
    user="postgres",
    password=os.getenv("PG_PASS")
)
cur = conn.cursor()
Enable pgvector extension
Even though we installed the pgvector package in the previous post, the vector data type only exists in databases where the extension has been enabled. We have to run CREATE EXTENSION vector; in every database where we plan to use the VECTOR(...) data type.
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
conn.commit()
Create a table
We need to make sure the table’s VECTOR(<dim>) dimension matches our actual embedding length.
Since the metadata contains a stats_views column with a maximum value of about 2.7 billion, we have to use BIGINT for that column. Ideally, we detect overflow in any integer column and switch to BIGINT automatically, which is what the loop below does.
dim = len(df['embedding'].iloc[0])
int32_max = np.iinfo(np.int32).max

col_defs = []
for col, dtype in df.drop(['photo_id', 'embedding'], axis=1).dtypes.items():
    if dtype.kind == 'i':
        pg_type = 'BIGINT' if df[col].max() > int32_max else 'INTEGER'
    elif dtype.kind == 'f':
        pg_type = 'DOUBLE PRECISION'
    else:
        pg_type = 'TEXT'
    col_defs.append(f"{col} {pg_type}")

create_sql = f"""
DROP TABLE IF EXISTS images;
CREATE TABLE images (
    photo_id TEXT PRIMARY KEY,
    embedding VECTOR({dim}),
    {', '.join(col_defs)}
);
"""
cur.execute(create_sql)
conn.commit()
Insert rows
from psycopg2 import sql
from psycopg2.extras import execute_values

columns = df.columns.tolist()
insert_sql = sql.SQL("INSERT INTO images ({}) VALUES %s").format(
    sql.SQL(', ').join(map(sql.Identifier, columns))
)

records = [
    tuple([row['photo_id'], row['embedding']] +
          [row[c] for c in columns if c not in ('photo_id', 'embedding')])
    for _, row in df.iterrows()
]

execute_values(cur, insert_sql.as_string(cur), records, page_size=1000)
conn.commit()
Verify
Log in to Postgres and connect to the imagesdb database we just created:

sudo -u postgres psql -d imagesdb

List all tables in the database:

\dt

We’ll see the images table we just created. Examine the images table’s schema (column names, data types, default values, indexes) with \d images, or get extended details (including storage and size) with \d+ images.

Finally, check the images table’s row count to make sure everything was ingested:

SELECT COUNT(*) FROM images;
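We can run the same row-count check from our Python session as well, reusing the cursor from earlier:

cur.execute("SELECT COUNT(*) FROM images;")
print(cur.fetchone()[0])   # should equal len(df)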
Query
First let’s connect to our database
import os
import psycopg2

conn = psycopg2.connect(
    host="127.0.0.1", port=5432,
    dbname="imagesdb", user="postgres",
    password=os.getenv("PG_PASS")
)
cur = conn.cursor()
Search metadata
Let’s now try a simple query to display the first 5 images taken by Canon, using the exif_camera_make
column in the metadata.
from IPython.display import HTML, display

cur.execute("SELECT photo_id FROM images WHERE exif_camera_make ILIKE %s LIMIT 5;", ("Canon",))
ids = [r[0] for r in cur.fetchall()]

img_tags = ""
for pid in ids:
    path = f"photos/{pid}.jpg"
    if os.path.exists(path):
        img_tags += f"<div style='margin:4px'><img src='{path}' width='200'></div>"
    else:
        img_tags += f"<div style='margin:4px;color:red'>missing: {pid}.jpg</div>"

html = f"""
<div style="
  display: flex;
  flex-direction: row;
  align-items: flex-start;
  flex-wrap: nowrap;
">
  {img_tags}
</div>
"""
display(HTML(html))
We will get the following:
Semantic query
Let’s reuse the embedding function from my previous post to generate embeddings for both text and image queries.
def get_embedding(query, type):
    """
    Obtain an embedding using SentenceTransformer with CLIP.
    This function handles both text and image, denoted by "type".
    For an image, if the query string starts with "http", it downloads the image.
    Otherwise, it treats the query as a local file path.
    Returns a NumPy array embedding.
    """
    from PIL import Image
    import requests
    from io import BytesIO
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('clip-ViT-B-32')

    if type == "image":
        # Check if query is an online URL.
        if query.startswith("http"):
            response = requests.get(query)
            if response.status_code != 200:
                raise Exception(f"Failed to download image from URL: {query}")
            image = Image.open(BytesIO(response.content))
        else:
            image = Image.open(query)
        # Get the image embedding as a NumPy array.
        embedding = model.encode(image, convert_to_tensor=False)
    else:
        embedding = model.encode(query)

    return embedding
Query with text
We can then run a text query to find the images most similar to a text phrase, using pgvector’s k-NN distance operator (<->) to rank results by distance. The steps are:
Compute a CLIP-ViT-B/32 embedding for our query text
query_emb = get_embedding("beautiful lake","text")
Execute a SQL query against the images table. Let’s break this down:

- In psycopg2, %s is a placeholder for a query parameter. When we call cur.execute(sql, (query_emb.tolist(),)) later, psycopg2 takes query_emb.tolist() and substitutes it for %s in the SQL statement, handling quoting and escaping exactly as Postgres expects.
- ::vector casts the query embedding array into the pgvector vector type.
- The <-> operator now accepts the parameter as a true vector and computes the Euclidean distance used for nearest-neighbor ranking.
- AS distance names the computed distance so we can reference it in ORDER BY, which sorts the result set in ascending order.
- query_emb.tolist() turns the embedding NumPy array into a plain Python list of floats. psycopg2 does not know how to adapt a NumPy ndarray, but it can serialize a Python list into a Postgres array literal (e.g. {0.1,0.2,…}).

# Parameterized KNN query: smallest distance first
sql = """
SELECT photo_id,
       exif_camera_make,
       embedding <-> %s::vector AS distance
FROM images
ORDER BY distance
LIMIT 5;
"""
cur.execute(sql, (query_emb.tolist(),))
results = cur.fetchall()
Display the thumbnails inline, sorted by semantic distance from the text query.
items = []
for pid, make, dist in results:
    img_src = f"photos/{pid}.jpg"
    items.append(f"""
    <div style="margin:4px;text-align:center">
      <img src="{img_src}" width="200"><br>
      <small>{pid}<br>{make}<br>d={dist:.2f}</small>
    </div>
    """)

html = f"<div style='display:flex'>{''.join(items)}</div>"
display(HTML(html))
We get this result:
Query with image
Similarly, we can do an image query with the same Tokyo Tower photo we used last time. Just change the first line of code to
query_emb = get_embedding("tokyo-wallpaper.jpg","image")
and run it. We will get
Likewise, we can query with an online image, such as this Mt. Fuji photo from Wikipedia. Just substitute the first 2 lines of code with
imgURL="https://upload.wikimedia.org/wikipedia/commons/f/f8/View_of_Mount_Fuji_from_%C5%8Cwakudani_20211202.jpg"
query_emb = get_embedding(imgURL,"image")
We will get this
Semantic and metadata search
What if we want to search with an image or natural-language query, but also scope the results to photos taken by a specific camera make (e.g. Canon)? In this case, we just add a WHERE clause to the SQL statement, like any normal SQL query.
sql = """
SELECT photo_id,
exif_camera_make,
embedding <-> %s::vector AS distance
FROM images
WHERE exif_camera_make ILIKE %s
ORDER BY distance
LIMIT 5;
"""
params = (query_emb.tolist(), 'Canon%')
cur.execute(sql, params)
This filters for rows where the camera make starts with “Canon”, case-insensitively. We can use = %s in the SQL statement if we want an exact match, or pass the parameter as '%Canon%' for substring matching, as sketched below.
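For example, a minimal sketch of the substring variant, keeping the same SQL statement (the parameter value is just an illustration):

# Case-insensitive substring match: any make containing "Canon"
params = (query_emb.tolist(), '%Canon%')
cur.execute(sql, params)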
Notice that we now have two %s placeholders and a two-element tuple of parameters. We can swap in any camera make via params without changing the SQL string.
Rerunning our code with this updated SQL statement and the Wikipedia Mt. Fuji photo, we get the following results:
Multilingual search
We can also perform text search in multiple languages by encoding our query text with the clip-ViT-B-32-multilingual-v1 text embedding model, which is aligned with clip-ViT-B-32 and maps text in 50+ languages into the same vector space. For image queries, we continue to use the same clip-ViT-B-32 model.
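Here is a minimal sketch: we encode the text with the multilingual model and reuse the same k-NN query as before (the Japanese query string is just an illustration).

from sentence_transformers import SentenceTransformer

# Multilingual text encoder aligned with the clip-ViT-B-32 image space
ml_model = SentenceTransformer('clip-ViT-B-32-multilingual-v1')
query_emb = ml_model.encode("美しい湖")   # "beautiful lake" in Japanese

sql = """
SELECT photo_id,
       embedding <-> %s::vector AS distance
FROM images
ORDER BY distance
LIMIT 5;
"""
cur.execute(sql, (query_emb.tolist(),))
results = cur.fetchall()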
Spatial search
One of the factors driving my switch from Chroma to Postgres is spatial search. There are powerful extensions such as PostGIS and earthdistance/cube (which provides earth-curvature accuracy without PostGIS), but even PostgreSQL’s native planar geometry support proves quite useful for small areas. Let’s see!
note: The Unsplash Lite dataset comes with limited geodata, as only 28% of images have complete lat/long data. Nevertheless, this section is a quick proof-of-concept to show the versatility of using Postgres.
Update table for geometric data
We’ll first update our table schema to take advantage of PostgreSQL’s native geometric types such as point, which is a 2D coordinate (x,y). Other native geometric types include lseg, box and path, which we are not going to use in this test.
ALTER TABLE images
ADD COLUMN geo_pt point;
UPDATE images
SET geo_pt = point(photo_location_longitude, photo_location_latitude);
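If you prefer to run these statements from the Python session, a minimal equivalent using the cursor from earlier (assuming the longitude/latitude columns were ingested as numeric types) could look like this:

cur.execute("ALTER TABLE images ADD COLUMN IF NOT EXISTS geo_pt point;")
cur.execute("""
    UPDATE images
    SET geo_pt = point(photo_location_longitude, photo_location_latitude)
    WHERE photo_location_longitude IS NOT NULL
      AND photo_location_latitude IS NOT NULL;
""")
conn.commit()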
Geocoding
We’ll then use Nominatim, a free OpenStreetMap geocoding API, to obtain the latitude and longitude of a place programmatically. First install the client with pip install geopy.
Then we will write our geocode script to resolve a location name with its latitude and longitude:
from geopy.geocoders import Nominatim

def geocode_city(city_name: str):
    geolocator = Nominatim(user_agent="geo_demo")
    loc = geolocator.geocode(city_name)
    if not loc:
        raise ValueError(f"Couldn’t geocode: {city_name}")
    return loc.latitude, loc.longitude

location = "Kyoto"
lat, lon = geocode_city(location)
print(f"{location} → lat: {lat:.6f}, lon: {lon:.6f}")
We successfully get back Kyoto → lat: 35.011575, lon: 135.768144.
Simple spatial search
Now we have Kyoto’s geodata. Let’s find 5 images closest to it.
Our updated SQL statement looks like this, taking advantage of the table’s new point column. Let’s break down the spatial search part:
- In geo_pt <-> point(%s, %s), the <-> operator applied to two PostgreSQL point values returns the Euclidean (straight-line) distance between them.
  - This is identical to how we compare embeddings, except that <-> returns a result in the units of the point’s coordinate system. Since geo_pt was created as point(longitude, latitude), that distance is measured in degrees of longitude/latitude, not in meters or kilometers.
- point(%s, %s) constructs an in-query point from the longitude and latitude parameters supplied in cur.execute(sql, (lon, lat, 5)) below.
note: For more advanced usage, we can search by containment (e.g. box @> point), use && for bounding-box intersection, and call functions such as area(), length(), center(), and perimeter(). See the sketch at the end of this section for a containment example.
sql = """
SELECT
photo_id,
photo_location_city,
photo_location_country,
geo_pt <-> point(%s, %s) AS planar_dist
FROM images
WHERE geo_pt IS NOT NULL
ORDER BY planar_dist
LIMIT %s;
"""
We can then execute the query by passing along Kyoto’s geodata and displaying the results:
cur.execute(sql, (lon, lat, 5))
results = cur.fetchall()

items = []
for pid, city, country, dist in results:
    img_src = f"photos/{pid}.jpg"
    items.append(f"""
    <div style="margin:4px;text-align:center">
      <img src="{img_src}" width="200"><br>
      <small>{pid}<br>{city}<br>d={dist:.4f}</small>
    </div>
    """)

html = f"<div style='display:flex'>{''.join(items)}</div>"
display(HTML(html))
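Following up on the earlier note about containment operators, here is a minimal sketch of a bounding-box filter; the box corners are arbitrary coordinates roughly around Tokyo, purely for illustration:

# Photos whose geo_pt falls inside a longitude/latitude bounding box
cur.execute("""
    SELECT photo_id, photo_location_city
    FROM images
    WHERE geo_pt IS NOT NULL
      AND box(point(139.5, 35.5), point(140.0, 35.9)) @> geo_pt
    LIMIT 5;
""")
print(cur.fetchall())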
Spatial and semantic search
We can combine pgvector’s semantic k-NN with point-based spatial filtering to get “the most similar images to an image/phrase within X km of Tokyo.”
The beauty of this is that we’re mixing in natural language (e.g., to describe an ambience, particular object, etc.) to search for images nearest a location. In the following example, we want to look up 5 modern architecture photos within 50km of New York City.
Notice that this is a two-step process:
First, use a spatial pre-filter (WITH spatial_subset AS ...) to narrow down candidates by geography before computing expensive vector distances. In the spatial filter WHERE geo_pt <-> point(%s, %s) < %s:

- The < %s comparison accepts only images whose geo_pt distance is smaller than a threshold value (the third %s), which is radius_deg.
- Because the <-> distance is in degrees, we must translate our real-world radius (50 km) into an equivalent radius_deg threshold.
- We use radius_deg = 50 / 111 to do this. The numerator (50) is the search radius in kilometers, and the denominator (111) is the approximate number of kilometers per degree at mid-latitudes.
- Thus radius_deg ≈ 0.45 degrees: any image whose planar degree-distance to New York City is less than 0.45 is roughly within 50 km.

Second, rank the spatially filtered rows by their semantic similarity to our query embedding:

- s.embedding <-> %s::vector AS vec_dist computes the vector distance between each image’s embedding (s.embedding) and the supplied query embedding (%s::vector).
- geo_pt <-> point(%s, %s) AS geo_dist recomputes the planar geospatial distance for each final row, for reporting or tie-breaking.
- We join spatial_subset back to the full images table to retrieve all the other columns (city, country, etc.).
- ORDER BY vec_dist, geo_dist sorts the joined results by ascending semantic distance, with geospatial distance as a secondary sort: if two images have nearly identical embeddings, the tie is broken by proximity.
query_emb = get_embedding("modern architecture","text") sql = """ WITH spatial_subset AS ( SELECT photo_id, embedding FROM images WHERE geo_pt <-> point(%s, %s) < %s ) SELECT i.photo_id, i.photo_location_city, i.photo_location_country, s.embedding <-> %s::vector AS vec_dist, geo_pt <-> point(%s, %s) AS geo_dist FROM spatial_subset s JOIN images i USING(photo_id) ORDER BY vec_dist, geo_dist LIMIT %s; """ lat, lon = geocode_city("New York City") radius_deg = 50/111 params = (lon, lat, radius_deg, query_emb.tolist(), lon, lat, 5) cur.execute(sql, params)
Reusing the display code from above, we will see
We can also experiment with a query image to find similar images nearest a particular location.
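For example, a minimal variation reusing the local Tokyo Tower photo from earlier and the same SQL statement, this time centered on Tokyo with the same 50 km radius:

query_emb = get_embedding("tokyo-wallpaper.jpg", "image")
lat, lon = geocode_city("Tokyo")
params = (lon, lat, 50 / 111, query_emb.tolist(), lon, lat, 5)
cur.execute(sql, params)
results = cur.fetchall()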
Conclusion
In this blog post, we performed multimodal queries by computing a CLIP embedding, asking Postgres for the 5 closest vectors, and rendering the matching photos in our notebook. We also combined semantic queries with metadata and spatial search to demonstrate the flexibility of a hybrid database such as PostgreSQL.