This is #2 of the Semantic Search with PostgreSQL series. In #1, I set up PostgreSQL and the pgvector extension. This time, let’s build on top of that and run different kinds of queries against PostgreSQL with pgvector.
By the end of this post, we will be able to search for photos using an arbitrary location, natural language, and an image. We will also be able to combine everything to pull only photos shot on a Nikon within 50 km of New York City that contain modern architecture. Along the way, we’ll compare the features PostgreSQL and Chroma offer as vector engines.
Why PostgreSQL?
In the previous experiment, we loaded the Unsplash 25k image embeddings with their respective metadata into Chroma and performed semantic search. Chroma is very fast at embedding similarity calculations and handles light-duty metadata search with basic filtering.
However, in scenarios where the data is more complex (e.g., e-commerce with photos, product info, users, and transactions), a real relational database that can join tables serves much better. The same goes for my existing project, where I need semantic (photo and text), relational (joining various tables) AND geospatial search: a more versatile platform such as PostgreSQL with extensions fits the bill perfectly.
In our experiment, the Unsplash Lite dataset contains several files, each linked by the photo_id key. Imagine this trip planning scenario:
- I have a photo of a landmark (e.g., Tokyo Tower)
- I want to find the most similar photos to it
- The targets must be shot around Tokyo (or, if we are adventurous, we can target Shanghai to see if there’s anything there that looks like Tokyo Tower)
- Once I have the targets, I want to check out the photo collections they belong to, to expand my exploration. For example, if I get the Oriental Pearl Tower in Shanghai as a target, it might have been added to a collection of other modern Shanghai skyscrapers. Now I can visit more places on my trip!
This scenario would be very hard to pull off in Chroma, especially #3 (spatial search) and #4 (the classic SQL join), and that’s where PostgreSQL comes in.
Obtain Unsplash dataset
Get all metadata
note: this is a rehash of what we did in the previous post. I am reposting it here for clarity.
First, we need to download the Unsplash Lite dataset with metadata such as photo ID, image URL, photographer name, AI-generated description, geolocation, EXIF data, keywords used for the image, colors, etc.
After downloading the file, we load the data into a dataframe.
import numpy as np
import pandas as pd
import glob

path = '.'
documents = ['photos', 'keywords', 'collections', 'conversions', 'colors']
datasets = {}

for doc in documents:
    print(doc)
    files = glob.glob(doc + ".csv000*")
    print(files)
    subsets = []
    for filename in files:
        print(filename)
        df = pd.read_csv(filename, sep='\t', header=0)
        subsets.append(df)
    if subsets:
        datasets[doc] = pd.concat(subsets, axis=0, ignore_index=True)
    else:
        # Handle the empty case
        datasets[doc] = pd.DataFrame()
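To sanity-check the load, we can print the shape of each dataframe; the exact numbers depend on the dataset version you downloaded.

for doc, frame in datasets.items():
    print(doc, frame.shape)   # rows x columns for each metadata file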
Get precomputed image embeddings
note: again this is a rehash of the previous post.
For the 25k Unsplash images, we will use the precomputed image embeddings provided in the SentenceTransformer documentation.
import requests

url = "http://sbert.net/datasets/unsplash-25k-photos-embeddings.pkl"
response = requests.get(url)
with open("unsplash-25k-photos-embeddings.pkl", "wb") as file:
    file.write(response.content)
We then deserialize the file to get both the image names and embeddings.
import pickle

emb_filename = "unsplash-25k-photos-embeddings.pkl"
with open(emb_filename, 'rb') as fIn:
    img_names, img_emb = pickle.load(fIn)
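As an optional sanity check, the number of file names should match the number of embedding rows, and CLIP ViT-B/32 produces 512-dimensional vectors.

print(len(img_names))   # number of photo file names
print(img_emb.shape)    # (number of images, 512)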
Get source images
We’ll also save the 25k Unsplash images for offline use, per the SentenceTransformer documentation.
import os
import zipfile
from tqdm import tqdm
from sentence_transformers import util

img_folder = 'photos/'
if not os.path.exists(img_folder) or len(os.listdir(img_folder)) == 0:
    os.makedirs(img_folder, exist_ok=True)

    photo_filename = 'unsplash-25k-photos.zip'
    if not os.path.exists(photo_filename):  # Download dataset if it does not exist
        util.http_get('http://sbert.net/datasets/' + photo_filename, photo_filename)

    # Extract all images
    with zipfile.ZipFile(photo_filename, 'r') as zf:
        for member in tqdm(zf.infolist(), desc='Extracting'):
            zf.extract(member, img_folder)
Ingest to Postgres
Next, we will use PostgreSQL to store the embeddings, together with their respective metadata.
Prepare data
Since we want to link up the embeddings to their respective metadata, we need to find a common value between them.
- In the embedding pickle file, img_names entries are in the format xyz.jpg
- In the datasets['photos'] dataframe, the photo_id column stores each image name without the .jpg suffix

Thus we will:

- Strip the .jpg suffix from the pickle’s img_names
- Rename that column to photo_id
- Merge it with the metadata dataframe on its photo_id column
First, we strip the suffix:
photo_ids = [name.rsplit('.', 1)[0] for name in img_names]
We then create a DataFrame with one row per image and an embedding column holding the vectors we will push into pgvector later. We convert each embedding to a plain Python list of floats so that psycopg2 can serialize it correctly.
df = pd.DataFrame({
    'photo_id': photo_ids,
    'embedding': [emb.tolist() for emb in img_emb]  # plain Python lists, which psycopg2 can adapt
})
We then merge the embeddings with the metadata on the photo_id key.
metadata = datasets['photos']
df = df.merge(metadata, on='photo_id', validate='one_to_one')
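Since merge() performs an inner join by default, it is worth confirming that no embeddings were silently dropped for lack of metadata; a minimal check:

# If these differ, some embeddings had no matching metadata row
print(len(photo_ids), len(df))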
Later, we will use photo_id
as the primary key in our Postgres table DDL.
Create a new database
Create a dedicated database:
sudo -u postgres createdb imagesdb
Enable the vector extension:
sudo -u postgres psql -d imagesdb -c "CREATE EXTENSION IF NOT EXISTS vector;"
Connect to database from Python
Install psycopg2 (the PostgreSQL adapter for Python):
pip install psycopg2-binary
Connect to the newly created imagesdb database:

import os
import psycopg2

conn = psycopg2.connect(
    host="127.0.0.1",
    port=5432,
    dbname="imagesdb",
    user="postgres",
    password=os.getenv("PG_PASS")
)
cur = conn.cursor()
Enable pgvector extension
Even though we installed the pgvector package in the previous post, the vector data type only exists in databases where the extension has been enabled. We have to run CREATE EXTENSION vector; in every database where we plan to use the VECTOR(...) data type.
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
conn.commit()
Create a table
We need to make sure the table’s VECTOR(<dim>) dimension matches our actual embedding length.
Since the metadata contains a stats_views column with a maximum value of about 2.7 billion, we have to use BIGINT for that column. Ideally, we detect overflow in any integer column and switch to BIGINT automatically, which is what the loop below does.
dim = len(df['embedding'].iloc[0])
int32_max = np.iinfo(np.int32).max

col_defs = []
for col, dtype in df.drop(['photo_id', 'embedding'], axis=1).dtypes.items():
    if dtype.kind == 'i':
        pg_type = 'BIGINT' if df[col].max() > int32_max else 'INTEGER'
    elif dtype.kind == 'f':
        pg_type = 'DOUBLE PRECISION'
    else:
        pg_type = 'TEXT'
    col_defs.append(f"{col} {pg_type}")

create_sql = f"""
DROP TABLE IF EXISTS images;
CREATE TABLE images (
    photo_id TEXT PRIMARY KEY,
    embedding VECTOR({dim}),
    {', '.join(col_defs)}
);
"""
cur.execute(create_sql)
conn.commit()
Insert rows
from psycopg2 import sql
from psycopg2.extras import execute_values

columns = df.columns.tolist()
insert_sql = sql.SQL("INSERT INTO images ({}) VALUES %s").format(
    sql.SQL(', ').join(map(sql.Identifier, columns))
)

records = [
    tuple([row['photo_id'], row['embedding']] +
          [row[c] for c in columns if c not in ('photo_id', 'embedding')])
    for _, row in df.iterrows()
]

execute_values(cur, insert_sql.as_string(cur), records, page_size=1000)
conn.commit()
Verify
Log in to Postgres and connect to the imagesdb database we just created:

sudo -u postgres psql -d imagesdb

List all tables in the database:

\dt

We’ll see the images table we just created. Examine the images table’s schema (column names, data types, default values, indexes) with \d images, or get extended details (including storage and size) with \d+ images.

Finally, check the images table’s row count to make sure everything was ingested:

SELECT COUNT(*) FROM images;
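We can run the same row-count check from our Python session as well, reusing the cursor from earlier:

cur.execute("SELECT COUNT(*) FROM images;")
print(cur.fetchone()[0])   # should equal len(df)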
Query
First let’s connect to our database
import os
import psycopg2

conn = psycopg2.connect(
    host="127.0.0.1", port=5432,
    dbname="imagesdb", user="postgres",
    password=os.getenv("PG_PASS")
)
cur = conn.cursor()
Search metadata
Let’s now try a simple query to display the first 5 images taken by Canon, using the exif_camera_make
column in the metadata.
from IPython.display import HTML, display

cur.execute("SELECT photo_id FROM images WHERE exif_camera_make ILIKE %s LIMIT 5;", ("Canon",))
ids = [r[0] for r in cur.fetchall()]

img_tags = ""
for pid in ids:
    path = f"photos/{pid}.jpg"
    if os.path.exists(path):
        img_tags += f"<div style='margin:4px'><img src='{path}' width='200'></div>"
    else:
        img_tags += f"<div style='margin:4px;color:red'>missing: {pid}.jpg</div>"

html = f"""
<div style="
  display: flex;
  flex-direction: row;
  align-items: flex-start;
  flex-wrap: nowrap;
">
  {img_tags}
</div>
"""
display(HTML(html))
We will get the following:
Semantic query
Let’s reuse the embedding function from my previous post to generate embeddings for both text and image queries.
def get_embedding(query, type):
    """
    Obtain an embedding using SentenceTransformer with CLIP.
    This function handles both text and image, denoted by "type".
    For an image, if the query string starts with "http", it downloads the image.
    Otherwise, it treats the query as a local file path.
    Returns a NumPy array embedding.
    """
    from PIL import Image
    import requests
    from io import BytesIO
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('clip-ViT-B-32')

    if type == "image":
        # Check if query is an online URL.
        if query.startswith("http"):
            response = requests.get(query)
            if response.status_code != 200:
                raise Exception(f"Failed to download image from URL: {query}")
            image = Image.open(BytesIO(response.content))
        else:
            image = Image.open(query)
        # Get the image embedding as a NumPy array.
        embedding = model.encode(image, convert_to_tensor=False)
    else:
        embedding = model.encode(query)

    return embedding
Query with text
We can then run a text query to find the images most similar to a text phrase, using pgvector’s k-NN distance operator (<->) to rank results by distance. The steps are:
Compute a CLIP-ViT-B/32 embedding for our query text
query_emb = get_embedding("beautiful lake","text")
Execute a SQL query against the images table. Let’s break this down:

- In psycopg2, %s is a placeholder for a query parameter. When we call cur.execute(sql, (query_emb.tolist(),)) later, psycopg2 takes query_emb.tolist() and substitutes it for %s in the SQL statement, handling quoting and escaping exactly as Postgres expects.
- ::vector casts the query embedding array into the pgvector vector type.
- The <-> operator now accepts the parameter as a true vector and computes the Euclidean distance used for nearest-neighbor ranking.
- AS distance names the computed distance so we can reference it in ORDER BY, which sorts the result set in ascending order.
- query_emb.tolist() turns the embedding NumPy array into a plain Python list of floats. psycopg2 does not know how to adapt a NumPy ndarray, but it can serialize a Python list into a Postgres array literal (e.g. {0.1,0.2,…}).

# Parameterized KNN query: smallest distance first
sql = """
SELECT photo_id,
       exif_camera_make,
       embedding <-> %s::vector AS distance
FROM images
ORDER BY distance
LIMIT 5;
"""
cur.execute(sql, (query_emb.tolist(),))
results = cur.fetchall()
Display the thumbnails inline, sorted by semantic distance from the text query.
items = []
for pid, make, dist in results:
    img_src = f"photos/{pid}.jpg"
    items.append(f"""
    <div style="margin:4px;text-align:center">
      <img src="{img_src}" width="200"><br>
      <small>{pid}<br>{make}<br>d={dist:.2f}</small>
    </div>
    """)

html = f"<div style='display:flex'>{''.join(items)}</div>"
display(HTML(html))
We get this result:
Query with image
Similarly, we can do an image query with the same Tokyo Tower photo we used last time. Just change the first line of code to
query_emb = get_embedding("tokyo-wallpaper.jpg","image")
and run it. We will get
Likewise, we can query with an online image, such as this Mt. Fuji photo from Wikipedia. Just substitute the first 2 lines of code with
imgURL="https://upload.wikimedia.org/wikipedia/commons/f/f8/View_of_Mount_Fuji_from_%C5%8Cwakudani_20211202.jpg"
query_emb = get_embedding(imgURL,"image")
We will get this
Semantic and metadata search
What if we want to search with an image or natural-language query, but also scope the results to photos taken by a specific camera make (e.g. Canon)? In this case, we just add a WHERE clause to the SQL statement, like any normal SQL query.
sql = """
SELECT photo_id,
exif_camera_make,
embedding <-> %s::vector AS distance
FROM images
WHERE exif_camera_make ILIKE %s
ORDER BY distance
LIMIT 5;
"""
params = (query_emb.tolist(), 'Canon%')
cur.execute(sql, params)
This filters for rows where the camera make starts with “Canon”, case-insensitively. We can use = %s in the SQL statement if we want an exact match, or pass the parameter as '%Canon%' for substring matching, as sketched below.
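For example, a minimal sketch of the substring variant, keeping the same SQL statement (the parameter value is just an illustration):

# Case-insensitive substring match: any make containing "Canon"
params = (query_emb.tolist(), '%Canon%')
cur.execute(sql, params)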
Notice that we now have two %s placeholders and a two-element tuple of parameters. We can swap in any camera make via params without changing the SQL string.
Rerunning our code with this updated SQL statement and the Wikipedia Mt. Fuji photo, we get the following results:
Multilingual search
We can also perform text search in multiple languages by encoding our query text with the clip-ViT-B-32-multilingual-v1 text embedding model, which is aligned with clip-ViT-B-32 and maps text in 50+ languages into the same vector space. For image queries, we continue to use the same clip-ViT-B-32 model.
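Here is a minimal sketch: we encode the text with the multilingual model and reuse the same k-NN query as before (the Japanese query string is just an illustration).

from sentence_transformers import SentenceTransformer

# Multilingual text encoder aligned with the clip-ViT-B-32 image space
ml_model = SentenceTransformer('clip-ViT-B-32-multilingual-v1')
query_emb = ml_model.encode("美しい湖")   # "beautiful lake" in Japanese

sql = """
SELECT photo_id,
       embedding <-> %s::vector AS distance
FROM images
ORDER BY distance
LIMIT 5;
"""
cur.execute(sql, (query_emb.tolist(),))
results = cur.fetchall()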
Spatial search
One of the factors driving my switch from Chroma to Postgres is spatial search. There are powerful extensions such as PostGIS and earthdistance/cube (which provides earth-curvature accuracy without PostGIS), but even PostgreSQL’s native planar geometry support proves quite useful for small areas. Let’s see!
note: The Unsplash Lite dataset comes with limited geodata, as only 28% of images have complete lat/long data. Nevertheless, this section is a quick proof-of-concept to show the versatility of using Postgres.
Update table for geometric data
We’ll first update our table schema to take advantage of PostgreSQL’s native geometric types such as point, which is a 2D coordinate (x,y). Other native geometric types include lseg, box and path, which we are not going to use in this test.
ALTER TABLE images
ADD COLUMN geo_pt point;
UPDATE images
SET geo_pt = point(photo_location_longitude, photo_location_latitude);
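If you prefer to run these statements from the Python session, a minimal equivalent using the cursor from earlier (assuming the longitude/latitude columns were ingested as numeric types) could look like this:

cur.execute("ALTER TABLE images ADD COLUMN IF NOT EXISTS geo_pt point;")
cur.execute("""
    UPDATE images
    SET geo_pt = point(photo_location_longitude, photo_location_latitude)
    WHERE photo_location_longitude IS NOT NULL
      AND photo_location_latitude IS NOT NULL;
""")
conn.commit()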
Geocoding
We’ll then use Nominatim, a free OpenStreetMap geocoding API, to obtain the latitude and longitude of a place programmatically. First install the client with pip install geopy.
Then we will write our geocode script to resolve a location name with its latitude and longitude:
from geopy.geocoders import Nominatim

def geocode_city(city_name: str):
    geolocator = Nominatim(user_agent="geo_demo")
    loc = geolocator.geocode(city_name)
    if not loc:
        raise ValueError(f"Couldn’t geocode: {city_name}")
    return loc.latitude, loc.longitude

location = "Kyoto"
lat, lon = geocode_city(location)
print(f"{location} → lat: {lat:.6f}, lon: {lon:.6f}")
We successfully get back Kyoto → lat: 35.011575, lon: 135.768144.
Simple spatial search
Now we have Kyoto’s geodata. Let’s find 5 images closest to it.
Our updated SQL statement looks like this, taking advantage of the table’s new point column. Let’s break down the spatial search part:
- In geo_pt <-> point(%s, %s), the <-> operator applied to two PostgreSQL point values returns the Euclidean (straight-line) distance between them.
  - This is identical to how we compare embeddings, except that <-> returns a result in the units of the point’s coordinate system. Since geo_pt was created as point(longitude, latitude), that distance is measured in degrees of longitude/latitude, not in meters or kilometers.
- point(%s, %s) constructs an in-query point from the longitude and latitude parameters supplied in cur.execute(sql, (lon, lat, 5)) below.
note: For more advanced usage, we can search by containment (e.g. box @> point), use && for bounding-box intersection, and call functions such as area(), length(), center(), and perimeter(). See the sketch at the end of this section for a containment example.
sql = """
SELECT
photo_id,
photo_location_city,
photo_location_country,
geo_pt <-> point(%s, %s) AS planar_dist
FROM images
WHERE geo_pt IS NOT NULL
ORDER BY planar_dist
LIMIT %s;
"""
We can then execute the query by passing along Kyoto’s geodata and displaying the results:
cur.execute(sql, (lon, lat, 5))
results = cur.fetchall()

items = []
for pid, city, country, dist in results:
    img_src = f"photos/{pid}.jpg"
    items.append(f"""
    <div style="margin:4px;text-align:center">
      <img src="{img_src}" width="200"><br>
      <small>{pid}<br>{city}<br>d={dist:.4f}</small>
    </div>
    """)

html = f"<div style='display:flex'>{''.join(items)}</div>"
display(HTML(html))
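Following up on the earlier note about containment operators, here is a minimal sketch of a bounding-box filter; the box corners are arbitrary coordinates roughly around Tokyo, purely for illustration:

# Photos whose geo_pt falls inside a longitude/latitude bounding box
cur.execute("""
    SELECT photo_id, photo_location_city
    FROM images
    WHERE geo_pt IS NOT NULL
      AND box(point(139.5, 35.5), point(140.0, 35.9)) @> geo_pt
    LIMIT 5;
""")
print(cur.fetchall())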
Spatial and semantic search
We can combine pgvector’s semantic k-NN with point-based spatial filtering to get “the most similar images to an image/phrase within X km of Tokyo.”
The beauty of this is that we’re mixing in natural language (e.g., to describe an ambience, particular object, etc.) to search for images nearest a location. In the following example, we want to look up 5 modern architecture photos within 50km of New York City.
Notice that this is a two-step process:
First, use a spatial pre-filter (WITH spatial_subset AS ...) to narrow down candidates by geography before computing expensive vector distances. In the spatial filter WHERE geo_pt <-> point(%s, %s) < %s:

- The < %s comparison accepts only images whose geo_pt distance is smaller than a threshold value (the third %s), which is radius_deg.
- Because the <-> distance is in degrees, we must translate our real-world radius (50 km) into an equivalent radius_deg threshold.
- We use radius_deg = 50 / 111 to do this. The numerator (50) is the search radius in kilometers, and the denominator (111) is the approximate number of kilometers per degree at mid-latitudes.
- Thus radius_deg ≈ 0.45 degrees: any image whose planar degree-distance to New York City is less than 0.45 is roughly within 50 km.

Second, rank the spatially filtered rows by their semantic similarity to our query embedding:

- s.embedding <-> %s::vector AS vec_dist computes the vector distance between each image’s embedding (s.embedding) and the supplied query embedding (%s::vector).
- geo_pt <-> point(%s, %s) AS geo_dist recomputes the planar geospatial distance for each final row, for reporting or tie-breaking.
- We join spatial_subset back to the full images table to retrieve all the other columns (city, country, etc.).
- ORDER BY vec_dist, geo_dist sorts the joined results by ascending semantic distance, with geospatial distance as a secondary sort: if two images have nearly identical embeddings, the tie is broken by proximity.
query_emb = get_embedding("modern architecture","text") sql = """ WITH spatial_subset AS ( SELECT photo_id, embedding FROM images WHERE geo_pt <-> point(%s, %s) < %s ) SELECT i.photo_id, i.photo_location_city, i.photo_location_country, s.embedding <-> %s::vector AS vec_dist, geo_pt <-> point(%s, %s) AS geo_dist FROM spatial_subset s JOIN images i USING(photo_id) ORDER BY vec_dist, geo_dist LIMIT %s; """ lat, lon = geocode_city("New York City") radius_deg = 50/111 params = (lon, lat, radius_deg, query_emb.tolist(), lon, lat, 5) cur.execute(sql, params)
Reusing the display code from above, we will see
We can also experiment with a query image to find similar images nearest a particular location.
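For example, a minimal variation reusing the local Tokyo Tower photo from earlier and the same SQL statement, this time centered on Tokyo with the same 50 km radius:

query_emb = get_embedding("tokyo-wallpaper.jpg", "image")
lat, lon = geocode_city("Tokyo")
params = (lon, lat, 50 / 111, query_emb.tolist(), lon, lat, 5)
cur.execute(sql, params)
results = cur.fetchall()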
Conclusion
In this blog post, we performed multimodal queries by computing a CLIP embedding, asking Postgres for the 5 closest vectors, and rendering the matching photos in our notebook. We also combined semantic queries with metadata and spatial search to demonstrate the flexibility of a hybrid database such as PostgreSQL.