In a previous blog post,, I demonstrated multimodal search against the Unsplash Lite dataset using Chroma. We saw how fast it was to setup the environment and perform similarity search using Chroma’s lightweight, in-memory vector database.
Will there be any scenarios where we want something that is beyond Chroma? In this Semantic Search with PostgreSQL series of posts, we will switch to a relational database - PostgreSQL - and find out.
This is #1 of the series in which we will setup the environment and ingest data. In #2, we will do the actual search and explore the pros and cons.
By the end of this post, we will have setup the followings:
- A Linux Distro in Windows Linux Subsystem (WSL). Since my primary machine is Windows, and I’ll run PostgreSQL on Linux, I will prepare my existing WSL with the right distro.
- PostgreSQL
- pgvector, an extension for PostgreSQL that supports vector search. Note that this install natively in Linux but needs to be manually built in Windows. That makes me start this whole project using WSL.
- VS Code in Windows is able to create and query tables in WSL PostgreSQL.
In the next blog post, we will build on top of this to do multimodal search, and compare the results with our previous experiment in Chroma.
Install a Linux Distro
Choose a distro
I had Alpine Linux distro in WSL 2 for a few years. It installs and loads fast with an extremely minimal footprint. The tradeoff is that a lot of packages are absent in the default repository and thus cannot be installed in a straight-forward way. Since I’ll migrate an existing production database over to take advantage of PostgreSQL’s extensions, I’ll like all these installs to be as problem-free as possible. But I still want a distro that is minimal both in installation and memory usage.
Reviewing PostgreSQL, pgvector and a few other extensions’ system requirements narrow me down to either Ubuntu and Debian.
A more in-depth research of these distros reveal that while Alpine leads in raw startup speed and lowest footprint, Debian offers a balance of performance and ease of use. Ubuntu carries the most defaults and services, making it slightly heavier but often more convenient. Since Debian is 50% the size of Ubuntu with a leaner base system, I pick Debian for its good balance between functionality and size.
- For a fresh install of WSL
- open Powershell or Command Prompt
- execute
wsl --install
- reboot the machine
- If you already have an existing WSL installation
- update it by
wsl --update
- restart it by
wsl --shutdown
- update it by
- To install Debian distro
execute
wsl --install -d Debian
once installation finishes, Debian will launch automatically. We’ll see prompts to:
- create a Linux username
- set and confirm a password
After log in, immediately update and upgrade packages with
sudo apt update && sudo apt full-upgrade -y
Run
wsl --list --all
in Powershell or Terminal to view a list of already installed distributions.Since I now have 2 distros (Alpine and Debian), I will set Debian as the default by
wsl --setdefault Debian
.Now, running
wsl
with no arguments will drop me into Debian.- I can re-run the list command
wsl --list --all
to confirm the “Default” tag marked to Debian.
- I can re-run the list command
If I just want to start Alpine or Debian without changing the default, I can use the following to switch back and forth instantly.
wsl -d Alpine wsl -d Debian
Manage distro footprint
We can list every distro’s install path and size with the following Powershell script, which walks the WSL registry entries, finds each distro’s folder, and sums its files.
Get-ChildItem "HKCU:\Software\Microsoft\Windows\CurrentVersion\Lxss" | ForEach-Object {
$props = Get-ItemProperty $_.PSPath
$name = $props.DistributionName
$path = $props.BasePath
$size = (Get-ChildItem -Recurse -LiteralPath $path |
Measure-Object -Property Length -Sum).Sum /1MB
[PSCustomObject]@{
Distro = $name
Directory = $path
SizeMB = [math]::Round($size,2)
}
} | Format-Table -AutoSize
I can see that Debian takes up ~1Gb while Alpine is at ~250Mb.
Or we can get a more detailed view by peeking directly into the distro.
Launch the distro by
wsl -d Debian
We can check total filesystem size and usage with
df -h /
, which shows the VHD’s allocated size, used space, and available space.We can also measure the actual used space under root with
sudo du -shx /*
.-s
gives a summary for each top-level directory-h
prints sizes in human-readable form-x
stays on the same filesystem (excludes/mnt
)
If we find a large VHDX and need to reclaim space, run
wsl --shutdown
and then use PowerShell’sOptimize-VHD
on the.vhdx
file.For an interactive view of where space is used inside Linux, install and run
ncdu
:
sudo apt install ncdu
sudo ncdu --exclude /mnt /
- We can also perform periodic cleanup with
sudo apt autoremove && sudo apt clean
.
Install Postgres SQL
Once we launch Debian, install PostgreSQL and the contrib utilities by
sudo apt install -y postgresql postgresql-contrib
.This creates the
postgres
service user and initializes our first database cluster.Start PostgreSQL:
sudo service postgresql start
Verify the installation and note the PostgreSQL version:
psql --version
We can also query a list of existing databases by
sudo -u postgres psql -c "\l"
Setup authentication over TCP
PostgreSQL uses peer authentication by default for the local Linux user. This means that there is no preset password for the PostgreSQL role postgres
.When we run sudo -u postgres psql
, we are authenticated by our Linux account. This works fine when we are in an interactive psql session.
However, if we plan to connect over TCP (e.g. from Windows dev tools) to Postgres, we’ll need to do the followings:
- Setup a password
- Have Postgres listening on TCP (localhost:5432)
- Switch from peer to password (md5) authentication for local TCP
Set postgres Password
Switch to the postgres account and open psql:
sudo -u postgres psql
At the
psql
prompt, set a new password:\password postgres
You’ll be prompted to enter and confirm your chosen password. Or you can execute
ALTER USER postgres WITH PASSWORD 'ANewPassword';
Exit
psql
:\q
Restart service:
sudo service postgresql restart
Double-check your password In WSL:
sudo -u postgres psql -c "SELECT rolname, rolpassword FROM pg_authid WHERE rolname='postgres';"
You should see a non-empty rolpassword in hash.
Setup listening on TCP localhost
Verify Postgres is listening on TCP by running this in WSL
sudo ss -ltnp | grep 5432
If we do not see something like
LISTEN 0 128 127.0.0.1:5432 *:* users:(("postgres",pid=1234,fd=5))
then Postgres isn’t bound to TCP. We will need to edit postgresql.conf
to bind to localhost. Do the followings:
- Open
\\wsl$\Debian\etc\postgresql\<ver>\main\postgresql.conf
from Windows - Locate or add:
listen_addresses = 'localhost'
- Save and then in WSL:
sudo service postgresql restart
. Re‐runss -ltnp
check to confirm.
Switch to MD5 Password Auth for TCP
Open the host-based auth file
\\wsl$\Debian\etc\postgresql\<ver>\main\pg_hba.conf
in your editor.Add these two lines near the top (after any local “peer” lines):
# allow TCP local connections host all all 127.0.0.1/32 md5 host all all ::1/128 md5
Save the file and restart Postgres:
sudo service postgresql restart
Test connection
Now we’re ready to test if our database works properly in different scenarios. We’ll:
- Start the service
- Connect via psql
- Create a test table, insert rows, query, and clean up
- Test it in socket and localhost modes
Socket Connection in WSL (peer auth)
Start PostgreSQL Service in WSL:
sudo service postgresql start
Verify it’s running:
sudo service postgresql status
Switch to the postgres Linux user and open the prompt:
sudo -u postgres psql
You should see a prompt like:
postgres=#
At the
postgres=#
prompt, enter the following SQL commands one section at a time:-- 1. Create a simple test table CREATE TABLE test_users ( id SERIAL PRIMARY KEY, name TEXT, age INTEGER ); -- 2. Insert rows INSERT INTO test_users (name, age) VALUES ('Alice', 30), ('Bob', 25), ('Carol', 28); -- 3. Query the table SELECT * FROM test_users;
After each command, you should see feedback:
CREATE TABLE → CREATE TABLE INSERT → INSERT 0 3 SELECT → table rows printed
Exit psql
\q
That returns us to the WSL shell. If all steps succeeded, our PostgreSQL instance is healthy and responsive in WSL.
TCP Connection in WSL (md5 auth)
Execute in WSL:
psql -h 127.0.0.1 -U postgres -d imagesdb -W
Enter the postgres password, then in the
psql
prompt, execute this to ensure the TCP connection works too.SELECT * FROM test_users
If this fails, we can test TCP with trust to bypass password temporarily as follows:
Edit
\\wsl$\Debian\etc\postgresql\<ver>\main\pg_hba.conf
to trust localhost TCP:# add these lines at the top of pg_hba.conf host all all 127.0.0.1/32 trust host all all ::1/128 trust
Restart Postgres:
sudo service postgresql restart
Retry the connection in Step 1 above
psql -h 127.0.0.1 -U postgres -d imagesdb
You should get a prompt without being asked for a password.
At that prompt, repeat
SELECT * FROM test_users
. If that works, it confirms your TCP binding is fine and the md5/password has issues.If you want to avoid messing with the super-user password, create a dedicated user:
sudo -u postgres createuser devuser --pwprompt sudo -u postgres createdb imagesdb -O devuser
Then login with this user:
psql -h 127.0.0.1 -U devuser -d imagesdb -W
Windows Connection via WSL Interop
- Open PowerShell.
- Prefix your psql call with wsl:
wsl psql -h localhost -U postgres -d imagesdb -W
- Enter your password when prompted.
- Switch back to Windows VS Code.
- Use the same credentials/connection settings to connect (host=localhost, port=5432, user=postgres, password you set). For detailed steps, see this later section.
Install pgvector extension
Next we will enable the pgvector extension for vector similarity search. We’ll install the version matching our PostgreSQL server.
Direct install via APT
If your Linux distro’s stock repo ships pgvector, you can simply do the following to install it
Confirm your major PostgreSQL version by
psql --version | awk '{print $3}' | cut -d. -f1
.To get a more detailed view of versioning, use
psql --version
.Install the matching pgvector package:
sudo apt update sudo apt install -y postgresql-<VERSION>-pgvector
Switch to the
postgres
user and enable the extension in your database.\dx
lists all the installed extensions, and we should seevector
listed.sudo -u postgres psql CREATE EXTENSION vector; \dx
Pull from PGDG repo
Since Debian’s stock repos ship only PostgreSQL 15 core but not pgvector, we need to pull it from the PGDG (PostgreSQL Global Development Group) repo.
Install
wget
:sudo apt install -y wget
Import the PGDG signing key:
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo tee /etc/apt/trusted.gpg.d/pgdg.asc
Find your Linux distro’s code name with
grep VERSION_CODENAME /etc/os-release
.For me, it’s
bookworm
which stands for Debian 12.Add the PGDG source list. Replace
bookworm
with the correct codename for your distroecho "deb http://apt.postgresql.org/pub/repos/apt/ \ bookworm-pgdg main" \ | sudo tee /etc/apt/sources.list.d/pgdg.list
We can then follow step 2 above to install and enable the extension.
Manually build and install
If you’d rather not add an external repo, you can compile pgvector yourself:
Install build tools and headers for Postgres:
sudo apt install -y postgresql-server-dev-all build-essential git
Clone, build and install pgvector:
git clone https://github.com/pgvector/pgvector.git cd pgvector make sudo make install
We can then repeat step 2 above to install and enable the extension.
Test the extension
Now let’s test the extension.
In
psql
, run:SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';
We should see pgvector and its version.
Create a test table and insert a vector. If the query returns your row, pgvector is working!
CREATE TABLE items(id serial PRIMARY KEY, embedding vector(3)); INSERT INTO items (embedding) VALUES ('[1,2,3]'); SELECT * FROM items ORDER BY embedding <-> '[2,1,3]' LIMIT 1;
Working with WSL from VS Code in Windows
Since Debian WSL runs its own Python installation isolated from Windows, any packages we’ve installed in Windows don’t automatically appear in the Linux environment.
However, my current development environment (VS Code, Python and packages) is in Windows, so I need to find a way to access our Postgres database in WSL. There are several ways:
- reinstall everything in Debian
- run Windows Python interpreter (and .py files) in WSL
- connect to WSL Postgres from Windows, and use the usual dev environment
Reinstall Python & packages in WSL
This is tedious and adds footprint, but keeps Linux self-contained. It’s ideal if we want to dockerize our dev environment later.
Use Windows Python from within WSL
This is good when we need Linux file paths or tools but rely on our existing Python environment for installed packages.
Since WSL automatically adds Windows PATH to its own, we can run Windows executables by simply appending .exe
. For example, from Debian shell, we can print the path of our Windows-based Python (for me, in a virtual environment):
/mnt/c/Users/k/CodeProjects/.venv/Scripts/python.exe -c "import sys; print(sys.executable)"
We can also list all the installed packages
/mnt/c/Users/k/CodeProjects/.venv/Scripts/python.exe -m pip list
To confirm a particular package is available
python.exe -m pip list | grep pandas
If pandas
shows up, the Windows environment is in use.
We can then run our python scripts like this:
/mnt/c/Users/k/CodeProjects/.venv/Scripts/python.exe test.py
Setup shortcuts
Since I install all my packages in a virtual environment, it will be very tedious to type up the full path every time. We can create a shortcut by define a custom alias that points directly to the venv’s Python executable.
In
/home/username/.bashrc
file, add this to the last line:alias winpy='/mnt/c/Users/k/CodeProjects/.venv/Scripts/python.exe'
Reload:
source ~/.bashrc
Now we can simply run
winpy abc.py
to uses our.venv
Python interpreter every time.
By placing that alias
line in ~/.bashrc
, Bash will load it on every new interactive WSL shell. This includes a terminal or from VS Code. So we won’t have to manually source
it each time.
To double-check, open a fresh WSL window and run: type winpy
. It should say:
winpy is aliased to /mnt/c/Users/k/CodeProjects/.venv/Scripts/python.exe
Connect to WSL Postgres from Windows
This is the more convenient approach as we can keep our existing Windows venv and all installed packages, do Python/Jupyter work in Windows VS Code, and hit our Debian‐WSL Postgres over localhost. We are not bound to only running .py scripts or using the Python interactive shell. And there’s no need to hop shells or reinstall anything!
In previous sections, we have setup Postgres to listening on TCP and switched to MD5 Password Authentication for local TCP.
We’ll now use a minimal Pyton script that connects to WSL Postgres, queries a test table, and prints its rows. We’ll target the default postgres database.
We created a test_users table in a previous step. Confirm it lives in Postgres with the following in WSL (using the easiest: socket connection - peer auth mode). This prints out the rows.
sudo -u postgres psql -d postgres -c "SELECT * FROM test_users;"
Define a User Variable in Windows for our Postgres password, so we do not need to pass it as plain text in code for security and privacy.
- Press Win + S, type “Environment Variables” and open
Edit the system environment variables
. - In
System Properties
clickEnvironment Variables…
- Under
User variables
clickNew…
- Variable name: PG_PASS
- Variable value: YourPassword
- Click OK through all dialogs.
- Press Win + S, type “Environment Variables” and open
Open VS Code in Windows
Create a file test_users_query.py and temporarily print out the env variable:
import os print("PG_PASS =", os.getenv("PG_PASS"))
If it works, replace it with the following code.
python import os import psycopg2 def main(): conn = psycopg2.connect( host="127.0.0.1", port=5432, dbname="postgres", user="postgres", password=os.getenv("PG_PASS") ) cur = conn.cursor() cur.execute("SELECT id, name, age FROM test_users ORDER BY id;") rows = cur.fetchall() print("test_users contents:") for r in rows: print(f" id={r[0]}, name={r[1]}, age={r[2]}") cur.close() conn.close() if __name__ == "__main__": main()
Note that by default, if you pass
host="localhost"
, psycopg2 may try both UNIX socket and IPv6. Instead, we explicitly use an IP address so it is forced to use an IPv4/TCP connection.Running this will pull PG_PASS from the global environment. We should see:
test_users contents: id=1, name=Alice, age=30 id=2, name=Bob, age=25 id=3, name=Carol, age=28
With this, we have successfully verified the connection from our Windows dev environment to WSL Postgres, and queried an existing table.
Conclusion
In this blog post, I have illustrated how to setup PostgreSQL as both a relational and vector database in WSL. We perform connection testing to ensure everything works from VS Code Windows. In my next blog post, I am going to do multimodal search against PostgreSQL.
Comments