In a previous blog post,, I demonstrated multimodal search against the Unsplash Lite dataset using Chroma. We saw how fast it was to setup the environment and perform similarity search using Chroma’s lightweight, in-memory vector database.

Will there be any scenarios where we want something that is beyond Chroma? In this Semantic Search with PostgreSQL series of posts, we will switch to a relational database - PostgreSQL - and find out.

This is #1 of the series in which we will setup the environment and ingest data. In #2, we will do the actual search and explore the pros and cons.

By the end of this post, we will have setup the followings:

  • A Linux Distro in Windows Linux Subsystem (WSL). Since my primary machine is Windows, and I’ll run PostgreSQL on Linux, I will prepare my existing WSL with the right distro.
  • PostgreSQL
  • pgvector, an extension for PostgreSQL that supports vector search. Note that this install natively in Linux but needs to be manually built in Windows. That makes me start this whole project using WSL.
  • VS Code in Windows is able to create and query tables in WSL PostgreSQL.

In the next blog post, we will build on top of this to do multimodal search, and compare the results with our previous experiment in Chroma.

Install a Linux Distro

Choose a distro

I had Alpine Linux distro in WSL 2 for a few years. It installs and loads fast with an extremely minimal footprint. The tradeoff is that a lot of packages are absent in the default repository and thus cannot be installed in a straight-forward way. Since I’ll migrate an existing production database over to take advantage of PostgreSQL’s extensions, I’ll like all these installs to be as problem-free as possible. But I still want a distro that is minimal both in installation and memory usage.

Reviewing PostgreSQL, pgvector and a few other extensions’ system requirements narrow me down to either Ubuntu and Debian.

A more in-depth research of these distros reveal that while Alpine leads in raw startup speed and lowest footprint, Debian offers a balance of performance and ease of use. Ubuntu carries the most defaults and services, making it slightly heavier but often more convenient. Since Debian is 50% the size of Ubuntu with a leaner base system, I pick Debian for its good balance between functionality and size.

  1. For a fresh install of WSL
    • open Powershell or Command Prompt
    • execute wsl --install
    • reboot the machine
  2. If you already have an existing WSL installation
    • update it by wsl --update
    • restart it by wsl --shutdown
  3. To install Debian distro
    1. execute wsl --install -d Debian

    2. once installation finishes, Debian will launch automatically. We’ll see prompts to:

      • create a Linux username
      • set and confirm a password
    3. After log in, immediately update and upgrade packages with

      sudo apt update && sudo apt full-upgrade -y

    4. Run wsl --list --all in Powershell or Terminal to view a list of already installed distributions.

    5. Since I now have 2 distros (Alpine and Debian), I will set Debian as the default by wsl --setdefault Debian.

      Now, running wsl with no arguments will drop me into Debian.

      • I can re-run the list command wsl --list --all to confirm the “Default” tag marked to Debian.
    6. If I just want to start Alpine or Debian without changing the default, I can use the following to switch back and forth instantly.

      wsl -d Alpine
      wsl -d Debian
      

Manage distro footprint

We can list every distro’s install path and size with the following Powershell script, which walks the WSL registry entries, finds each distro’s folder, and sums its files.

Get-ChildItem "HKCU:\Software\Microsoft\Windows\CurrentVersion\Lxss" | ForEach-Object {
  $props = Get-ItemProperty $_.PSPath
  $name  = $props.DistributionName
  $path  = $props.BasePath
  $size  = (Get-ChildItem -Recurse -LiteralPath $path |
             Measure-Object -Property Length -Sum).Sum /1MB
  [PSCustomObject]@{
    Distro    = $name
    Directory = $path
    SizeMB    = [math]::Round($size,2)
  }
} | Format-Table -AutoSize

I can see that Debian takes up ~1Gb while Alpine is at ~250Mb.

Or we can get a more detailed view by peeking directly into the distro.

  1. Launch the distro by wsl -d Debian

  2. We can check total filesystem size and usage with df -h /, which shows the VHD’s allocated size, used space, and available space.

  3. We can also measure the actual used space under root with

    sudo du -shx /* .

    • -s gives a summary for each top-level directory
    • -h prints sizes in human-readable form
    • -x stays on the same filesystem (excludes /mnt)
  4. If we find a large VHDX and need to reclaim space, run wsl --shutdown and then use PowerShell’s Optimize-VHD on the .vhdx file.

  5. For an interactive view of where space is used inside Linux, install and run ncdu:

  sudo apt install ncdu
  sudo ncdu --exclude /mnt /
  1. We can also perform periodic cleanup with sudo apt autoremove && sudo apt clean.

Install Postgres SQL

  1. Once we launch Debian, install PostgreSQL and the contrib utilities by

    sudo apt install -y postgresql postgresql-contrib.

    This creates the postgres service user and initializes our first database cluster.

  2. Start PostgreSQL: sudo service postgresql start

  3. Verify the installation and note the PostgreSQL version:

    psql --version
    

    We can also query a list of existing databases by

    sudo -u postgres psql -c "\l"
    

Setup authentication over TCP

PostgreSQL uses peer authentication by default for the local Linux user. This means that there is no preset password for the PostgreSQL role postgres.When we run sudo -u postgres psql, we are authenticated by our Linux account. This works fine when we are in an interactive psql session.

However, if we plan to connect over TCP (e.g. from Windows dev tools) to Postgres, we’ll need to do the followings:

  • Setup a password
  • Have Postgres listening on TCP (localhost:5432)
  • Switch from peer to password (md5) authentication for local TCP

Set postgres Password

  1. Switch to the postgres account and open psql: sudo -u postgres psql

  2. At the psql prompt, set a new password: \password postgres

    You’ll be prompted to enter and confirm your chosen password. Or you can execute

    ALTER USER postgres WITH PASSWORD 'ANewPassword';

  3. Exit psql: \q

  4. Restart service: sudo service postgresql restart

  5. Double-check your password In WSL:

    sudo -u postgres psql -c "SELECT rolname, rolpassword FROM pg_authid WHERE rolname='postgres';"
    

    You should see a non-empty rolpassword in hash.

Setup listening on TCP localhost

Verify Postgres is listening on TCP by running this in WSL

sudo ss -ltnp | grep 5432

If we do not see something like

      LISTEN 0  128  127.0.0.1:5432  *:* users:(("postgres",pid=1234,fd=5))

then Postgres isn’t bound to TCP. We will need to edit postgresql.conf to bind to localhost. Do the followings:

  1. Open \\wsl$\Debian\etc\postgresql\<ver>\main\postgresql.conf from Windows
  2. Locate or add: listen_addresses = 'localhost'
  3. Save and then in WSL: sudo service postgresql restart . Re‐run ss -ltnp check to confirm.

Switch to MD5 Password Auth for TCP

  1. Open the host-based auth file \\wsl$\Debian\etc\postgresql\<ver>\main\pg_hba.conf in your editor.

  2. Add these two lines near the top (after any local “peer” lines):

    # allow TCP local connections
    host    all     all     127.0.0.1/32    md5
    host    all     all     ::1/128         md5
    
  3. Save the file and restart Postgres:

    sudo service postgresql restart

Test connection

Now we’re ready to test if our database works properly in different scenarios. We’ll:

  • Start the service
  • Connect via psql
  • Create a test table, insert rows, query, and clean up
  • Test it in socket and localhost modes

Socket Connection in WSL (peer auth)

  1. Start PostgreSQL Service in WSL:

    sudo service postgresql start

  2. Verify it’s running:

    sudo service postgresql status

  3. Switch to the postgres Linux user and open the prompt: sudo -u postgres psql

    You should see a prompt like: postgres=#

  4. At the postgres=# prompt, enter the following SQL commands one section at a time:

    -- 1. Create a simple test table
    CREATE TABLE test_users (
    id SERIAL PRIMARY KEY,
    name TEXT,
    age INTEGER
    );
    
    -- 2. Insert rows
    INSERT INTO test_users (name, age) VALUES
    ('Alice', 30),
    ('Bob',   25),
    ('Carol', 28);
    
    -- 3. Query the table
    SELECT * FROM test_users;
    

    After each command, you should see feedback:

    CREATE TABLE → CREATE TABLE
    INSERT → INSERT 0 3
    SELECT → table rows printed
    
  5. Exit psql \q

    That returns us to the WSL shell. If all steps succeeded, our PostgreSQL instance is healthy and responsive in WSL.

TCP Connection in WSL (md5 auth)

  1. Execute in WSL:

    psql -h 127.0.0.1 -U postgres -d imagesdb -W

  2. Enter the postgres password, then in the psql prompt, execute this to ensure the TCP connection works too.

    SELECT * FROM test_users

    If this fails, we can test TCP with trust to bypass password temporarily as follows:

  3. Edit \\wsl$\Debian\etc\postgresql\<ver>\main\pg_hba.conf to trust localhost TCP:

    # add these lines at the top of pg_hba.conf
    host  all  all  127.0.0.1/32  trust
    host  all  all  ::1/128      trust
    
  4. Restart Postgres: sudo service postgresql restart

  5. Retry the connection in Step 1 above

    psql -h 127.0.0.1 -U postgres -d imagesdb
    

    You should get a prompt without being asked for a password.

  6. At that prompt, repeat SELECT * FROM test_users. If that works, it confirms your TCP binding is fine and the md5/password has issues.

    If you want to avoid messing with the super-user password, create a dedicated user:

    sudo -u postgres createuser devuser --pwprompt
    sudo -u postgres createdb imagesdb -O devuser
    

    Then login with this user: psql -h 127.0.0.1 -U devuser -d imagesdb -W

Windows Connection via WSL Interop

  1. Open PowerShell.
  2. Prefix your psql call with wsl:
    wsl psql -h localhost -U postgres -d imagesdb -W
    
  3. Enter your password when prompted.
  4. Switch back to Windows VS Code.
  5. Use the same credentials/connection settings to connect (host=localhost, port=5432, user=postgres, password you set). For detailed steps, see this later section.

Install pgvector extension

Next we will enable the pgvector extension for vector similarity search. We’ll install the version matching our PostgreSQL server.

Direct install via APT

If your Linux distro’s stock repo ships pgvector, you can simply do the following to install it

  1. Confirm your major PostgreSQL version by

    psql --version | awk '{print $3}' | cut -d. -f1 .

    To get a more detailed view of versioning, use psql --version.

  2. Install the matching pgvector package:

    sudo apt update
    sudo apt install -y postgresql-<VERSION>-pgvector
    
  3. Switch to the postgres user and enable the extension in your database.

    \dx lists all the installed extensions, and we should see vector listed.

    sudo -u postgres psql
    CREATE EXTENSION vector;
    \dx
    

Pull from PGDG repo

Since Debian’s stock repos ship only PostgreSQL 15 core but not pgvector, we need to pull it from the PGDG (PostgreSQL Global Development Group) repo.

  1. Install wget : sudo apt install -y wget

  2. Import the PGDG signing key:

    wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo tee /etc/apt/trusted.gpg.d/pgdg.asc

  3. Find your Linux distro’s code name with

    grep VERSION_CODENAME /etc/os-release.

    For me, it’s bookworm which stands for Debian 12.

  4. Add the PGDG source list. Replace bookworm with the correct codename for your distro

    echo "deb http://apt.postgresql.org/pub/repos/apt/ \
      bookworm-pgdg main" \
      | sudo tee /etc/apt/sources.list.d/pgdg.list
    
  5. We can then follow step 2 above to install and enable the extension.

Manually build and install

If you’d rather not add an external repo, you can compile pgvector yourself:

  1. Install build tools and headers for Postgres:

    sudo apt install -y postgresql-server-dev-all build-essential git

  2. Clone, build and install pgvector:

    git clone https://github.com/pgvector/pgvector.git
    cd pgvector
    make
    sudo make install
    

    We can then repeat step 2 above to install and enable the extension.

Test the extension

Now let’s test the extension.

  1. In psql, run:

    SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';

    We should see pgvector and its version.

  2. Create a test table and insert a vector. If the query returns your row, pgvector is working!

    CREATE TABLE items(id serial PRIMARY KEY, embedding vector(3));
    INSERT INTO items (embedding) VALUES ('[1,2,3]');
    SELECT * FROM items ORDER BY embedding <-> '[2,1,3]' LIMIT 1;
    

Working with WSL from VS Code in Windows

Since Debian WSL runs its own Python installation isolated from Windows, any packages we’ve installed in Windows don’t automatically appear in the Linux environment.

However, my current development environment (VS Code, Python and packages) is in Windows, so I need to find a way to access our Postgres database in WSL. There are several ways:

  • reinstall everything in Debian
  • run Windows Python interpreter (and .py files) in WSL
  • connect to WSL Postgres from Windows, and use the usual dev environment

Reinstall Python & packages in WSL

This is tedious and adds footprint, but keeps Linux self-contained. It’s ideal if we want to dockerize our dev environment later.

Use Windows Python from within WSL

This is good when we need Linux file paths or tools but rely on our existing Python environment for installed packages.

Since WSL automatically adds Windows PATH to its own, we can run Windows executables by simply appending .exe. For example, from Debian shell, we can print the path of our Windows-based Python (for me, in a virtual environment):

/mnt/c/Users/k/CodeProjects/.venv/Scripts/python.exe -c "import sys; print(sys.executable)"

We can also list all the installed packages

 /mnt/c/Users/k/CodeProjects/.venv/Scripts/python.exe -m pip list

To confirm a particular package is available

python.exe -m pip list | grep pandas

If pandas shows up, the Windows environment is in use.

We can then run our python scripts like this:

/mnt/c/Users/k/CodeProjects/.venv/Scripts/python.exe test.py 

Setup shortcuts

Since I install all my packages in a virtual environment, it will be very tedious to type up the full path every time. We can create a shortcut by define a custom alias that points directly to the venv’s Python executable.

  1. In /home/username/.bashrc file, add this to the last line:

    alias winpy='/mnt/c/Users/k/CodeProjects/.venv/Scripts/python.exe'

  2. Reload: source ~/.bashrc

  3. Now we can simply run winpy abc.py to uses our .venv Python interpreter every time.

By placing that alias line in ~/.bashrc, Bash will load it on every new interactive WSL shell. This includes a terminal or from VS Code. So we won’t have to manually source it each time.

To double-check, open a fresh WSL window and run: type winpy. It should say:

winpy is aliased to /mnt/c/Users/k/CodeProjects/.venv/Scripts/python.exe

Connect to WSL Postgres from Windows

This is the more convenient approach as we can keep our existing Windows venv and all installed packages, do Python/Jupyter work in Windows VS Code, and hit our Debian‐WSL Postgres over localhost. We are not bound to only running .py scripts or using the Python interactive shell. And there’s no need to hop shells or reinstall anything!

In previous sections, we have setup Postgres to listening on TCP and switched to MD5 Password Authentication for local TCP.

We’ll now use a minimal Pyton script that connects to WSL Postgres, queries a test table, and prints its rows. We’ll target the default postgres database.

  1. We created a test_users table in a previous step. Confirm it lives in Postgres with the following in WSL (using the easiest: socket connection - peer auth mode). This prints out the rows.

    sudo -u postgres psql -d postgres -c "SELECT * FROM test_users;"
    
  2. Define a User Variable in Windows for our Postgres password, so we do not need to pass it as plain text in code for security and privacy.

    1. Press Win + S, type “Environment Variables” and open Edit the system environment variables.
    2. In System Properties click Environment Variables…
    3. Under User variables click New…
      • Variable name: PG_PASS
      • Variable value: YourPassword
    4. Click OK through all dialogs.
  3. Open VS Code in Windows

  4. Create a file test_users_query.py and temporarily print out the env variable:

    import os
    print("PG_PASS =", os.getenv("PG_PASS"))
    

    If it works, replace it with the following code.

    python
    import os
    import psycopg2
    
    def main():
       conn = psycopg2.connect(
          host="127.0.0.1",
          port=5432,
          dbname="postgres",
          user="postgres",
          password=os.getenv("PG_PASS")
       )
    
       cur = conn.cursor()
       cur.execute("SELECT id, name, age FROM test_users ORDER BY id;")
       rows = cur.fetchall()
       print("test_users contents:")
       for r in rows:
          print(f"  id={r[0]}, name={r[1]}, age={r[2]}")
    
       cur.close()
       conn.close()
    
    if __name__ == "__main__":
       main()
    

    Note that by default, if you pass host="localhost", psycopg2 may try both UNIX socket and IPv6. Instead, we explicitly use an IP address so it is forced to use an IPv4/TCP connection.

  5. Running this will pull PG_PASS from the global environment. We should see:

       test_users contents:
       id=1, name=Alice, age=30
       id=2, name=Bob,   age=25
       id=3, name=Carol, age=28
    

    With this, we have successfully verified the connection from our Windows dev environment to WSL Postgres, and queried an existing table.

Conclusion

In this blog post, I have illustrated how to setup PostgreSQL as both a relational and vector database in WSL. We perform connection testing to ensure everything works from VS Code Windows. In my next blog post, I am going to do multimodal search against PostgreSQL.