In my previous blog posts, I have run image embedding similarity search using both a ChromaDB collection and a PostgreSQL database with the pgvector extension. This post will analyze key performance metrics (query latency, throughput, recall, MRR, and so on) to evaluate these two vector database engines. For pgvector, we will also benchmark performance with and without an index.

Vector similarity search, often referred to as “dense retrieval,” has become a foundational component in many AI applications. It identifies semantically similar items (e.g., text, images, or audio) based on their numerical vector representations (embeddings) in a high-dimensional vector space. In this vector space, items with similar meanings or characteristics are mapped to closer positions. We have already seen this in my previous two posts, which used vector search to rapidly identify images that carry similar content.
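
To make “closer positions” concrete, here is a minimal sketch (with made-up 4-dimensional vectors; real image embeddings typically have hundreds of dimensions) comparing Euclidean distances with numpy:

import numpy as np

# Toy embeddings (hypothetical values, for illustration only)
cat    = np.array([0.9, 0.1, 0.0, 0.2])
kitten = np.array([0.85, 0.15, 0.05, 0.25])
car    = np.array([0.1, 0.9, 0.8, 0.0])

def l2_distance(a, b):
    # Euclidean (L2) distance: smaller means more similar
    return np.linalg.norm(a - b)

print(l2_distance(cat, kitten))  # small distance -> semantically close
print(l2_distance(cat, car))     # larger distance -> semantically far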

ChromaDB vs. pgvector

ChromaDB is an open-source, developer-friendly vector database designed for efficient storage and retrieval of vector data. It is well-regarded for its minimal setup requirements and ease of use via a very simple and straightforward API, making it an excellent choice for prototyping and local development. ChromaDB can operate locally, leveraging in-memory operations for speed. It simplifies the embedding management process by automatically converting data into embeddings using specific embedding models or by allowing users to provide pre-computed embeddings.

pgvector is an open-source extension for PostgreSQL, enabling the storage and querying of vector embeddings directly within a traditional relational database. This integration allows users to leverage PostgreSQL’s robust features, including SQL compatibility, transactional support, and established data management capabilities, making it an attractive option for those already operating within the PostgreSQL ecosystem.

Key performance metrics

We will evaluate vector search performance by examining a system’s efficiency (latency, throughput) and effectiveness (accuracy) in retrieving relevant information.

Speed

query latency

Query latency is defined as the total time elapsed from the initiation of a single similarity search query until its results are fully returned.

Below are the latency metrics we will use to compare Chroma vs. pgvector:

| Metric | Meaning |
|:-------|:--------|
| avg_ms | Mean latency per query (in milliseconds). Sum of all query times divided by count. Gives a single overall sense of speed, but can be skewed by outliers. |
| p50_ms | Median latency (50th percentile). Half of queries are faster, half slower. |
| p95_ms | 95th percentile latency. 95% of queries complete within this time; measures high-tail performance. |
| p99_ms | 99th percentile latency. Captures the very slowest 1% of queries. |
| std_ms | Standard deviation of latencies (in ms). A low std means consistent performance; a high std means some queries are much slower or faster than the average. |

Both p95_ms & p99_ms are crucial for a thorough understanding of performance, especially worst‐case behavior. They provide a more robust and realistic view of performance under load than average latency alone.

throughput

Throughput, expressed as Queries Per Second (QPS), quantifies the maximum number of queries a system can process and return results per second.

A high QPS indicates that the system can efficiently manage a heavy workload without becoming a bottleneck, ensuring that the application remains responsive even during peak demand.

Throughput is typically assessed by executing a large volume of queries, either sequentially or concurrently, and then dividing the total number of queries by the total time taken for their execution.

When comparing engines:

  • If two systems have similar avg_ms but one has much higher p99_ms, the latter will suffer occasional spikes that could degrade user experience.
  • A higher QPS at similar latencies means a more efficient engine under load.
  • Monitoring std_ms alongside percentiles helps us spot instability even if the mean looks good.

Accuracy

Below are the accuracy metrics we will use to compare Chroma vs. pgvector:

| Metric | Meaning |
|:-------|:--------|
| Recall@K | Measures the fraction of all relevant items retrieved in the top-K results. If the ground truth has 10 nearest neighbors and we retrieve only 9 of them in the top 10, recall@10 = 9/10 = 0.9. Recall is crucial because it directly reflects the relevance and completeness of the search results. If Recall@K is low (< 90%), it may mean the index parameters are tuned too aggressively for speed. |
| Hit@K | Binary measure: 1 if at least one true neighbor is in the top-K, else 0. |
| MAP@K (Mean Average Precision) | Averages the precision scores at every rank where a true neighbor appears (not just the first), up to K. |
| nDCG@K (Normalized Discounted Cumulative Gain) | Weights early hits more heavily but still rewards multiple correct items. |
| MRR (Mean Reciprocal Rank) | Focuses on the rank of the first relevant item. If the very first result is always relevant, each reciprocal rank = 1/1 = 1.0, so the average MRR = 1.0, even if the remaining nine predictions are wrong. Use MRR when getting at least one correct answer immediately matters (e.g., QA systems, first-click satisfaction); use Recall@K when breadth of retrieval matters (e.g., recommendation systems, multi-item matching). |
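
To make the contrast between Recall@K and MRR concrete, here is a tiny hand-made example (hypothetical IDs, not from the benchmark): the first retrieved item is correct, so MRR is perfect, yet most of the true neighbors are missed, so Recall@5 is poor.

# Hypothetical toy example: ground truth vs. retrieved IDs for one query
truth = ["a", "b", "c", "d", "e"]   # the 5 true nearest neighbors
preds = ["a", "x", "y", "z", "w"]   # retrieved top-5: only the first is correct

recall_at_5 = len(set(preds) & set(truth)) / len(truth)              # 1/5 = 0.2
first_hit   = next(i for i, p in enumerate(preds, 1) if p in set(truth))
mrr         = 1.0 / first_hit                                        # 1/1 = 1.0

print(recall_at_5, mrr)  # 0.2 1.0 -- perfect MRR despite poor coverage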

Latency-Recall trade-off

A fundamental principle in vector search is the inherent compromise between speed (latency and throughput) and accuracy (recall). Generally, achieving higher recall comes at the cost of increased latency or reduced throughput, especially as systems approach near-perfect recall.

This trade-off is best understood by distinguishing between Exact Nearest Neighbor (ENN) and Approximate Nearest Neighbor (ANN) search algorithms:

  • Exact Nearest Neighbor (ENN): An ENN search algorithm performs a brute-force comparison of the query vector against every single vector in the entire dataset. This exhaustive method guarantees finding the true nearest neighbors, thereby achieving 100% recall. However, due to its computational intensity, it becomes prohibitively slow and resource-demanding for large datasets. An unindexed pgvector table employs this algorithm: a sequential scan computes the distance between the query vector and every vector row in the table (see the numpy sketch after this list). It can thus be used as the ground truth source for accuracy.
  • Approximate Nearest Neighbor (ANN): In contrast, ANN algorithms (such as HNSW, IVF, or LSH) are designed to sacrifice a small amount of recall in exchange for significantly faster search times. They achieve this by limiting the search scope, exploring only a subset of the vector space, or employing data structures that facilitate quicker approximate matches. These algorithms often expose tunable parameters (e.g., efSearch in HNSW, nprobe in IVF) that allow developers to fine-tune the balance between accuracy and speed according to application needs. Chroma uses ANN to achieve high performance and scalability.
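
For intuition, brute-force ENN is just an exhaustive distance computation. Below is a minimal numpy sketch (illustration only; the actual ground truth in this post comes from a sequential scan of the unindexed pgvector table), assuming an embeddings array of shape [N, D]:

import numpy as np

def exact_top_k(query, embeddings, k=10):
    # Compute the L2 distance from the query to every vector, then keep the k closest
    dists = np.linalg.norm(embeddings - query, axis=1)
    return np.argsort(dists)[:k]

# Example with random data standing in for real image embeddings
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 512))
query = embeddings[0]
print(exact_top_k(query, embeddings, k=5))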

With this ground truth in an unindexed table, we can then create a new table with identical content and an index that duplicates the default configuration of ChromaDB, giving us level ground for comparing both latency and accuracy.

Experimental design

Local environment

We will run this experiment with both ChromaDB and PostgreSQL with pgvector operating locally on the same machine. This co-located setup is critical as it minimizes external variables, such as network latency, which could otherwise obscure the direct comparison of the database systems’ intrinsic performance. While the absolute performance numbers (e.g., specific milliseconds or QPS values) will naturally depend on the local machine’s hardware specifications (CPU, RAM, SSD/HDD), the relative performance differences observed between ChromaDB and pgvector will remain indicative and generalizable across similar local environments.

Query data

To ensure that the benchmark results are representative of real-world usage, a set of query vectors will be randomly selected directly from the existing image embeddings in use.

For the speed test, we will use the precomputed image embeddings for the Unsplash Lite dataset, sampling query vectors with np.random.default_rng(seed) for reproducibility. The same queries will be run against the 3 data sources (Chroma, the unindexed pg table, the indexed pg table).

For the accuracy test, we will pull the query vectors from the unindexed PostgreSQL table and then search for their neighbors there to establish the ground truth.

A sufficient number of queries (e.g., 1,000 to 5,000 for latency and throughput measurements in multiple runs, and 50-100 for recall assessment) will be executed to ensure statistically significant average and percentile measurements.

Measurement

speed

Each similarity search query will be timed individually using time.perf_counter(). This function provides the highest available resolution timer in Python, making it ideal for benchmarking and accurately capturing the execution time of potentially very fast operations. Less precise timers, such as time.time(), could introduce significant measurement noise or even show zero elapsed time for very quick operations, leading to inaccurate or misleading performance comparisons.
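
To see the difference in timer granularity on a given machine, we can inspect both clocks via the standard library (the exact resolutions vary by platform):

import time

# Compare the resolution of the two clocks on this machine
print(time.get_clock_info("perf_counter"))  # typically very fine-grained, monotonic
print(time.get_clock_info("time"))          # wall-clock based, often coarser and adjustable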

Measurements will be collected for 10 runs of 100 and 1000 queries, and statistical aggregates including average, median, P95 & P99 latency, and QPS will be calculated to provide a robust understanding of the performance distribution.

accuracy

Recall accuracy will be assessed in this way:

  • Ground Truth Determination: For each query vector, the “true” nearest neighbors will be identified by performing a similarity search in the unindexed pgvector table with Exact Nearest Neighbor (ENN) search via a sequential scan of all vectors.
  • Recall@k Calculation: The top k results (identified by their unique IDs) returned by ChromaDB and by the indexed pgvector table will be compared against the ground-truth results for the same query vectors. Recall@k will then be calculated as the proportion of true top-k neighbors (from pgvector) that are present within ChromaDB’s and the indexed pgvector table’s top k results.
  • We will also evaluate multiple metrics (Recall@K, MRR, MAP@K, nDCG@K, Hit@1) to capture both “first hit” and “overall coverage.”

For this experiment, we will use K from 1 to 10. A small K (1–5) tests whether we get the closest matches right; a larger K (e.g., 50–100) would probe whether the index preserves deeper structure in the embedding space.

Benchmark code

shared components

We use a Python environment to conduct this benchmark. Among the library modules, chromadb is needed to interact with the ChromaDB collection, psycopg2 to connect to PostgreSQL and execute queries against pgvector, and numpy for numerical operations, handling embeddings, and calculating recall.
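
Since the absolute numbers depend on library versions as well as hardware, it may be worth recording the versions alongside the results; a small sketch:

import chromadb
import psycopg2
import numpy as np

# Record library versions for reproducibility of the benchmark
print("chromadb:", chromadb.__version__)
print("psycopg2:", psycopg2.__version__)
print("numpy:", np.__version__)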

pgvector with HNSW index

We will create a new table with the same precomputed embeddings and metadata, following the PostgreSQL post. For a fair comparison, we will create an index using Chroma's default values; we can query our existing collection to check those values by

import chromadb
client = chromadb.PersistentClient(path='.')
col = client.get_collection("unsplash-lite")
print(col.__dict__)

Specifically, we will build

  • an HNSW index
  • using Euclidean (ℓ₂) distance to calculate the distance between vector embeddings
  • m=16: max neighbors per node
  • ef_construction=100: a higher ef_construction and m connect more nodes together when building the HNSW graph, resulting in higher recall but a longer build time.
  • ef_search=100: controls the search breadth in the HNSW graph. A higher ef_search explores more nodes at query time before picking the top-K, which increases recall (and memory use) at the cost of latency.
CREATE INDEX ON images_index
USING hnsw (embedding vector_l2_ops)
WITH (
  m = 16,
  ef_construction = 100
);

Note that ef_search is not an index build parameter in pgvector (the WITH clause accepts only m and ef_construction); it is controlled by the hnsw.ef_search session variable, which defaults to 40. To match Chroma's configuration, we set it after establishing the connection and before issuing our nearest-neighbor queries:

SET hnsw.ef_search = 100;

or in code

cur.execute("SET hnsw.ef_search = %s;", (100,))

Speed

We first load all our libraries and connection constants, as well as test configurations.

import os
import time
import pickle
import numpy as np
import psycopg2
import chromadb 

# === Configuration ===
PG_DSN = dict(
    host="127.0.0.1",
    port=5432,
    dbname="imagesdb",
    user="postgres",
    password=os.getenv("PG_PASS")
)
PG_TABLE          = "images_index"
PG_ID_COL         = "photo_id"
PG_VECTOR_COL     = "embedding"

CHROMA_DB_PATH    = "."
CHROMA_COLLECTION = "unsplash-lite"

Q        = 100   # number of queries
K        = 10    # top-K results
SEED     = 42    # for reproducible sampling
N_RUNS   = 10    # repeat each benchmark this many times

Then we load our precomputed embeddings from the pickle file, and generate a random list of queries.

with open("u25emb.pkl", "rb") as f:
    img_names, img_emb = pickle.load(f)
    # img_emb: numpy array of shape [N, D]

rng    = np.random.default_rng(SEED)
indices = rng.choice(len(img_emb), size=Q, replace=False)
queries = [img_emb[i].tolist() for i in indices]

We now define our benchmark functions

def benchmark_chroma(queries, k, n_runs):
    client = chromadb.PersistentClient(path=CHROMA_DB_PATH)
    col = client.get_collection(CHROMA_COLLECTION)
    # one warm‐up query
    col.query(query_embeddings=[queries[0]], n_results=k)

    times = []
    for run in range(n_runs):
        for q in queries:
            t0 = time.perf_counter()
            col.query(query_embeddings=[q], n_results=k)
            times.append(time.perf_counter() - t0)
    return np.array(times)

def benchmark_pgvector(queries, k, n_runs, dsn, table):
    conn = psycopg2.connect(**dsn)
    conn.autocommit = True
    cur = conn.cursor()

    # one warm‐up query
    q0 = "[" + ",".join(map(str, queries[0])) + "]"
    cur.execute(
        f"SELECT {PG_ID_COL} FROM {table} "
        f"ORDER BY {PG_VECTOR_COL} <-> %s::vector LIMIT %s;",
        (q0, k),
    )
    cur.fetchall()

    times = []
    for run in range(n_runs):
        for q in queries:
            q_str = "[" + ",".join(map(str, q)) + "]"
            t0 = time.perf_counter()
            cur.execute(
                f"SELECT {PG_ID_COL} FROM {table} "
                f"ORDER BY {PG_VECTOR_COL} <-> %s::vector LIMIT %s;",
                (q_str, k),
            )
            cur.fetchall()
            times.append(time.perf_counter() - t0)

    conn.close()
    return np.array(times)

We compute our metrics, run the benchmarks, and summarize the results

# === Metrics computation ===
def compute_metrics(arr):
    return {
        "avg_ms": arr.mean()  * 1000,
        "p50_ms": np.percentile(arr, 50) * 1000,
        "p95_ms": np.percentile(arr, 95) * 1000,
        "p99_ms": np.percentile(arr, 99) * 1000,
        "std_ms": arr.std()  * 1000,
        "qps":    len(arr) / arr.sum()
    }

# === Run benchmarks ===
print(f"Running each engine {N_RUNS}× over {Q} queries…")
times_chroma = benchmark_chroma(queries, K, N_RUNS)
times_pg     = benchmark_pgvector(queries, K, N_RUNS, PG_DSN,"images_index")
times_pg_noidx     = benchmark_pgvector(queries, K, N_RUNS, PG_DSN,"images")

# === Summarize ===
metrics_chroma = compute_metrics(times_chroma)
metrics_pg     = compute_metrics(times_pg)
metrics_pg_noidx     = compute_metrics(times_pg_noidx)

engines = [
    ("Chroma",             metrics_chroma),
    ("pgvector – idx",     metrics_pg),
    ("pgvector – noidx",   metrics_pg_noidx),
]

print("\n| Engine             | avg_ms | p50_ms | p95_ms | p99_ms | std_ms |   QPS  |")
print("|:-------------------|-------:|-------:|-------:|-------:|-------:|-------:|")

for name, m in engines:
    print(
        f"| {name:<18} "
        f"| {m['avg_ms']:7.2f} "
        f"| {m['p50_ms']:7.2f} "
        f"| {m['p95_ms']:7.2f} "
        f"| {m['p99_ms']:7.2f} "
        f"| {m['std_ms']:7.2f} "
        f"| {m['qps']:7.1f} |"
    )

We get this output

Running each engine 10× over 100 queries…

| Engine             | avg_ms | p50_ms | p95_ms | p99_ms | std_ms |   QPS  |
|:-------------------|-------:|-------:|-------:|-------:|-------:|-------:|
| Chroma             |    1.64 |    1.26 |    3.46 |    5.65 |    0.99 |   609.9 |
| pgvector – idx     |    1.40 |    1.09 |    2.61 |    6.67 |    1.07 |   716.4 |
| pgvector – noidx   |   49.06 |   48.11 |   58.59 |   65.64 |    5.10 |    20.4 |

When we change to 1000 queries per run

Running each engine 10× over 1000 queries…

| Engine             | avg_ms | p50_ms | p95_ms | p99_ms | std_ms |   QPS  |
|:-------------------|-------:|-------:|-------:|-------:|-------:|-------:|
| Chroma             |    1.32 |    1.10 |    2.50 |    3.45 |    0.60 |   755.7 |
| pgvector – idx     |    1.54 |    1.15 |    2.77 |    5.72 |    2.88 |   648.5 |
| pgvector – noidx   |   51.38 |   48.27 |   67.41 |   91.16 |    9.01 |    19.5 |

interpreting results

  1. Warm-up effects

    Full warm-up (thousands of queries) is essential to see true latencies.

    • With only 100 queries per run, a much larger fraction of those queries still pay one-off cold costs (disk reads, query plan overhead, text parsing).
    • Bumping to 1000 queries per run pushes us further into a steady state, so the cold-query tail has less weight in our averages and percentiles.
  2. Chroma smooths out faster than pgvector

    • 100‐query run → 1000-query run

      • Chroma avg 1.64 ms → 1.32 ms, p99 5.65 → 3.45 ms
      • pgvector-idx avg 1.40 ms → 1.54 ms, p99 6.67 → 5.72 ms

      For long steady streams (1000+ queries), Chroma is both faster on average and more predictable. This can probably be attributed to Chroma keeping both the index and the embeddings in memory, eliminating disk seeks and serialization overhead, whereas Postgres incurs parsing, planning, and execution overhead for every SQL query.

  3. pgvector-idx vs. Chroma at scale

    In the long 1000-query runs, Chroma edges out pgvector-idx (1.32 ms vs. 1.54 ms avg), and its tail latencies (p95/p99) pull further ahead (2.50/3.45 ms vs. 2.77/5.72 ms).

    That suggests that at high QPS, Chroma’s internal in-memory graph traversal is a bit more consistent than Postgres’s HNSW implementation, which still sees occasional heavier probes or buffer cache churn.

  4. Unindexed sequential scans are not feasible

    pgvector no-idx sits at ~50 ms/query (QPS ~20) whether it’s 100 or 1000 queries. That’s exactly why we have to build an index.

  5. We should always include p95/p99 alongside avg and p50, as tails behave differently as we scale.

Accuracy

In this test, we use the original Postgres table without an index as the ground truth. We will pull 50 embeddings from this table and find the top 10 neighbors for each.

Then we will query both Chroma and the indexed postgres table with these 50 embeddings.

We will again begin by declaring constants and importing libraries.

import os
import pickle
import time
import ast
import numpy as np
import psycopg2
import chromadb

# Config
PG_HOST, PG_PORT  = "127.0.0.1", 5432
PG_DB,   PG_USER  = "imagesdb", "postgres"
PG_PASS           = os.getenv("PG_PASS", "")
GT_TABLE          = "images"          # ground truth table  
IDX_TABLE         = "images_index"    # HNSW-indexed table
PG_COL            = "embedding"
PG_ID_COL         = "photo_id"

CHROMA_PATH       = "."
CHROMA_COLLECTION = "unsplash-lite"

Q_SIZE, TOP_K     = 50, 10  

We will then connect to the three data sources and retrieve the query embeddings from the unindexed table.

# Connect to Postgres
conn = psycopg2.connect(
    host=PG_HOST, port=PG_PORT, dbname=PG_DB,
    user=PG_USER, password=PG_PASS
)
cur = conn.cursor()

# Fetch Q_SIZE embeddings from ground-truth table
cur.execute(f"SELECT {PG_ID_COL}, {PG_COL} FROM {GT_TABLE} LIMIT %s", (Q_SIZE,))
rows = cur.fetchall()
ids, raw_embs = zip(*rows)

# Parse once: raw_embs is like ['[0.1, 0.2, …]', …]
emb = np.array([ast.literal_eval(s) for s in raw_embs], dtype=float)
print("emb shape:", emb.shape, "dtype:", emb.dtype)

# Initialize Chroma collection 
chroma = chromadb.PersistentClient(path=CHROMA_PATH)
coll   = chroma.get_or_create_collection(CHROMA_COLLECTION)

We will then query against all 3 data sources and store their findings.

# Metrics storage
gt_preds, idx_preds, ch_preds = [], [], []

for q_id, q_vec in zip(ids, emb):
    pg_vec = q_vec.tolist()

    # --- Ground-truth from unindexed table (seq scan) ---
    cur.execute(
        f"SELECT {PG_ID_COL} FROM {GT_TABLE} "
        f"ORDER BY {PG_COL} <-> %s::vector LIMIT %s;",
        (pg_vec, TOP_K)
    )
    gt = [r[0] for r in cur.fetchall()]
    gt_preds.append(gt)

    # --- Indexed table search ---
    cur.execute(
        f"SELECT {PG_ID_COL} FROM {IDX_TABLE} "
        f"ORDER BY {PG_COL} <-> %s::vector LIMIT %s;",
        (pg_vec, TOP_K)
    )
    idx = [r[0] for r in cur.fetchall()]
    idx_preds.append(idx)

    # --- Chroma search ---
    resp = coll.query(query_embeddings=[q_vec], n_results=TOP_K)
    ch_preds.append(resp["ids"][0])

We compare Chroma’s and the indexed Postgres table’s findings with the ground truth to calculate various Recall, Hit, MAP, and nDCG metrics, as well as MRR.

import numpy as np

def recall_at_k(preds, truths, k):
    """
    preds, truths: lists of lists of IDs
    k: cutoff
    """
    scores = []
    for p, t in zip(preds, truths):
        retrieved = set(p[:k])
        relevant  = set(t[:k])
        scores.append(len(retrieved & relevant) / float(len(relevant)))
    return np.mean(scores)

def hit_at_k(preds, truths, k):
    """
    1 if at least one true ID in top-k, else 0.
    """
    hits = []
    for p, t in zip(preds, truths):
        hits.append(int(bool(set(p[:k]) & set(t[:k]))))
    return np.mean(hits)

def average_precision(pred, truth, k):
    """
    AP@k for a single query.
    """
    num_rel = 0
    score   = 0.0
    relevant = set(truth[:k])
    for i, pid in enumerate(pred[:k], start=1):
        if pid in relevant:
            num_rel += 1
            score += num_rel / i
    if not relevant:
        return 0.0
    return score / len(relevant)

def mean_average_precision(preds, truths, k):
    """MAP@k over all queries."""
    ap_scores = [average_precision(p, t, k) for p, t in zip(preds, truths)]
    return np.mean(ap_scores)

def dcg_at_k(pred, truth, k):
    """
    DCG@k: sum_{i=1..k} relevance_i / log2(i+1)
    where relevance_i is 1 if pred[i-1] in truth[:k] else 0.
    """
    rel = [1 if pid in set(truth[:k]) else 0 for pid in pred[:k]]
    return sum(r / np.log2(idx + 2) for idx, r in enumerate(rel))

def idcg_at_k(truth, k):
    """
    Ideal DCG@k: all relevant items at top.
    If there are R relevant items (len(truth[:k])), IDCG = sum_{i=1..R} 1/log2(i+1)
    """
    r = len(truth[:k])
    return sum(1.0 / np.log2(i + 2) for i in range(r))

def ndcg_at_k(preds, truths, k):
    """Mean nDCG@k over all queries."""
    scores = []
    for p, t in zip(preds, truths):
        dcg  = dcg_at_k(p, t, k)
        idcg = idcg_at_k(t, k)
        scores.append(dcg / idcg if idcg > 0 else 0.0)
    return np.mean(scores)

def mrr(preds, truths):
    """
    Mean Reciprocal Rank: 1 / (rank of first relevant item)
    """
    rr = []
    for p, t in zip(preds, truths):
        rank = next((i for i, pid in enumerate(p, start=1) if pid in set(t)), None)
        rr.append(1.0 / rank if rank else 0.0)
    return np.mean(rr)

def evaluate_all(preds, truths, ks=(1, 5, 10)):
    """
    Returns a dict of metrics for given cutoff ks.
    """
    out = {}
    for k in ks:
        out[f"Recall@{k}"] = recall_at_k(preds, truths, k)
        out[f"Hit@{k}"]    = hit_at_k(preds, truths, k)
        out[f"MAP@{k}"]    = mean_average_precision(preds, truths, k)
        out[f"nDCG@{k}"]   = ndcg_at_k(preds, truths, k)
    out["MRR"] = mrr(preds, truths)
    return out


metrics_idx   = evaluate_all(idx_preds, gt_preds, ks=(1,5,10))
metrics_chroma = evaluate_all(ch_preds, gt_preds, ks=(1,5,10))

import pandas as pd

# Build DataFrame: rows=metrics, cols=systems
df = pd.DataFrame(
    [metrics_idx, metrics_chroma],
    index=["Postgres IDX", "Chroma"]
).T

# Print GitHub-flavored Markdown
print(df.to_markdown(floatfmt=".3f"))

We get this result

|           |   Postgres IDX |   Chroma |
|:----------|---------------:|---------:|
| Recall@1  |          1.000 |    1.000 |
| Hit@1     |          1.000 |    1.000 |
| MAP@1     |          1.000 |    1.000 |
| nDCG@1    |          1.000 |    1.000 |
| Recall@5  |          1.000 |    0.944 |
| Hit@5     |          1.000 |    1.000 |
| MAP@5     |          1.000 |    0.931 |
| nDCG@5    |          1.000 |    0.957 |
| Recall@10 |          0.998 |    0.930 |
| Hit@10    |          1.000 |    1.000 |
| MAP@10    |          0.998 |    0.902 |
| nDCG@10   |          0.999 |    0.944 |
| MRR       |          1.000 |    1.000 |

interpreting results

  • At K=1, both systems are perfect. Every query’s top-1 result is correct.
  • By K=5, Chroma misses ~5.6% of the true neighbors (recall@5 = 0.944), though it still returns at least one correct item each time (Hit@5 = 1.0).
  • At K=10, Postgres retrieves almost all (recall@10 = 0.998) while Chroma finds about 93% of them.
  • MAP and nDCG mirror recall trends but also reflect how well the correct items are ordered: Chroma’s nDCG@10 = 0.944 shows its ranking is about 94.4% as good as ground truth.
  • With MRR = 1.0 for every single query, both engines’ #1 result was exactly the true nearest neighbor (the same as brute force). But Postgres recovers more of the remaining true neighbors across the subsequent ranks, as can be seen from its higher Recall@10; Chroma misses more of the other relevant neighbors in positions 2–10.

It’s worth noting that since we are using a small dataset (25k vectors) with a modest dimensionality (512) and default HNSW settings, the approximate search has more than enough connectivity to hit the true neighbors. That is why we are seeing extremely high accuracy in both engines.

Conclusion

This small experiment shows that the in-memory implementation of Chroma shines in latency and throughput. It still maintains very high accuracy, but is overtaken there by pgvector with an HNSW index replicating Chroma’s default configuration. It will be interesting to observe how tweaking the index for both engines, using concurrent loads, altering the warm-up and query volume, bumping up K to query deeper neighbors, or measuring memory consumption will alter the findings. That will be my next focus! Stay tuned! 😊