DesignKey Studio
Development
December 1, 2025
11 min read
By Daniel Killyevo

Semantic Search Implementation: Step-by-Step with pgvector

A production-minded walkthrough for shipping semantic search on Postgres with pgvector — schema, indexes, embedding pipeline, and hybrid retrieval.

pgvector · semantic-search · postgres · rag · embeddings

Every few weeks, a team comes to us wanting to add "AI search" to a product, convinced they need a dedicated vector database, a new ops surface, and a cross-system sync story. Sometimes that's true. Most of the time, they have a Postgres database that's doing fine, a few hundred thousand rows of content, and a real use case that pgvector handles without drama.

This is a hands-on walkthrough for shipping semantic search on Postgres with pgvector. We'll cover the schema, the embedding pipeline, the index choices that actually matter, hybrid search (combining semantic with keyword), and the production concerns — reindexing, versioning, cost — that most tutorials skip. If you have Postgres already running and fewer than a few million vectors, this is the path we recommend by default.

When pgvector is the right call

Before the code, the decision. pgvector is the right choice when:

  • You already run Postgres and would rather add an extension than stand up a new service.
  • Your vector corpus is under a few million rows. pgvector scales further, but the operational simplicity is its main win, and the scaling story gets harder past ~5M rows in a single table.
  • You need transactional consistency between your content and your embeddings. (Updating a record and its vector in the same transaction is a very nice property.)
  • You want to combine vector similarity with structured filters (WHERE tenant_id = ...) cleanly.

pgvector is probably the wrong call when you're past ~10M vectors, when your team already owns (and wants to keep owning) a specialized vector stack, or when your use case genuinely needs features only a vector-native database provides.

Setting up the extension

Assuming a modern Postgres (15+), enabling pgvector is one line:

CREATE EXTENSION IF NOT EXISTS vector;

Most managed Postgres providers support pgvector now — the major cloud providers, Supabase, Neon, and the rest. Check your provider's docs for the currently supported version. Version matters because HNSW indexing was added in 0.5 and has continued to improve since.

Schema — the part that matters most

A schema that holds up under real use:

CREATE TABLE documents (
  id          BIGSERIAL PRIMARY KEY,
  tenant_id   UUID NOT NULL,
  source_type TEXT NOT NULL,          -- 'article' | 'ticket' | 'policy' | ...
  source_id   TEXT NOT NULL,          -- stable external id
  title       TEXT,
  content     TEXT NOT NULL,
  metadata    JSONB NOT NULL DEFAULT '{}',
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE document_chunks (
  id             BIGSERIAL PRIMARY KEY,
  document_id    BIGINT NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
  tenant_id      UUID NOT NULL,
  chunk_index    INT NOT NULL,
  content        TEXT NOT NULL,
  token_count    INT NOT NULL,
  embedding      VECTOR(1536),          -- dimension depends on your model
  embedding_model TEXT NOT NULL,        -- e.g. 'text-embedding-3-small'
  created_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
  UNIQUE (document_id, chunk_index)
);

CREATE INDEX ON document_chunks (tenant_id);
CREATE INDEX ON document_chunks (document_id);

A few decisions worth naming:

1. Chunks, not whole documents. You almost never want to embed an entire long document as one vector. You lose semantic precision — a 2,000-word policy doc compressed to one vector tells you the doc is about the topic, not that section 3.2 answers the specific question. Chunk it.

2. Store embedding_model. When you upgrade models (and you will), you need to know which rows are on which model. Without this column, you can't safely migrate.

3. Keep a tenant_id and index it. For any multi-tenant SaaS, partitioning queries by tenant is non-negotiable. We've seen teams add this in a panic three months after launch when a query started scanning all tenants' vectors. Bake it in on day one.

4. Metadata as JSONB. Store per-chunk facets you'll want to filter on — author, section, publish date, tags. You'll reach for these for hybrid filtering.

The embedding pipeline

The embedding pipeline is a boring, reliable job. Bias toward boring:

# Python pseudocode — use your language of choice
def ingest_document(doc_id: int, content: str, tenant_id: str):
    chunks = chunk_text(content, target_tokens=300, overlap=40)

    # One batched API call for the whole document, not one call per chunk
    embeddings = embedding_client.embed(
        model="text-embedding-3-small",
        input=[chunk.text for chunk in chunks],
    )

    for idx, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
        db.execute(
            """
            INSERT INTO document_chunks
              (document_id, tenant_id, chunk_index, content, token_count,
               embedding, embedding_model)
            VALUES (%s, %s, %s, %s, %s, %s, %s)
            ON CONFLICT (document_id, chunk_index) DO UPDATE SET
              content = EXCLUDED.content,
              token_count = EXCLUDED.token_count,
              embedding = EXCLUDED.embedding,
              embedding_model = EXCLUDED.embedding_model
            """,
            (doc_id, tenant_id, idx, chunk.text, chunk.token_count,
             embedding, "text-embedding-3-small"),
        )

A few notes:

  • Batch the embedding calls. Most embedding APIs accept arrays. Sending one-at-a-time is slow and expensive.
  • Idempotency matters. The ON CONFLICT lets you re-ingest safely without duplicating chunks.
  • Handle failures as jobs, not in-request. Ingestion belongs in a queue, not in the HTTP handler that uploaded the document.
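Most embedding APIs also cap how many inputs one request may carry; the exact cap is provider-specific, so treat the batch size below as an assumption. A minimal sketch of grouping chunk texts before calling the API:

```python
def batched(items: list, size: int) -> list:
    """Split items into consecutive batches of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be >= 1")
    return [items[i:i + size] for i in range(0, len(items), size)]

# Hypothetical usage (256 is an assumed cap, not any vendor's limit):
# for batch in batched(chunk_texts, size=256):
#     embedding_client.embed(model="text-embedding-3-small", input=batch)
```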

Chunk sizes — what works

Target chunk size around 300–500 tokens with 40–80 tokens of overlap. Smaller chunks (100 tokens) are precise but lose context. Larger chunks (1,000+) capture context but blur the vector. The 300–500 range is a pragmatic default for most English prose.

For code, tables, or highly structured content, chunk on semantic boundaries (function boundaries, sections, table rows) rather than fixed token counts.
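The chunk_text helper used in the pipeline above isn't prescribed here. As one sketch of the sliding-window approach, using whitespace splitting as a crude stand-in for a real tokenizer (tiktoken or similar):

```python
def chunk_text(text: str, target_tokens: int = 300, overlap: int = 40) -> list[str]:
    """Split text into overlapping windows of roughly target_tokens tokens.

    Whitespace splitting is a rough token proxy; swap in a real
    tokenizer for production counts.
    """
    tokens = text.split()
    step = target_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + target_tokens]))
        if start + target_tokens >= len(tokens):
            break  # last window already reached the end of the text
    return chunks
```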

The index — HNSW or IVFFlat

pgvector supports two index types. Pick HNSW unless you have a specific reason not to.

CREATE INDEX ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

HNSW (Hierarchical Navigable Small World) builds a graph of vectors. Queries walk the graph to find nearest neighbors. It's fast, accurate, and the build is incremental (new rows are added without a full rebuild). The tradeoff is memory — HNSW indexes can be large.

IVFFlat partitions the vector space into clusters. The index is faster to build and uses less memory, but recall can be lower, and because the clusters are fixed at build time it doesn't adapt as gracefully to new data.

Parameters to know for HNSW:

  • m — how many connections each node keeps. Default 16 is fine for most use cases.
  • ef_construction — how thoroughly the graph is built. Higher = better index quality, slower build. 64 is a reasonable default; go higher if you have time to spare.
  • ef_search — query-time parameter, set per session. Higher = more accurate, slower. Tune against your eval set.
SET hnsw.ef_search = 40;

Pick the right distance operator. vector_cosine_ops for cosine (what most embedding models recommend), vector_l2_ops for Euclidean, vector_ip_ops for inner product. Your embedding model's docs tell you which.
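Tuning ef_search "against your eval set" presumes a metric; recall@k against a hand-labeled golden set is the usual one. A minimal sketch:

```python
def recall_at_k(retrieved_ids: list, relevant_ids: set, k: int = 10) -> float:
    """Fraction of the known-relevant ids that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)
```

Run it over a fixed set of (query, relevant ids) pairs at several ef_search values and pick the cheapest setting that holds your recall target.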

Querying — the simple version

The basic similarity query:

SELECT
  id,
  document_id,
  content,
  1 - (embedding <=> $1::vector) AS similarity
FROM document_chunks
WHERE tenant_id = $2
ORDER BY embedding <=> $1::vector
LIMIT 10;

A few things to notice:

  • <=> is cosine distance; 1 - distance gives you similarity as a score between 0 and 1.
  • The WHERE tenant_id = ... filter shapes the plan: depending on selectivity, Postgres may filter via the tenant_id btree index or discard non-matching candidates coming out of the HNSW walk. Either way, having tenant_id indexed and selective matters.
  • LIMIT 10 is load-bearing — without it, you'll scan further than you need to.

Hybrid search — why you almost always want it

Pure semantic search has a real weakness: it finds semantically similar content but can miss exact-term queries. If someone searches for a specific error code, a product SKU, or a proper noun, embeddings often underweight the match relative to what a user expects.

The fix is hybrid search — combine semantic similarity with keyword matching (Postgres full-text search), rank the combined results.

WITH semantic AS (
  SELECT
    id,
    document_id,
    content,
    1 - (embedding <=> $1::vector) AS semantic_score
  FROM document_chunks
  WHERE tenant_id = $2
  ORDER BY embedding <=> $1::vector
  LIMIT 40
),
keyword AS (
  SELECT
    id,
    document_id,
    content,
    ts_rank_cd(
      to_tsvector('english', content),
      plainto_tsquery('english', $3)
    ) AS keyword_score
  FROM document_chunks
  WHERE tenant_id = $2
    AND to_tsvector('english', content) @@ plainto_tsquery('english', $3)
  LIMIT 40
)
SELECT
  COALESCE(s.id, k.id) AS id,
  COALESCE(s.content, k.content) AS content,
  COALESCE(s.semantic_score, 0) AS semantic_score,
  COALESCE(k.keyword_score, 0) AS keyword_score,
  (0.7 * COALESCE(s.semantic_score, 0)) + (0.3 * COALESCE(k.keyword_score, 0)) AS combined_score
FROM semantic s
FULL OUTER JOIN keyword k ON s.id = k.id
ORDER BY combined_score DESC
LIMIT 10;

The 70/30 weighting is a starting point, not a truth. Tune it against your eval set. For technical or keyword-heavy corpora (documentation, codebases, product catalogs), you'll often want more like 50/50. For narrative content, lean harder on the semantic side.

For more sophisticated combining, look at Reciprocal Rank Fusion (RRF) — it's less sensitive to raw score magnitudes and tends to produce more stable rankings. For the first pass, a weighted sum is fine.
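For reference, RRF scores each result by summing 1/(k + rank) over every ranked list it appears in, with k ≈ 60 by convention. A sketch, assuming you've pulled the semantic and keyword results into ranked id lists:

```python
def rrf_fuse(rankings: list[list], k: int = 60) -> list:
    """Fuse several ranked id lists with Reciprocal Rank Fusion."""
    scores: dict = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks feed the score, a runaway ts_rank_cd value or a skewed similarity distribution can't dominate the fusion, which is the stability the weighted sum lacks.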

Filtering — the stuff tutorials skip

Real queries have filters. "Find relevant docs in this tenant, published in the last year, in this category, not archived." pgvector handles this, but the query plan matters.

SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity
FROM document_chunks
WHERE tenant_id = $2
  AND (metadata->>'category') = $3
  AND created_at > now() - interval '1 year'
ORDER BY embedding <=> $1::vector
LIMIT 10;

Two patterns to know:

1. Pre-filtering. Postgres filters first, then vector-searches. Works well when the filter is very selective (narrows to a small number of rows). The HNSW index is used if the candidate set is still large enough.

2. Post-filtering. Vector-search first, then filter. Works well when filters are weakly selective but you only need a handful of final results. You typically implement this by raising the LIMIT on the inner search and filtering in an outer query.

Test both on your data. Real query planners make real decisions based on real statistics — check EXPLAIN ANALYZE rather than guessing.
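The post-filtering pattern, sketched in application code: the caller oversamples the inner vector search (fetching, say, 4× the final limit; the factor is an assumption you'd tune), then filters and truncates:

```python
def post_filter(candidates: list[dict], predicate, limit: int = 10) -> list[dict]:
    """Keep candidates that pass the filter, truncated to the final limit.

    `candidates` should be oversampled, i.e. fetched with a LIMIT several
    times larger than `limit`, so enough rows survive the filter.
    """
    return [c for c in candidates if predicate(c)][:limit]
```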

Production concerns

Reindexing when you change models

You will eventually upgrade your embedding model — the vendor releases a better one, your evals say the new one is sharper, or you change vendors. The migration path:

  1. Add a new column, don't replace the old one: embedding_v2 VECTOR(1024).
  2. Run a background job to populate it.
  3. Keep the embedding_model column per row updated.
  4. Flip reads to the new column once migration is complete.
  5. Drop the old column.

Doing it in-place creates a window where some rows are on the old model and some on the new, and your similarity scores are meaningless during that window.

Cost

Embedding costs are dominated by two things: how much content you embed at ingestion, and how often you re-embed. Query-time costs (embedding the user's query) are typically a rounding error. For a mid-sized corpus (~500K chunks, ~300 tokens each), the initial embedding usually lands in the dollars-to-tens-of-dollars range at current small-model per-token prices; run the numbers against your vendor's current pricing rather than trusting a blog post, and queries after that cost pennies per thousand. Re-embedding the full corpus is the expensive event relative to everything else, so budget for it when you change models.
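The arithmetic is worth doing explicitly for your own corpus. A sketch; the per-million-token price here is a placeholder, not any vendor's actual rate:

```python
def embedding_cost_usd(num_chunks: int, tokens_per_chunk: int,
                       price_per_million_tokens: float) -> float:
    """Estimated one-time cost to embed the full corpus."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million_tokens

# Illustrative only: 500K chunks x 300 tokens = 150M tokens,
# priced at a placeholder $0.02 per 1M tokens
cost = embedding_cost_usd(num_chunks=500_000, tokens_per_chunk=300,
                          price_per_million_tokens=0.02)
```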

Observability

The metrics we track on every production semantic search deployment:

  • p50 and p95 query latency
  • Recall@10 against a golden set (re-run on every schema or index change)
  • Queries that returned zero results
  • Queries whose top result had a similarity below a threshold (likely confused user or gap in the corpus)

That last one is gold for content-gap analysis. If users keep asking questions where the best match is a 0.45 similarity, you don't have a search problem — you have a missing-content problem.

When to graduate off pgvector

Signals it's time to look at a dedicated vector store:

  • You're past 5M vectors and queries are consistently slow even with tuned HNSW.
  • You need features like per-document access control at the vector layer, or exotic index types.
  • You have a dedicated ML team that would rather own a specialized stack.

Until then, the operational simplicity of "it's just Postgres" is worth more than the raw performance you might gain elsewhere. We've run production semantic search on pgvector for clients at scales where people told us we'd need something fancier, and haven't regretted it.

If you'd like help scoping or implementing semantic search — on pgvector or otherwise — that work lives inside our AI integration and backend and cloud practices. Or get in touch directly and we can look at your specific corpus and use case.

Closing

Semantic search used to require a new database, a new team, and a new ops story. Now it requires a Postgres extension and a carefully designed schema. The decision isn't whether the technology is ready — it is. The decision is whether your specific use case benefits from it, and whether you can commit to the boring ops work of keeping embeddings fresh, indexes healthy, and evals running. If yes — pgvector is the cleanest path from zero to shipped we've found.


Author

Daniel Killyevo

Founder

Building cutting-edge software solutions for businesses worldwide.
