Every few weeks, a team comes to us wanting to add "AI search" to a product, convinced they need a dedicated vector database, a new ops surface, and a cross-system sync story. Sometimes that's true. Most of the time, they have a Postgres database that's doing fine, a few hundred thousand rows of content, and a real use case that pgvector handles without drama.
This is a hands-on walkthrough for shipping semantic search on Postgres with pgvector. We'll cover the schema, the embedding pipeline, the index choices that actually matter, hybrid search (combining semantic with keyword), and the production concerns — reindexing, versioning, cost — that most tutorials skip. If you have Postgres already running and fewer than a few million vectors, this is the path we recommend by default.
When pgvector is the right call
Before the code, the decision. pgvector is the right choice when:
- You already run Postgres and would rather add an extension than stand up a new service.
- Your vector corpus is under a few million rows. pgvector scales further, but the operational simplicity is its main win, and the scaling story gets harder past ~5M rows in a single table.
- You need transactional consistency between your content and your embeddings. (Updating a record and its vector in the same transaction is a very nice property.)
- You want to combine vector similarity with structured filters (`WHERE tenant_id = ...`) cleanly.
pgvector is probably the wrong call when: you're past ~10M vectors, your team is dedicated to a specialized stack, or you have a use case that genuinely needs features a vector-native database provides.
Setting up the extension
Assuming a modern Postgres (15+), enabling pgvector is one line:
CREATE EXTENSION IF NOT EXISTS vector;
Most managed Postgres providers support pgvector now — the major cloud providers, Supabase, Neon, and the rest. Check your provider's docs for the currently supported version. Version matters because HNSW indexing was added in 0.5 and has continued to improve since.
Schema — the part that matters most
A schema that holds up under real use:
CREATE TABLE documents (
  id          BIGSERIAL PRIMARY KEY,
  tenant_id   UUID NOT NULL,
  source_type TEXT NOT NULL,          -- 'article' | 'ticket' | 'policy' | ...
  source_id   TEXT NOT NULL,          -- stable external id
  title       TEXT,
  content     TEXT NOT NULL,
  metadata    JSONB NOT NULL DEFAULT '{}',
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE document_chunks (
  id              BIGSERIAL PRIMARY KEY,
  document_id     BIGINT NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
  tenant_id       UUID NOT NULL,
  chunk_index     INT NOT NULL,
  content         TEXT NOT NULL,
  token_count     INT NOT NULL,
  embedding       VECTOR(1536),       -- dimension depends on your model
  embedding_model TEXT NOT NULL,      -- e.g. 'text-embedding-3-small'
  created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
  UNIQUE (document_id, chunk_index)
);
CREATE INDEX ON document_chunks (tenant_id);
CREATE INDEX ON document_chunks (document_id);
A few decisions worth naming:
1. Chunks, not whole documents. You almost never want to embed an entire long document as one vector. You lose semantic precision — a 2,000-word policy doc compressed to one vector tells you the doc is about the topic, not that section 3.2 answers the specific question. Chunk it.
2. Store embedding_model. When you upgrade models (and you will), you need to know which rows are on which model. Without this column, you can't safely migrate.
3. Keep a tenant_id and index it. For any multi-tenant SaaS, partitioning queries by tenant is non-negotiable. We've seen teams add this in a panic three months after launch when a query started scanning all tenants' vectors. Bake it in on day one.
4. Metadata as JSONB. Store per-chunk facets you'll want to filter on — author, section, publish date, tags. You'll reach for these for hybrid filtering.
The embedding pipeline
The embedding pipeline is a boring, reliable job. Bias toward boring:
# Python pseudocode — use your language of choice
def ingest_document(doc_id: int, content: str, tenant_id: str):
    chunks = chunk_text(content, target_tokens=300, overlap=40)
    for idx, chunk in enumerate(chunks):
        embedding = embedding_client.embed(
            model="text-embedding-3-small",
            input=chunk.text,
        )
        db.execute(
            """
            INSERT INTO document_chunks
                (document_id, tenant_id, chunk_index, content, token_count,
                 embedding, embedding_model)
            VALUES (%s, %s, %s, %s, %s, %s, %s)
            ON CONFLICT (document_id, chunk_index) DO UPDATE SET
                content = EXCLUDED.content,
                token_count = EXCLUDED.token_count,
                embedding = EXCLUDED.embedding,
                embedding_model = EXCLUDED.embedding_model
            """,
            (doc_id, tenant_id, idx, chunk.text, chunk.token_count,
             embedding, "text-embedding-3-small"),
        )
A few notes:
- Batch the embedding calls. Most embedding APIs accept arrays. Sending one-at-a-time is slow and expensive.
- Idempotency matters. The `ON CONFLICT` clause lets you re-ingest safely without duplicating chunks.
- Handle failures as jobs, not in-request. Ingestion belongs in a queue, not in the HTTP handler that uploaded the document.
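To make the batching point concrete: slicing the chunk list before calling the API is usually all it takes. A sketch, assuming a hypothetical `embed_fn` that accepts a list of strings and returns one vector per input (most real embedding clients expose something similar):

```python
def batched(items, size):
    """Yield successive batches of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_chunks(chunks, embed_fn, batch_size=100):
    """Embed all chunks, `batch_size` at a time, preserving order.

    `embed_fn` is assumed to accept a list of strings and return
    a list of vectors, one per input, in the same order.
    """
    vectors = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors
```

Wrap `embed_fn` in retry logic before pointing it at a real API; transient failures are the norm, not the exception.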
Chunk sizes — what works
Target chunk size around 300–500 tokens with 40–80 tokens of overlap. Smaller chunks (100 tokens) are precise but lose context. Larger chunks (1,000+) capture context but blur the vector. The 300–500 range is a pragmatic default for most English prose.
For code, tables, or highly structured content, chunk on semantic boundaries (function boundaries, sections, table rows) rather than fixed token counts.
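A minimal sliding-window chunker over a token sequence shows the overlap mechanics. This is a sketch: real pipelines run on a tokenizer's output and respect sentence boundaries, and `chunk_tokens` here is a hypothetical helper, not a library function:

```python
def chunk_tokens(tokens, target=300, overlap=40):
    """Split a token sequence into windows of `target` tokens,
    where consecutive windows share `overlap` tokens of context."""
    if overlap >= target:
        raise ValueError("overlap must be smaller than target")
    step = target - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + target])
        if start + target >= len(tokens):
            break  # final window reached the end of the document
    return chunks
```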
The index — HNSW or IVFFlat
pgvector supports two index types. Pick HNSW unless you have a specific reason not to.
CREATE INDEX ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
HNSW (Hierarchical Navigable Small World) builds a graph of vectors. Queries walk the graph to find nearest neighbors. It's fast, accurate, and the build is incremental (new rows are added without a full rebuild). The tradeoff is memory — HNSW indexes can be large.
IVFFlat partitions the vector space into clusters. It's faster to build and uses less memory than HNSW, but recall can be lower and it doesn't handle new data as gracefully: the cluster centroids are fixed when the index is built, so heavy inserts afterward degrade recall until you reindex.
Parameters to know for HNSW:
- `m` — how many connections each node keeps. Default 16 is fine for most use cases.
- `ef_construction` — how thoroughly the graph is built. Higher = better index quality, slower build. 64 is a reasonable default; go higher if you have time to spare.
- `ef_search` — query-time parameter, set per session. Higher = more accurate, slower. Tune against your eval set.
SET hnsw.ef_search = 40;
Pick the right distance operator. vector_cosine_ops for cosine (what most embedding models recommend), vector_l2_ops for Euclidean, vector_ip_ops for inner product. Your embedding model's docs tell you which.
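A detail worth internalizing here: for unit-length vectors, cosine similarity is just the dot product, which is why models that return normalized embeddings let inner product and cosine produce the same ranking. A quick sanity check:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# For unit vectors the denominators are 1, so cosine == dot product.
```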
Querying — the simple version
The basic similarity query:
SELECT
id,
document_id,
content,
1 - (embedding <=> $1::vector) AS similarity
FROM document_chunks
WHERE tenant_id = $2
ORDER BY embedding <=> $1::vector
LIMIT 10;
A few things to notice:
- `<=>` is cosine distance; `1 - distance` gives you similarity as a score between 0 and 1.
- The `WHERE tenant_id = ...` filter is applied before the index walk. This is where having `tenant_id` indexed and selective matters.
- `LIMIT 10` is load-bearing — without it, you'll scan further than you need to.
Hybrid search — why you almost always want it
Pure semantic search has a real weakness: it finds semantically similar content but can miss exact-term queries. If someone searches for a specific error code, a product SKU, or a proper noun, embeddings often underweight the match relative to what a user expects.
The fix is hybrid search — combine semantic similarity with keyword matching (Postgres full-text search), rank the combined results.
WITH semantic AS (
  SELECT
    id,
    document_id,
    content,
    1 - (embedding <=> $1::vector) AS semantic_score
  FROM document_chunks
  WHERE tenant_id = $2
  ORDER BY embedding <=> $1::vector
  LIMIT 40
),
keyword AS (
  SELECT
    id,
    document_id,
    content,
    ts_rank_cd(
      to_tsvector('english', content),
      plainto_tsquery('english', $3)
    ) AS keyword_score
  FROM document_chunks
  WHERE tenant_id = $2
    AND to_tsvector('english', content) @@ plainto_tsquery('english', $3)
  ORDER BY keyword_score DESC
  LIMIT 40
)
SELECT
  COALESCE(s.id, k.id) AS id,
  COALESCE(s.content, k.content) AS content,
  COALESCE(s.semantic_score, 0) AS semantic_score,
  COALESCE(k.keyword_score, 0) AS keyword_score,
  (0.7 * COALESCE(s.semantic_score, 0))
    + (0.3 * COALESCE(k.keyword_score, 0)) AS combined_score
FROM semantic s
FULL OUTER JOIN keyword k ON s.id = k.id
ORDER BY combined_score DESC
LIMIT 10;
The 70/30 weighting is a starting point, not a truth. Tune it against your eval set. For technical or keyword-heavy corpora (documentation, codebases, product catalogs), you'll often want more like 50/50. For narrative content, lean harder on the semantic side.
For more sophisticated combining, look at Reciprocal Rank Fusion (RRF) — it's less sensitive to raw score magnitudes and tends to produce more stable rankings. For the first pass, a weighted sum is fine.
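For reference, RRF is only a few lines. A sketch over two (or more) ranked id lists; `k = 60` is the commonly used constant from the original RRF formulation:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank)
    for every item it ranks; items strong in either list float up."""
    scores = {}
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Feed it the id columns of the semantic and keyword result sets and take the top 10 of the fused list.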
Filtering — the stuff tutorials skip
Real queries have filters. "Find relevant docs in this tenant, published in the last year, in this category, not archived." pgvector handles this, but the query plan matters.
SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity
FROM document_chunks
WHERE tenant_id = $2
AND (metadata->>'category') = $3
AND created_at > now() - interval '1 year'
ORDER BY embedding <=> $1::vector
LIMIT 10;
Two patterns to know:
1. Pre-filtering. Postgres filters first, then vector-searches. Works well when the filter is very selective (narrows to a small number of rows). The HNSW index is used if the candidate set is still large enough.
2. Post-filtering. Vector-search first, then filter. Works well when filters are weakly selective but you only need a handful of final results. You typically implement this by raising the LIMIT on the inner search and filtering in an outer query.
Test both on your data. Real query planners make real decisions based on real statistics — check EXPLAIN ANALYZE rather than guessing.
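Post-filtering in application code amounts to over-fetching and trimming. A sketch, assuming `candidates` is the similarity-ordered result of a vector query run with a raised `LIMIT`:

```python
def post_filter(candidates, predicate, k):
    """Walk an over-fetched, similarity-ordered candidate list and
    keep the first `k` rows that pass the (weakly selective) filter."""
    kept = []
    for row in candidates:
        if predicate(row):
            kept.append(row)
            if len(kept) == k:
                break
    return kept
```

If you routinely exhaust the candidate list before reaching `k`, the filter is more selective than you thought; switch to pre-filtering.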
Production concerns
Reindexing when you change models
You will eventually upgrade your embedding model — the vendor releases a better one, your evals say the new one is sharper, or you change vendors. The migration path:
- Add a new column, don't replace the old one: `embedding_v2 VECTOR(1024)`.
- Run a background job to populate it.
- Keep the `embedding_model` column per row updated.
- Flip reads to the new column once migration is complete.
- Drop the old column.
Doing it in-place creates a window where some rows are on the old model and some on the new, and your similarity scores are meaningless during that window.
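The flip-reads decision reduces to one question: are any rows still on the old model? A toy status check, assuming you've fetched `(id, embedding_model)` pairs for the rows in scope:

```python
def migration_status(rows, new_model):
    """Report which rows still need re-embedding and whether it is
    safe to flip reads to the new embedding column."""
    pending = [row_id for row_id, model in rows if model != new_model]
    return {"pending": pending, "safe_to_flip": not pending}
```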
Cost
Embedding costs are dominated by two things: how much content you embed at ingestion, and how often you re-embed. Query-time costs (embedding the user's query) are typically a rounding error. Rough planning numbers for a mid-sized corpus (~500K chunks, ~300 tokens each, so roughly 150M tokens): on a commonly-priced small embedding model, the initial embedding run lands in the single-digit to low-tens-of-dollars range, and queries cost pennies per thousand. Re-embedding the full corpus is the expensive event — budget for it when you change models.
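Prices vary widely by model and vendor, so run the arithmetic for your own corpus. A back-of-envelope helper; the per-million-token price in the example is an illustrative assumption, not a quote:

```python
def embedding_cost(chunk_count, tokens_per_chunk, price_per_million_tokens):
    """Rough cost in dollars of one full embedding pass over the corpus."""
    total_tokens = chunk_count * tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million_tokens

# 500K chunks at ~300 tokens each is ~150M tokens; at an assumed
# $0.02 per 1M tokens, a full pass costs a few dollars.
corpus_cost = embedding_cost(500_000, 300, 0.02)
```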
Observability
The metrics we track on every production semantic search deployment:
- p50 and p95 query latency
- Recall@10 against a golden set (re-run on every schema or index change)
- Queries that returned zero results
- Queries whose top result had a similarity below a threshold (likely confused user or gap in the corpus)
That last one is gold for content-gap analysis. If users keep asking questions where the best match is a 0.45 similarity, you don't have a search problem — you have a missing-content problem.
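Recall@10 from the metrics list is cheap to compute once you maintain a golden set mapping queries to the chunk ids a human judged relevant:

```python
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of the golden relevant set that appears in the top-k."""
    if not relevant_ids:
        return 1.0  # nothing to find; trivially perfect
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)
```

Average this across the golden queries and re-run it on every schema, index, or model change.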
When to graduate off pgvector
Signals it's time to look at a dedicated vector store:
- You're past 5M vectors and queries are consistently slow even with tuned HNSW.
- You need features like per-document access control at the vector layer, or exotic index types.
- You have a dedicated ML team that would rather own a specialized stack.
Until then, the operational simplicity of "it's just Postgres" is worth more than the raw performance you might gain elsewhere. We've run production semantic search on pgvector for clients at scales where people told us we'd need something fancier, and haven't regretted it.
If you'd like help scoping or implementing semantic search — on pgvector or otherwise — that work lives inside our AI integration practice and our backend and cloud practice. Or get in touch directly and we can look at your specific corpus and use case.
Closing
Semantic search used to require a new database, a new team, and a new ops story. Now it requires a Postgres extension and a carefully designed schema. The decision isn't whether the technology is ready — it is. The decision is whether your specific use case benefits from it, and whether you can commit to the boring ops work of keeping embeddings fresh, indexes healthy, and evals running. If yes — pgvector is the cleanest path from zero to shipped we've found.
