Jordan: I got a question last week that I've been chewing on. Someone in the community asked — and this is almost word for word — "I built a RAG chatbot for a client's internal docs. It worked great in the demo. Two months later the client says it's giving wrong answers and they want out. What happened?"
Jordan: I'll tell you what happened. The same thing that happens to almost every private RAG deployment I've seen fail. The docs changed. The embeddings didn't.
Jordan: The client updated their return policy in January. They revised their onboarding guide in February. They added a whole new product line in March. And the chatbot was still answering questions based on October's versions of those documents. Confidently. With citations.
Jordan: Stale embeddings are bad enough. But this person had a second problem they didn't even know about. Their corpus was too wide. They'd dumped everything — HR policies, engineering specs, marketing copy, meeting notes — into one big vector store. So when a customer asked about shipping timelines, the retrieval was pulling chunks from an internal engineering standup because the word "delivery" appeared in both contexts.
Jordan: Wrong answers from stale data. Irrelevant answers from an unbounded corpus. And no check — nothing — between the retrieval step and the LLM call that could have caught either problem before the model started hallucinating with confidence.
Jordan: That's three failures. And they're all fixable with infrastructure you can set up in an afternoon.
Jordan: By the end of this episode you'll have a production Supabase pgvector RAG setup — bounded corpus, automatic re-embeds that fire when your source documents change, and retrieval guardrails that block the LLM call when the match isn't good enough. I'm walking through the schema, the indexes, the refresh pipeline in Make or n8n, and the exact SQL query pattern that gates bad retrievals before they become bad answers. I'm Jordan. This is Headcount Zero.
Jordan: So let's talk about why private RAG breaks. Not in theory — in the actual client projects you and I are shipping.
Jordan: The tutorials make it look clean. Ingest docs, chunk them, embed them, store the vectors, query with cosine similarity, pass the top results to the LLM. Done. And it does work — in the demo. With a small, static corpus. Where every document is relevant and nothing has changed since you embedded it.
Jordan: Production is different. In production, your client's Notion workspace has three hundred pages and forty of them are outdated drafts nobody deleted. Their Google Drive has duplicate versions of the same SOP with slightly different wording. And the documents that actually matter — the ones customers ask about — get updated every few weeks without anyone telling you.
Jordan: So your RAG system is doing exactly what you built it to do. It's finding the most semantically similar chunks. The problem is that "most similar" might be a chunk from a draft document that was never meant to be public, or a chunk from a version of the return policy that changed two months ago. The retrieval is working. The retrieval is just wrong.
Jordan: And the second failure — the unbounded corpus — is even sneakier. When you dump everything into one vector store, you're telling the system that every document is equally valid context for every question. That's almost never true. A client's support chatbot doesn't need access to their internal hiring docs. But if those docs are in the corpus and they happen to share vocabulary with a customer question, they'll show up in the results. And the LLM will use them. It doesn't know the difference.
Jordan: So the fix isn't a better model. It's not a bigger context window. It's three things: bound your corpus, keep your embeddings fresh, and check the retrieval before you send it to the LLM. That's the whole architecture. Everything else is implementation detail.
Jordan: Let me walk through the implementation.
Jordan: I'm building this on Supabase with pgvector. If you're already running Supabase for anything — auth, storage, Postgres — this is the move, because your vectors live in the same database as your application data. One connection string. One set of Row Level Security policies. No extra service to manage.
Jordan: First thing — enable the extensions. You need pgvector for the vector column type, pgcrypto for generating UUIDs and content hashes, and pg_cron for scheduled jobs. If you want the full list, there's a SQL block in the starter pack that enables everything in one shot. But those three are the ones that matter.
Jordan: Your schema has two core tables. Documents and chunks. Documents holds the source — the full text, a title, a source URL, and critically, a content hash. That hash is a SHA-256 of the raw text. It's how you know whether a document has changed since the last time you processed it.
Jordan: Chunks is where the embeddings live. Each chunk belongs to a document, has an ordinal position so you know the order, the chunk text itself, and a vector column. For OpenAI's text-embedding-3-small, that's a vector with fifteen thirty-six dimensions. Each chunk also gets its own content hash — same idea, different granularity.
Jordan: Now here's where people ask me about index choice, and honestly this is simpler than it sounds. pgvector gives you two approximate nearest neighbor index types — HNSW and IVFFlat. HNSW gives you better recall, meaning it finds more of the truly relevant results, but it uses more memory and takes longer to build. IVFFlat builds faster and uses less memory but you have to tune the number of lists and probes.
Jordan: For most of us — and I mean anyone with fewer than a few hundred thousand chunks — you don't even need an approximate index yet. Exact search works fine at that scale. I have a client with about eight thousand chunks and the query comes back in under fifty milliseconds without any approximate index at all.
Jordan: But when you do add one, start with HNSW. Set m to sixteen, efconstruction to sixty-four. Those are the pgvector README defaults and they're solid for corpora up to a few hundred thousand rows. You can tune efsearch later if latency matters — start around forty.
Jordan: The other index you absolutely need is a GIN index on a tsvector column. This is not for vector similarity — this is for keyword matching. Plain old Postgres full-text search. And it's the first layer of your retrieval guardrails, which I'll get to in a minute.
Jordan: Okay, this is the part that most RAG tutorials skip entirely, and it's the part that actually determines whether your system gives good answers or garbage.
Jordan: The retrieval query has three gates. Three checks that happen before any context gets sent to the LLM.
Jordan: Gate one — lexical overlap. Before you even touch the vector index, you run a tsvector keyword match against the query. This is your GIN index doing the work. If the user's question doesn't share any meaningful keywords with a chunk, that chunk gets filtered out. This is cheap — milliseconds — and it eliminates the "engineering standup matching on the word delivery" problem I described earlier. Semantic similarity alone can't catch that. Keywords can.
Jordan: Gate two — cosine distance cap. pgvector gives you the cosine distance operator — the "less than equals greater than" arrow — and you can set a maximum distance threshold. If the best matching chunk is still too far away from the query vector, you don't use it. Period.
Jordan: Now — and this is important — there is no universal magic number for this threshold. It depends on your corpus, your embedding model, your domain. I start at zero point two for cosine distance and then I log every query that gets rejected. If I'm rejecting too many legitimate queries, I loosen it to zero point two two. If I'm letting through garbage, I tighten to zero point one eight. The threshold is a tunable, not a constant. Treat it that way.
Jordan: Gate three — the fallback. If a query passes zero chunks through both gates, you do not send an empty context to the LLM and let it hallucinate. You return a structured "no confident match" response. Something like "I don't have enough information to answer that reliably. Here's where you might find it." This is the gate that protects your client's trust. A wrong answer is worse than no answer. Always.
Jordan: And by the way — OpenAI's embeddings are L2-normalized. That means cosine similarity and dot product give you identical rankings. So if you see someone arguing about which distance operator to use with OpenAI embeddings, the answer is it doesn't matter. Pick cosine for readability, move on.
Jordan: So you've got your schema, your indexes, your retrieval gates. Now the question is — what happens when the source documents change?
Jordan: This is where I see the most "we'll deal with it later" energy. And "later" means never. The embeddings go stale, the answers drift, and three months in your client is asking why the chatbot doesn't know about the new product line they launched in February.
Jordan: The fix is a two-layer refresh system. Layer one is change-driven. Every time a document gets re-ingested — whether that's through a webhook from Notion, a scheduled pull from Google Drive, whatever — you hash the new content and compare it to the stored content hash. If the hash is different, the document changed. You re-chunk it, and you enqueue only the chunks whose individual hashes changed for re-embedding.
Jordan: That "only the changed chunks" part matters. Because if a client updates one paragraph in a fifty-page document, you don't want to re-embed all two hundred chunks. You want to re-embed the three that actually changed. The content hash on each chunk makes this a simple comparison.
Jordan: Layer two is scheduled. A nightly pg_cron job sweeps for chunks that either have no embedding, have an embedding older than their last update timestamp, or haven't been re-embedded in thirty days. It enqueues them into an embed queue table, and an Edge Function drains that queue in batches — pulls the chunk text, calls the embedding API, writes the vector back.
Jordan: And the cost of this? Genuinely trivial. OpenAI's text-embedding-3-small runs two cents per million tokens. A thousand chunks at five hundred tokens each is half a million tokens. That's one cent. One cent to re-embed your entire corpus. Even if you ran a full refresh every single night — which you wouldn't, because the change-hash system means you're only re-embedding what actually changed — you're looking at maybe thirty cents a month for a typical client knowledge base.
Jordan: The Supabase docs actually publish an automatic embeddings pattern that uses this exact approach — Edge Functions, pg_cron, and a queue table. I've adapted it for the Make and n8n ingestion blueprints in the starter pack, but the core idea is straight from their engineering team: treat embeddings as derived data. When the source changes, the embedding refreshes. Automatically.
Jordan: Let me give you the numbers so you can quote a client confidently.
Jordan: Supabase Pro includes eight gigabytes of database disk per project. Additional disk is twelve and a half cents per gigabyte. Each vector row — at fifteen thirty-six dimensions — takes about six kilobytes in pgvector's vector type. That's before index overhead, but it gives you a clean baseline.
Jordan: So ten thousand chunks at six K each is roughly sixty megabytes of vector data. Add the index — call it another sixty to a hundred megs depending on HNSW parameters — and you're well under a gigabyte. You could run a hundred thousand chunks and still fit inside the included eight gigs with room for your application data.
Jordan: Egress — Supabase Pro includes two hundred fifty gigs of cached egress and two hundred fifty gigs of uncached per month per org. Overages are three cents per gig cached, nine cents uncached. For a RAG query endpoint serving a few hundred queries a day, you're nowhere near those limits.
Jordan: Storage beyond the included hundred gigs is about two cents per gig per month. Again — for a private knowledge base with a few thousand documents, you're not hitting this.
Jordan: The honest answer for most solo operators is that the Supabase bill for running a private RAG system is somewhere between twenty-five and fifty dollars a month on Pro. The embedding costs are pennies. The expensive part is your time setting it up — which is why the starter pack exists.
Jordan: I want to ground this in real deployments, not just my own projects.
Jordan: Berri AI — a Y Combinator company — migrated their production vector search from self-hosted pgvector on AWS RDS to Supabase Vector specifically to reduce operational overhead. Fewer services to manage, better tooling, same pgvector underneath. They were already running pgvector — they just wanted it managed.
Jordan: Firecrawl, which builds document search products, replaced Pinecone — a dedicated vector database — with Supabase pgvector. Their reasoning was cost and metadata handling. When your vectors live in Postgres, you get joins, you get RLS, you get all the relational tooling you already know. With a standalone vector DB, you're maintaining a separate data layer and syncing metadata between two systems.
Jordan: And on the academic side, the RCSB Protein Data Bank — the organization that maintains the global repository of protein structures — deployed a production help desk using PostgreSQL with pgvector and GPT-4.1-mini. This is a scientific support system handling depositor questions with citation-backed answers. If pgvector is production-ready for molecular biology, it can handle your client's FAQ bot.
Jordan: One more — anecdotal, but I love this one. Someone on the Supabase subreddit posted their HNSW config — m of twenty-four, efconstruction of sixty-four — and reported twenty-four hundred embeddings searched in forty milliseconds. With a weekly pgcron job re-ranking the results. That's a solo builder, not an engineering team. That's you.
Jordan: Okay. I've been making the case for Supabase pgvector RAG, and I believe in it. But I'd be doing you a disservice if I didn't tell you where it breaks down.
Jordan: The honest limitation is scale and complex filtering. There's a research paper on filtered approximate nearest neighbor search that documents how pgvector's query planner can choose suboptimal execution plans when you're combining vector similarity with heavy metadata filters. If you're filtering by tenant, by date range, by document type, and by access level all at once — and your corpus is in the millions of chunks — the planner might not pick the fastest path.
Jordan: HNSW indexes also consume real memory. At scale — we're talking millions of rows with high-dimensional vectors — the memory footprint matters. A dedicated vector database like Pinecone or Weaviate is purpose-built for this. They handle the index management, the sharding, the filtered search optimization at a level that Postgres wasn't originally designed for.
Jordan: And then there's the managed option — OpenAI's File Search, for example, removes the embedding lifecycle entirely. You upload files, they handle chunking, embedding, retrieval. Zero infrastructure on your side.
Jordan: So when do you move? Here's my decision gate. If your corpus is under a few hundred thousand chunks, you need per-tenant isolation, and you're already on Supabase — stay. The operational simplicity of one database, one set of RLS policies, one connection string is worth more than marginal latency improvements from a specialized store.
Jordan: If you're pushing into the millions of chunks, or your filtered queries are getting slow despite tuning iterative scans and probe counts — benchmark a dedicated store. But keep a migration path. Add an embedmodel and embedversion column to your chunks table now, so when you need to re-embed for a new model or a new store, you can do it incrementally without downtime. And if you want to be really cautious, build a dual-write adapter in your Edge Function — a feature flag that writes vectors to both Supabase and your target store simultaneously during migration.
Jordan: The point is — start with Postgres. Instrument it. Know your numbers. And have a plan for the day you outgrow it, even if that day never comes. For most solo operators building client-private Q and A systems, it won't.
Jordan: So — remember that question from the top? "I built a RAG chatbot. It worked in the demo. Two months later the client wants out." The answer was never the model. It was never the context window. It was three things the builder didn't set up: a bounded corpus so irrelevant docs can't contaminate the results, a refresh pipeline so embeddings stay current when the source material changes, and retrieval gates — lexical overlap plus a distance cap — so the system says "I don't know" instead of confidently making something up.
Jordan: That's the whole system. Supabase, pgvector, a GIN index for keyword probes, an embed queue with pg_cron, and a retrieval query that checks before it sends. If you want to build this without transcribing the episode, the Supabase RAG Starter Pack has the paste-ready SQL, the cron jobs, the Edge Function scaffold, and the Make and n8n ingestion blueprints. It's on the Resources page.
Jordan: Your one thing this week — go enable pgvector on a Supabase project and run the schema SQL. Don't build the whole pipeline yet. Just get the tables and indexes in place. That's roughly twenty minutes of work, and it's the foundation everything else plugs into.
Jordan: I'm Jordan. This is Headcount Zero. Go build something.