Build retrieval augmented generation without LangChain. Real RAG stack: vector DB, embedding API, reranking. Cognival's guide for founders and ops leads.
Most RAG tutorials show you how to build a toy. They glue together LangChain, Pinecone, and an OpenAI SDK, call it done, and ignore the 90% of production work that actually matters: embedding cost, retrieval latency, hallucination control, and fallback logic when your vector DB goes down. Here's how to architect RAG that scales without technical debt.
Framework-based RAG (LangChain, LlamaIndex) abstracts away the decisions that kill production systems. You don't see the cost modeling. You don't see the retrieval quality gaps. You don't see what happens when Pinecone has a 2am outage.
Most tutorials skip the real numbers. Embedding 1M documents (about 1,500 tokens each) at $0.02 per 1M tokens costs you $30 upfront, then roughly $0.1 per 1,000 queries at scale. That adds up. Retrieval quality isn't binary either. You need reranking, a chunk overlap strategy, and a BM25 fallback. Frameworks hide these levers, which means you inherit their defaults and wonder why your system hallucinates.
The gap is real: vector DB outages happen. Model latency spikes. Weak retrieval ranking causes hallucinations. Production RAG requires naming these failure modes and building guardrails around them.
Think of a RAG system built from scratch as three independent layers, not one monolith.
Layer 1: Ingestion and embedding. Deterministic chunking, batch processing, versioning. You move documents in, split them into chunks, embed them, and store the embeddings with metadata. This layer is boring and deterministic. It should work the same way every time.
Layer 2: Retrieval. Vector search plus BM25 hybrid, reranking with a cross-encoder, semantic caching. This layer takes a query, finds candidate chunks, reranks them, and returns the top-N results. Speed and accuracy matter here.
Layer 3: Generation with guardrails. Context window management, citation tracking, confidence scoring. This layer takes retrieved context and the user question, generates an answer, and either returns it or rejects it based on confidence thresholds.
Specific tools: Supabase pgvector for vectors, Cohere Rerank for re-ranking, Claude for generation. These choices aren't flashy, but they work at scale and don't disappear in 18 months.
Fixed-size chunks (e.g., 512 tokens) are lazy and lose semantic boundaries. A 512-token chunk might split a definition in half or carry two unrelated ideas together. Your retrieval suffers.
Implement semantic chunking instead. Split on sentence or paragraph breaks, then enforce a min/max token count (e.g., 200–600 tokens per chunk). This preserves meaning. Overlap strategy matters too. Use roughly 20% overlap: the last sentence or two of chunk N become the opening of chunk N+1. This reduces retrieval gaps when a question bridges two chunks.
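A minimal sketch of that chunking approach, under loud assumptions: `count_tokens` uses a whitespace word count as a stand-in for a real tokenizer, the sentence splitter is a naive regex, and overlap is expressed as a number of carried sentences rather than an exact 20%. All function names here are hypothetical.

```python
import re


def count_tokens(text: str) -> int:
    # Assumption: whitespace word count stands in for a real tokenizer.
    return len(text.split())


def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_text(text: str, min_tokens: int = 200, max_tokens: int = 600,
               overlap_sentences: int = 1) -> list[str]:
    chunks: list[list[str]] = []
    current: list[str] = []
    fresh = 0  # sentences in `current` that no emitted chunk contains yet
    for sentence in split_sentences(text):
        size = count_tokens(" ".join(current + [sentence]))
        if fresh > 0 and size > max_tokens:
            chunks.append(current)
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap_sentences:] if overlap_sentences else []
            fresh = 0
        # A single sentence longer than max_tokens is kept whole.
        current.append(sentence)
        fresh += 1
    if fresh:
        if chunks and count_tokens(" ".join(current)) < min_tokens:
            # Tail is too small: fold its new sentences into the previous
            # chunk (which may then exceed max_tokens slightly).
            carried = len(current) - fresh
            chunks[-1].extend(current[carried:])
        else:
            chunks.append(current)
    return [" ".join(c) for c in chunks]
```

Splitting on paragraphs first and only falling back to sentences inside oversized paragraphs would follow the same pattern; the sentence-only version is kept here for brevity.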
Store chunk metadata: source URL, creation date, section hierarchy. Better metadata means better reranking and citations. When your LLM returns an answer, you can point back to the exact source and timestamp.
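One way to carry that metadata alongside each chunk is a small record type. This is a sketch only; the `ChunkRecord` name, its fields, and the citation format are illustrative assumptions, not a required schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ChunkRecord:
    text: str
    source_url: str
    created: date
    section_path: tuple[str, ...]  # e.g. ("Policies", "Returns")

    def citation(self) -> str:
        # Human-readable pointer back to the exact source and date.
        section = " > ".join(self.section_path)
        return f"{self.source_url} ({section}, {self.created.isoformat()})"


record = ChunkRecord(
    text="Refunds are accepted within 30 days.",
    source_url="https://example.com/returns",
    created=date(2024, 1, 15),
    section_path=("Policies", "Returns"),
)
print(record.citation())
# prints: https://example.com/returns (Policies > Returns, 2024-01-15)
```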
Use an API-based embedding model (OpenAI, Cohere, or Hugging Face Inference) rather than self-hosted. Cost per embedding is 60–80% lower than most founders think. OpenAI charges $0.02 per 1M tokens. Cohere charges $1 per 1M embeddings. At scale, this matters.
Choose Supabase pgvector or Qdrant for storage. Both are production-solid and won't disappear in 18 months like specialized vector-DB startups. Avoid the trendy pure-vector databases; they're often VC-backed and fragile.
Store embeddings versioned. If you upgrade your embedding model, you need to know which version produced which vectors, because vectors from different models are not comparable. Versioning lets you re-embed incrementally and roll back, instead of scrambling to rebuild the whole index mid-production.
Calculate true cost: 1M documents at 1,500 average tokens equals 1.5B total tokens. At $0.02 per 1M tokens, ingestion costs $30. Querying at scale costs roughly $0.1 per 1,000 queries. Budget these from day one.
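A rough sketch of what version tracking can look like in application code. The `StoredVector` shape and the version strings are hypothetical; in pgvector this would more likely be a `model_version` column on the embeddings table.

```python
from dataclasses import dataclass


@dataclass
class StoredVector:
    chunk_id: str
    model_version: str  # hypothetical label, e.g. "embed-v1"
    vector: list[float]


def vectors_for_version(rows: list[StoredVector], version: str) -> list[StoredVector]:
    # Only search vectors produced by the model that embeds the query.
    return [r for r in rows if r.model_version == version]


def chunks_needing_reembed(rows: list[StoredVector], target_version: str) -> set[str]:
    # Chunks that have no vector yet under the target model version.
    done = {r.chunk_id for r in rows if r.model_version == target_version}
    return {r.chunk_id for r in rows} - done
```

With this in place, a migration can embed the missing chunks in batches and switch the query path to the new version only once `chunks_needing_reembed` returns an empty set.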
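The ingestion arithmetic above, written out so you can plug in your own corpus size and price. The function name is illustrative; the inputs are the figures from this guide.

```python
def embedding_ingestion_cost(num_docs: int, avg_tokens_per_doc: int,
                             usd_per_million_tokens: float) -> float:
    # Total tokens to embed, priced per million tokens.
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_million_tokens


cost = embedding_ingestion_cost(1_000_000, 1_500, 0.02)
print(f"${cost:.2f}")  # prints $30.00
```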
Semantic search alone (vector similarity) retrieves relevant-looking results that miss nuance. A query about "returns policy" might surface a doc about "policy framework" that scored high on embeddings but contains zero actionable information.
Add BM25 keyword search alongside vector search. Merge the top-K results from both methods, then rerank the merged set using a cross-encoder. Cohere Rerank costs $1 per 1M tokens and cuts hallucination dramatically.
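A sketch of the merge-then-rerank step. The two hit lists are assumed to come from your vector and BM25 searches, and `score` is a placeholder for whatever cross-encoder you call (Cohere Rerank or otherwise); its signature here is an assumption for illustration, not any vendor's API.

```python
from typing import Callable


def merge_candidates(vector_hits: list[str], bm25_hits: list[str],
                     top_k: int) -> list[str]:
    # Union of both top-K lists, de-duplicated, first occurrence wins.
    merged: list[str] = []
    seen: set[str] = set()
    for doc_id in vector_hits[:top_k] + bm25_hits[:top_k]:
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    return merged


def hybrid_retrieve(query: str, vector_hits: list[str], bm25_hits: list[str],
                    score: Callable[[str, str], float],
                    top_k: int = 20, top_n: int = 3) -> list[tuple[str, float]]:
    # Score every merged candidate against the query, best first.
    candidates = merge_candidates(vector_hits, bm25_hits, top_k)
    scored = [(doc_id, score(query, doc_id)) for doc_id in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

Because the reranker sees every merged candidate, the order of the two input lists only affects de-duplication, not the final ranking.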
Implement a confidence threshold. If the rerank score drops below 0.6, return "Unable to answer from available sources" rather than hallucinating. Users hate wrong answers more than no answers.
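The threshold check can be a few lines sitting between retrieval and generation. In this sketch, `generate` is a hypothetical callable wrapping your LLM call, and the 0.6 cutoff is the value suggested above; tune it against your own reranker's score distribution.

```python
from typing import Callable

REFUSAL = "Unable to answer from available sources"


def answer_or_refuse(reranked: list[tuple[str, float]],
                     generate: Callable[[list[str]], str],
                     min_score: float = 0.6) -> str:
    # Keep only chunks the reranker is reasonably confident about.
    confident = [chunk for chunk, score in reranked if score >= min_score]
    if not confident:
        # Nothing cleared the bar: refuse instead of guessing.
        return REFUSAL
    return generate(confident)
```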
Caching cuts costs by 40–60%. Store (query, top-3 retrieval results) pairs for 24 hours. Repeat questions hit the cache and skip embedding and reranking entirely.
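A minimal in-memory version of that cache, assuming exact-match lookups on a normalized query string. The `RetrievalCache` class is hypothetical; a production setup would more likely use Redis or a database table, and a true semantic cache would match on embedding similarity rather than text.

```python
import time
from typing import Callable, Optional


class RetrievalCache:
    def __init__(self, ttl_seconds: float = 24 * 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._store: dict[str, tuple[float, list[str]]] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Normalize case and whitespace so trivial variants share an entry.
        return " ".join(query.lower().split())

    def get(self, query: str) -> Optional[list[str]]:
        entry = self._store.get(self._key(query))
        if entry is None:
            return None
        stored_at, results = entry
        if self._clock() - stored_at > self._ttl:
            # Expired: drop the entry and report a miss.
            del self._store[self._key(query)]
            return None
        return results

    def put(self, query: str, results: list[str]) -> None:
        self._store[self._key(query)] = (self._clock(), results)
```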
Your vector DB will fail. Assume it.
Build a BM25-only fallback. Elasticsearch or even PostgreSQL full-text search works in a pinch. This layer of redundancy is not optional in production.
Log every retrieval failure: timeout, no results above threshold, DB down. Alert ops immediately. Implement graceful degradation. Don't return hallucinations. Return a cached response or a "connection unstable" message.
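A sketch of that degradation path. `primary` and `fallback` are hypothetical callables wrapping your vector search and your BM25 or full-text search; the logger name and the user-facing message are placeholders.

```python
import logging
from typing import Callable

logger = logging.getLogger("rag.retrieval")
Retriever = Callable[[str], list[str]]


def retrieve_with_fallback(query: str, primary: Retriever,
                           fallback: Retriever) -> tuple[list[str], str]:
    # Try vector search first, then keyword search; log every miss.
    for mode, retriever in (("vector", primary), ("bm25", fallback)):
        try:
            results = retriever(query)
        except Exception as exc:  # timeouts, connection errors, DB down
            logger.error("retrieval failed mode=%s error=%r", mode, exc)
            continue
        if results:
            return results, mode
        logger.warning("retrieval returned nothing mode=%s", mode)
    return [], "unavailable"


def respond(query: str, primary: Retriever, fallback: Retriever,
            generate: Callable[[str, list[str]], str]) -> str:
    results, mode = retrieve_with_fallback(query, primary, fallback)
    if mode == "unavailable":
        # Never generate without retrieved context.
        return "Connection unstable. Please try again shortly."
    return generate(query, results)
```

Returning the mode alongside the results lets you alert on how often traffic is served from the fallback, which is usually the first sign of trouble.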
Test this path in staging. Vector DB failure is not rare. Silently degrading to hallucination is worse than giving no answer.
Track retrieval precision: the percentage of top-5 results actually relevant to the question. Aim for 80%+.
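Precision over the top 5 is simple to compute once you have hand-labeled relevance judgments for a sample of queries. The sketch below divides by `k` even when fewer than `k` results come back, which penalizes short result lists; that choice is an assumption you may want to change.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    # Fraction of the top-k retrieved IDs that a human marked relevant.
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


print(precision_at_k(["a", "b", "c", "d", "e"], {"a", "b", "c", "e"}))  # prints 0.8
```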
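Measure hallucination rate by sampling 50 outputs per week and scoring them 0–5 on factual accuracy. Less than 2% hallucination rate means production-ready. Note that 2% of a 50-output sample is a single output, so in practice the bar is zero hallucinated answers in any given week's sample.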
Log context window saturation. If queries consistently hit 80% of your max tokens, you're truncating useful context and degrading quality.
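A small helper for that log, assuming you already record the prompt token count for each request. The function name and the sample numbers are illustrative.

```python
def saturated_share(prompt_token_counts: list[int], max_tokens: int,
                    threshold: float = 0.8) -> float:
    # Share of requests whose prompt used at least `threshold` of the window.
    if not prompt_token_counts:
        return 0.0
    saturated = sum(1 for n in prompt_token_counts if n / max_tokens >= threshold)
    return saturated / len(prompt_token_counts)


print(saturated_share([100, 900, 850, 400], max_tokens=1000))  # prints 0.5
```

If this share stays high week over week, consider retrieving fewer or shorter chunks before raising the window size.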
A/B test reranking on/off for a week. Measure user satisfaction and cost. Most teams see 15–25% accuracy improvement for a 2–5% cost increase. That trade-off is worth it.
Month 1: Ingest your data, embed it, store it in Supabase pgvector. Run semantic search queries manually to validate chunk quality. This is slow and painful. That's intentional. You're building the foundation.
Month 2: Wire up a reranker (Cohere or Jina). A/B test reranked versus non-reranked retrieval. Log confidence scores. You'll see which queries struggle and why.
Month 3: Build the generation layer with Claude API. Add system prompts for citation and tone. Implement confidence thresholds. Start testing on real queries.
Month 4: Stress-test failure modes. Simulate DB down, LLM rate limit, embedding API latency. Build fallbacks and monitoring. Only then go live.
This roadmap is boring. It's also how production systems survive past month 6.
You don't need a specialized vector database to start. Begin with PostgreSQL and pgvector. It's free, it scales to millions of vectors, and you're not adding a new vendor. Move to a specialized vector DB only when you hit 10M+ vectors or need sub-100ms latency at 1000+ QPS. Most teams never reach that threshold.
Frameworks abstract away the decisions that matter in production. You can't control chunking strategy, embedding versioning, reranking, or fallback logic. You inherit their defaults and their bugs. Building RAG without frameworks takes 2–3 weeks longer but gives you a system you can debug and control.
For 1M documents and 1,000 queries per day: embeddings cost $30 upfront plus $0.1 per 1,000 queries (about $0.10/day, or $3/month). Reranking adds $1 per 1M tokens queried ($1–2/day at scale). Claude API generation costs $0.003–0.015 per 1K tokens. Total: ~$100–150 per month for a real system. Add your own hosting for Supabase or Qdrant ($10–50/month). Budget $200–300/month for 1M documents and moderate traffic.
---
If you want to talk through applying this to your stack, book a strategy call at cognival.co/book.