Agentic
RAG
Academy
Go beyond naive RAG. Master agentic retrieval patterns — from chunking and embeddings to self-correcting retrieval, query planning, and production-grade architecture.
RAG Fundamentals
Retrieval-Augmented Generation from first principles
Retrieval-Augmented Generation (RAG) is a technique that gives LLMs access to external knowledge at inference time. Instead of relying solely on what the model memorized during training, RAG retrieves relevant documents and injects them into the context window before generation.
The term was coined by Patrick Lewis et al. at Meta AI in 2020. The core insight: LLMs are powerful reasoning engines, but their parametric knowledge is static, incomplete, and prone to hallucination. RAG decouples knowledge storage (the retrieval index) from knowledge reasoning (the LLM), making both independently improvable.
Why Naive RAG Fails
Most teams start with “naive RAG” — embed documents, retrieve top-k, stuff into prompt, generate. This works for demos but fails in production for predictable reasons:
Poor chunking
Splits mid-sentence, destroys context, creates incoherent fragments
No relevance filtering
Returns all top-k results even when none are relevant, injecting noise
Missing reranking
First-stage vector search is fast but imprecise — wrong docs rank high
No faithfulness check
Model hallucinates beyond the retrieved context with no verification
The RAG Pipeline
Every RAG system follows three stages. The quality of each stage compounds — poor indexing guarantees poor retrieval, which guarantees poor generation.
Indexing
Split documents into chunks, generate embeddings for each chunk, and store them in a vector database with metadata.
- Load documents (PDFs, HTML, Markdown, databases)
- Split into semantically meaningful chunks
- Generate vector embeddings for each chunk
- Store vectors + metadata in a vector database
Retrieval
When a query arrives, embed it, search the vector database for similar chunks, and return the top-k most relevant results.
- Embed the user query into the same vector space
- Search the vector index for nearest neighbors
- Apply filters (metadata, relevance threshold)
- Optionally rerank results for precision
Generation
Inject the retrieved chunks into the LLM's context window alongside the user query, and generate a grounded response.
- Format retrieved chunks with source attribution
- Construct a prompt with context + query
- Generate a response grounded in the sources
- Optionally verify faithfulness of the answer
RAG vs. Fine-Tuning: When to Use Each
| Aspect | RAG | Fine-Tuning |
|---|---|---|
| Knowledge updates | Instant — update the index | Requires retraining the model |
| Source attribution | Natural — can cite sources | Not possible — knowledge is in weights |
| Cost | Per-query retrieval + generation | High upfront training, cheap inference |
| Hallucination | Reduced (but not eliminated) | Common — model confabulates |
| Best for | Factual QA, documentation, support | Style, format, behavior changes |
Rule of thumb: Use RAG when the model needs to access specific, updatable facts. Use fine-tuning when you need to change the model's behavior, style, or output format. In many production systems, you use both — fine-tune for behavior, RAG for knowledge.
Key Insight
RAG doesn't eliminate hallucination — it reduces it by grounding the model in retrieved evidence. The gap between a demo RAG system and a production RAG system is the difference between “it works on my 10 test questions” and “it works reliably on 10,000 diverse user queries.” Closing that gap requires systematic chunking, retrieval, evaluation, and iteration.
“RAG is not just a technique — it's a paradigm shift. Instead of cramming knowledge into model weights, you give the model a library card.”
Jerry Liu
Co-founder & CEO, LlamaIndex
“The biggest mistake teams make with RAG is treating retrieval as a solved problem. Your RAG system is only as good as the retrieval step.”
Harrison Chase
Co-founder & CEO, LangChain
“Vector search alone is not enough. The future of RAG is hybrid search combined with learned reranking — that's where the real accuracy gains come from.”
Bob van Luijt
Co-founder & CEO, Weaviate
Chunking Strategies
Semantic, recursive, agentic chunking
Chunking is the process of splitting documents into smaller pieces for indexing and retrieval. It is the most under-appreciated step in the RAG pipeline — and the one with the highest ROI when done right. Poor chunking creates “chunk soup” — incoherent fragments that poison retrieval quality.
Chunk Overlap: The Single Highest-ROI Improvement
Without overlap, information at chunk boundaries is split between two chunks. Neither chunk has the complete thought, so retrieval misses it. A 10-20% overlap (e.g., 100-200 tokens for 1000-token chunks) ensures boundary information appears in at least one complete chunk.
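The idea is simple enough to show in a few lines. Below is a minimal sketch of fixed-size chunking with overlap, using character counts for clarity; token-based sizing works the same way if you substitute a tokenizer's counts. The size and overlap defaults are illustrative, not recommendations.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Split `text` into chunks of `chunk_size` characters, each sharing
    `overlap` characters with the previous chunk so boundary information
    appears complete in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # avoid emitting a trailing chunk that is pure overlap
    return chunks
```

Each chunk's first `overlap` characters repeat the tail of the previous chunk, so a sentence that straddles a boundary is retrievable from at least one side.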
Chunking Strategies Compared
Fixed-Size Chunking
Split text into chunks of a fixed character or token count. The simplest approach, but the lowest quality.
Pros
- Fast and deterministic
- Easy to implement
- Predictable chunk sizes
Cons
- Splits mid-sentence and mid-paragraph
- Destroys semantic coherence
- No awareness of document structure
Recursive Character Splitting
Split on a hierarchy of separators (paragraphs > sentences > words), trying the largest separator first and falling back to smaller ones.
Pros
- Respects paragraph and sentence boundaries
- Configurable separator hierarchy
- Good balance of quality and simplicity
Cons
- Still uses a fixed target chunk size
- May not respect semantic boundaries perfectly
- Overlap can duplicate content
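The fallback logic above can be sketched as a short recursive function. This is a simplified illustration, not a specific library's implementation; the separator hierarchy and size limit are assumed defaults.

```python
def recursive_split(text: str, max_size: int = 500,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the largest separator first; recurse with smaller
    separators for any piece that still exceeds max_size."""
    if len(text) <= max_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard character split.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]
    sep, rest = separators[0], separators[1:]
    chunks, buffer = [], ""
    for piece in text.split(sep):
        candidate = buffer + sep + piece if buffer else piece
        if len(candidate) <= max_size:
            buffer = candidate  # keep packing pieces into the current chunk
        else:
            if buffer:
                chunks.append(buffer)
            buffer = ""
            if len(piece) > max_size:
                # Piece itself is too big: recurse with the next separator.
                chunks.extend(recursive_split(piece, max_size, rest))
            else:
                buffer = piece
    if buffer:
        chunks.append(buffer)
    return chunks
```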
Semantic Chunking
Use embeddings to find natural breakpoints. Embed each sentence, then split where the cosine similarity between consecutive sentences drops below a threshold.
Pros
- Chunks align with topic boundaries
- Variable chunk sizes that match content
- No arbitrary size limits
Cons
- Requires embedding every sentence (slow for large corpora)
- Needs a similarity threshold to tune
- More complex to implement
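As a sketch, the breakpoint logic looks like this. The `embed` argument is an assumed callable (any sentence-embedding model that returns a vector per string); the 0.7 threshold is a placeholder you would tune on your corpus.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.7) -> list[str]:
    """Group consecutive sentences; start a new chunk whenever similarity
    between neighboring sentence embeddings drops below `threshold`."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append(" ".join(current))  # similarity dropped: topic boundary
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```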
Agentic Chunking
Use an LLM to determine chunk boundaries. The model reads through the document and decides where semantically complete units begin and end.
Pros
- Highest semantic coherence
- Handles complex document structures
- Can add chunk-level summaries
Cons
- Expensive — requires LLM calls per document
- Slow for large corpora
- Non-deterministic chunking
Document Structure-Aware Chunking
Use document format-specific parsers (Markdown headers, HTML tags, PDF sections, code AST) to split at structural boundaries.
Pros
- Respects actual document structure
- Headers become natural chunk boundaries
- Preserves hierarchical context
Cons
- Requires format-specific parsers
- Sections may exceed target chunk size
- Doesn't work for unstructured text
Chunk Size vs. Retrieval Quality
Smaller chunks are more precise (match specific facts) but have lower recall (miss surrounding context). Larger chunks provide more context but are less precise. There is no universal optimum — it depends on your use case.
| Chunk Size | Precision | Recall | Best For |
|---|---|---|---|
| 128 tokens | High | Low | Exact fact lookup |
| 256 tokens | High | Medium | QA, chatbots |
| 512 tokens | Medium | Medium | General purpose |
| 1024 tokens | Medium | High | Summarization |
| 2048 tokens | Low | High | Document-level context |
Metadata Preservation
Every chunk should carry metadata: source document title, section header, page number, chunk index, and last updated date. Without metadata, you can't filter by recency, can't attribute sources, and can't implement document versioning. The parent-child pattern — index small chunks for precise retrieval but return the parent chunk for context — gives you the best of both worlds.
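The parent-child pattern can be sketched with two small functions. The in-memory dicts stand in for a vector store's payload/metadata; all names here are illustrative, not a specific database's API.

```python
def build_parent_child_index(parents: dict[str, str], child_size: int = 200):
    """Index small child chunks (what gets embedded) while recording
    each child's parent so retrieval hits can be widened for context."""
    children = {}          # child_id -> child text to embed
    child_to_parent = {}   # child_id -> parent_id (stored as metadata)
    for parent_id, text in parents.items():
        for i in range(0, len(text), child_size):
            child_id = f"{parent_id}::chunk{i // child_size}"
            children[child_id] = text[i:i + child_size]
            child_to_parent[child_id] = parent_id
    return children, child_to_parent

def resolve_to_parents(hit_ids: list[str], child_to_parent: dict[str, str],
                       parents: dict[str, str]) -> list[str]:
    """Map retrieval hits back to parent text, deduplicating hits that
    share a parent while preserving rank order."""
    seen, context = set(), []
    for child_id in hit_ids:
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:
            seen.add(parent_id)
            context.append(parents[parent_id])
    return context
```

Retrieval matches against the small children; generation receives the larger parents.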
Embeddings & Vector Databases
Choosing models and storage for your use case
Embeddings are the bridge between human-readable text and machine-searchable vectors. An embedding model converts a text chunk into a dense vector of floating-point numbers that captures its semantic meaning. Vector databases store and index these vectors for fast similarity search at scale.
Your choice of embedding model determines the quality of your vector space — and therefore the quality of retrieval. Your choice of vector database determines the operational characteristics: latency, scale, cost, and features.
Never mix embedding models in the same index. If you embed documents with Model A and query with Model B, the vector spaces won't align and similarity scores become meaningless. When you change embedding models, you must re-embed your entire corpus. Version your indices to enable rollback.
Embedding Models Compared
OpenAI text-embedding-3-large
~$0.13 per 1M tokens
Strengths
Best general-purpose model. Matryoshka support allows dimension reduction without re-embedding.
Trade-offs
Proprietary, requires API calls, cost scales with volume.
Cohere embed-v3
~$0.10 per 1M tokens
Strengths
Strong multilingual support. Built-in input types (search_document, search_query) improve retrieval.
Trade-offs
Proprietary. Slightly behind OpenAI on English-only benchmarks.
BGE-large-en-v1.5 (BAAI)
Free (self-hosted)
Strengths
Open-source. Competitive with proprietary models. Can be self-hosted for zero marginal cost.
Trade-offs
English-only. Requires GPU for fast inference. Slightly lower quality than commercial options.
Jina embeddings-v3
~$0.02 per 1M tokens
Strengths
Supports 8K token input. Task-specific adapters for retrieval, classification, separation.
Trade-offs
Newer model, less battle-tested in production. Requires API or self-hosting.
Nomic embed-text-v1.5
Free (self-hosted)
Strengths
Fully open-source with open training data. Long context (8K tokens). Matryoshka support.
Trade-offs
Lower dimensions may reduce precision for very large indices.
Choosing Embedding Dimensions
Higher dimensions capture more nuance but cost more to store and search. Modern models with Matryoshka embeddings (OpenAI, Nomic) let you truncate vectors to smaller dimensions after generation without re-embedding. This enables a key optimization:
Start with full dimensions, measure quality, then reduce until you find the smallest dimension that meets your accuracy requirements.
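In practice, truncation is just slicing off the leading dimensions and re-normalizing so cosine similarity still behaves. A minimal sketch (the exact supported dimensions depend on the model provider):

```python
import math

def truncate_embedding(vec: list[float], dim: int) -> list[float]:
    """Truncate a Matryoshka embedding to its first `dim` dimensions and
    re-normalize to unit length so similarity scores remain comparable."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head
```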
Vector Databases Compared
Pinecone
Strengths
Fully managed, serverless option, excellent scaling, hybrid search support.
Best For
Teams that want zero infrastructure management. Production workloads at scale.
Considerations
Vendor lock-in. No self-hosted option. Cost scales with stored vectors.
Weaviate
Strengths
Native hybrid search (dense + BM25), modular architecture, GraphQL API, multi-tenancy.
Best For
Hybrid search use cases. Teams wanting open-source with managed cloud option.
Considerations
Higher memory usage than some alternatives. Learning curve for module system.
Chroma
Strengths
Simplest API. In-memory for development, persistent for production. Python-native.
Best For
Prototyping and development. Small to medium datasets. Python-first teams.
Considerations
Less mature for very large-scale production. Fewer enterprise features.
Qdrant
Strengths
Rust-based (fast). Rich filtering. Payload storage. gRPC + REST APIs.
Best For
Performance-critical applications. Complex filtering requirements.
Considerations
Smaller community than Pinecone/Weaviate. Newer managed cloud offering.
pgvector (PostgreSQL)
Strengths
Uses existing Postgres infrastructure. ACID transactions. Familiar SQL interface.
Best For
Teams already on PostgreSQL who want vector search without a new database.
Considerations
Slower than purpose-built vector DBs at scale. Limited to 2000 dimensions. HNSW index tuning required.
Distance Metrics
Distance metrics define how similarity is measured between vectors. The choice of metric affects both accuracy and performance.
Cosine Similarity
Measures the angle between two vectors, ignoring magnitude. Most common for text embeddings because document length doesn't affect similarity.
Euclidean Distance (L2)
Measures the straight-line distance between two points in vector space. Sensitive to magnitude — longer documents have larger vectors.
Dot Product (Inner Product)
Measures both angle and magnitude. Higher values indicate greater similarity. Fastest to compute but affected by vector magnitude.
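The three metrics are a few lines of arithmetic each. A plain-Python sketch (vector databases implement these over optimized indices, not loops):

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    """Inner product: sensitive to both angle and magnitude."""
    return sum(x * y for x, y in zip(a, b))

def euclidean(a: list[float], b: list[float]) -> float:
    """L2 distance: straight-line distance, sensitive to magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Angle only: scaling a vector does not change the score."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0
```

Note that for unit-normalized vectors, dot product and cosine similarity coincide, which is why many systems normalize embeddings once at indexing time and then use the cheaper dot product.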
Key Insight
The most common mistake teams make is choosing an embedding model based on generic benchmarks (MTEB leaderboard) without testing on their own data. A model that ranks #1 on general benchmarks may rank #5 on your domain-specific queries. Always benchmark 2-3 models on a representative sample of your actual queries before committing.
Retrieval Strategies
Hybrid search, reranking, query expansion
Retrieval is the most impactful stage of the RAG pipeline. If you retrieve the wrong documents, no amount of prompt engineering will fix the generated answer. There are three fundamental retrieval approaches, and the best production systems combine them.
Dense Retrieval (Vector Search)
Encode queries and documents into dense vectors using embedding models, then find nearest neighbors in vector space. The core of modern RAG systems.
How It Works
1. Embed the query into the same vector space as the documents
2. Use Approximate Nearest Neighbor (ANN) algorithms (HNSW, IVF) for fast search
3. Return the top-k documents ranked by cosine similarity or dot product
Strengths
Captures semantic meaning — 'car' matches 'automobile'. Handles paraphrases and conceptual queries well.
Weaknesses
Poor at exact keyword matching ('error code E-4012'). Requires quality embedding models. ANN search is approximate, not exact.
Sparse Retrieval (BM25)
Traditional keyword-based retrieval using term frequency statistics. BM25 is the modern standard — it weighs term frequency, document length, and inverse document frequency.
How It Works
1. Tokenize the query and documents into terms
2. Score each document based on term overlap (the BM25 formula)
3. Rank by score — exact keyword matches rank highest
Strengths
Excellent for exact terms: error codes, product IDs, names, acronyms. Fast, deterministic, explainable. No embedding model needed.
Weaknesses
No semantic understanding — 'car' doesn't match 'automobile'. Misses paraphrases and conceptual queries entirely.
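For intuition, here is a compact BM25 scorer in plain Python. It is a teaching sketch over pre-tokenized documents; production systems compute the same formula against an inverted index (e.g., Elasticsearch/Lucene), and the `k1`/`b` defaults are the commonly used values, not tuned ones.

```python
import math
from collections import Counter

def bm25_scores(query_terms: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized document against the query using BM25:
    term frequency, document length, and inverse document frequency."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    freqs = [Counter(d) for d in docs]
    scores = [0.0] * N
    for term in query_terms:
        df = sum(1 for f in freqs if term in f)  # docs containing the term
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        for i, f in enumerate(freqs):
            tf = f[term]
            dl = len(docs[i])
            # Saturating tf, normalized by document length relative to average
            scores[i] += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return scores
```

Note how an exact token like an error code dominates the score: rare terms get high IDF, which is exactly where dense retrieval is weakest.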
Hybrid Search
Combine dense and sparse retrieval to get the best of both worlds: semantic understanding from vectors AND exact keyword matching from BM25.
How It Works
1. Run dense retrieval (vector search) to get ranked list A
2. Run sparse retrieval (BM25) to get ranked list B
3. Fuse the two lists using Reciprocal Rank Fusion (RRF) or weighted scoring
4. Return the merged, fused results
Strengths
Consistently outperforms either approach alone in benchmarks. Catches both semantic and keyword matches. Handles diverse query types.
Weaknesses
Requires maintaining two indices (vector + inverted). Slightly higher latency. Need to tune fusion weights.
Reciprocal Rank Fusion (RRF)
RRF merges two ranked lists into a single list by scoring each document based on its rank in both lists. It's the standard fusion algorithm for hybrid search because it's simple, effective, and doesn't require score calibration between the two retrieval methods.
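RRF is only a few lines: each document scores the sum of 1/(k + rank) over every list it appears in, where k (conventionally 60) damps the influence of top ranks.

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc IDs. Documents appearing high in multiple
    lists accumulate the largest scores; no score calibration needed."""
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked #1 by dense search and #2 by BM25 outranks a document that appears in only one list, which is the behavior you want from fusion.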
Reranking: The Precision Pass
First-stage retrieval (vector search, BM25) is optimized for speed — scanning millions of documents in milliseconds. But speed comes at the cost of precision. A reranker is a cross-encoder model that reads the query and each document together, producing a much more accurate relevance score. The pattern: retrieve broadly (top-20), rerank for precision (top-5).
Cohere Rerank
APICommercial cross-encoder reranker. Supports 100+ languages. Simple API: pass query + documents, get reranked scores.
~25% precision improvement over vector search alone
ColBERT (v2)
Open-sourceLate interaction model — encodes query and document tokens separately, then computes fine-grained similarity. Faster than traditional cross-encoders.
Near cross-encoder quality at 100x the speed
BGE Reranker (BAAI)
Open-sourceOpen-source cross-encoder reranker. Multiple sizes (small, base, large) for speed/quality tradeoff. Can be self-hosted.
Competitive with commercial options on English benchmarks
FlashRank
Open-sourceUltra-lightweight reranker designed for low-latency applications. Under 100MB model size. Runs on CPU.
Lower quality but <10ms latency. Good for real-time applications.
Advanced Retrieval Techniques
Maximal Marginal Relevance (MMR)
Balances relevance with diversity. Without MMR, your top-5 results might be five near-identical chunks about the same subtopic, missing other relevant information.
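The MMR selection loop, sketched over precomputed similarities (a real system would compute these from embeddings; the 0.7 lambda is an assumed default that trades relevance against diversity):

```python
def mmr(query_sim: list[float], doc_sims: list[list[float]],
        lambda_param: float = 0.7, top_n: int = 5) -> list[int]:
    """Greedy Maximal Marginal Relevance.
    query_sim[i]   : similarity of doc i to the query.
    doc_sims[i][j] : similarity between docs i and j.
    Returns indices of selected docs, balancing relevance and novelty."""
    candidates = list(range(len(query_sim)))
    selected: list[int] = []
    while candidates and len(selected) < top_n:
        def mmr_score(i: int) -> float:
            # Penalize similarity to anything already selected
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lambda_param * query_sim[i] - (1 - lambda_param) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```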
Contextual Compression
After retrieval, extract only the relevant sentences from each chunk. A 500-token chunk might contain only 50 tokens of information relevant to the query.
Multi-Query Retrieval
Generate multiple reformulations of the user's query, retrieve for each, and merge results. Captures different aspects of the query's intent.
Parent Document Retrieval
Index small chunks for precise matching, but return the larger parent chunk (or full document section) for generation context.
Key Insight
The retrieval pipeline should be: hybrid search (cast a wide net) → reranking (sort by true relevance) → MMR diversity (avoid redundancy) → relevance threshold (drop irrelevant results). Each layer tightens precision. Skipping a layer is among the most common causes of poor RAG quality in production.
Agentic RAG Patterns
Self-RAG, corrective RAG, adaptive retrieval
Agentic RAG moves beyond the fixed retrieve-then-generate pipeline. Instead of blindly retrieving for every query, agentic patterns let the system decide when to retrieve, what to retrieve, and whether the retrieved context is good enough to generate a faithful answer. These patterns are the difference between demo-quality and production-quality RAG.
The Evolution of RAG
Naive RAG
Retrieve, stuff, generate
Advanced RAG
Hybrid search, reranking, filtering
Agentic RAG
Self-correcting, multi-step, adaptive
Self-RAG
Self-RAG trains the model to emit special reflection tokens that control the retrieval-generation loop. The model decides IF it needs retrieval, evaluates document relevance, generates with grounding, and self-assesses whether the answer is supported by the sources.
How It Works
1. Given a query, the model emits a [RETRIEVE] or [NO_RETRIEVE] token to decide if external knowledge is needed
2. If retrieving, the model evaluates each document with [RELEVANT] or [IRRELEVANT] tokens
3. The model generates a response using only the relevant documents
4. A [SUPPORTED] or [NOT_SUPPORTED] self-assessment token verifies grounding
5. If not supported, the pipeline retries with refined retrieval
Key Benefit
30-50% reduction in hallucination compared to naive RAG. The model learns when NOT to retrieve, avoiding unnecessary latency.
Trade-off
Requires fine-tuning the model to emit reflection tokens, or prompt engineering to simulate them. Adds 2-3 LLM calls per query.
Corrective RAG (CRAG)
CRAG adds a lightweight retrieval evaluator between the retrieval and generation stages. It scores document relevance and triggers one of three actions: use the documents (correct), refine the knowledge (ambiguous), or fall back to web search (incorrect).
How It Works
1. Retrieve documents using the standard RAG pipeline
2. A retrieval evaluator scores each document's relevance to the query
3. If confidence is HIGH: use the documents directly for generation
4. If confidence is MEDIUM: extract key sentences via knowledge refinement
5. If confidence is LOW: discard the documents and fall back to web search
Key Benefit
Prevents the model from generating answers based on irrelevant context. The web search fallback handles knowledge gaps gracefully.
Trade-off
Adds latency for the evaluation step. Web search fallback requires internet access and introduces external dependency.
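The corrective branch is essentially a three-way switch on evaluator confidence. A control-flow sketch, where `evaluate`, `refine`, and `web_search` are assumed callables (not a real library API) and the thresholds are illustrative:

```python
def corrective_rag(query: str, documents: list[str], evaluate, refine, web_search,
                   high: float = 0.7, low: float = 0.3):
    """Route on the evaluator's confidence:
    CORRECT -> use docs, AMBIGUOUS -> refine, INCORRECT -> web fallback."""
    scores = [evaluate(query, doc) for doc in documents]
    confidence = max(scores, default=0.0)
    if confidence >= high:
        return documents                              # CORRECT: use as-is
    if confidence >= low:
        # AMBIGUOUS: keep only refined key sentences from plausible docs
        return [refine(query, doc) for doc, s in zip(documents, scores) if s >= low]
    return web_search(query)                          # INCORRECT: fall back
```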
Adaptive RAG
Adaptive RAG classifies incoming queries by complexity and routes them to different retrieval strategies. Simple factual questions skip retrieval entirely, moderate questions use single-step RAG, and complex questions trigger multi-step agentic retrieval.
How It Works
1. A query complexity classifier categorizes the question (simple / moderate / complex)
2. Simple queries (e.g., 'What year was X founded?') go directly to the LLM — no retrieval needed
3. Moderate queries use standard single-step RAG with reranking
4. Complex queries trigger iterative multi-step retrieval with query decomposition
5. The system dynamically allocates compute based on query difficulty
Key Benefit
Reduces average latency by 40-60% by avoiding unnecessary retrieval for simple questions. Allocates more compute to questions that need it.
Trade-off
Requires training or prompt-engineering a reliable query classifier. Misclassification can send complex queries down the simple path.
Agentic RAG (Retrieval as a Tool)
The agent treats retrieval as a tool it can call. Instead of a fixed pipeline, the LLM decides when to search, what to search for, and when it has enough information to answer. The retriever is one tool among many.
How It Works
1. The agent has access to tools: vector_search, web_search, sql_query, knowledge_graph, etc.
2. Given a query, the agent reasons about what information it needs
3. It calls the appropriate retrieval tool with a refined search query
4. It evaluates the results and decides if more retrieval is needed
5. It can chain multiple retrieval calls, combining results before answering
Key Benefit
Maximum flexibility — the agent can combine multiple data sources, refine queries iteratively, and handle novel question types without pipeline changes.
Trade-off
More expensive (multiple LLM calls). Harder to debug and evaluate. Requires good tool descriptions and reliable function calling.
Pattern Comparison
| Pattern | Retrieval | Evaluation | Correction | Latency | Quality |
|---|---|---|---|---|---|
| Naive RAG | Always, single-step | None | None | Low | Low-Medium |
| Self-RAG | Conditional (model decides) | Relevance + faithfulness tokens | Retry with rephrased query | Medium-High | High |
| CRAG | Always, single-step | External evaluator | Web search fallback | Medium | High |
| Adaptive RAG | Complexity-based routing | Query classifier | Route to correct pipeline | Low-High (varies) | High |
| Agentic RAG | Multi-step, tool-based | Agent reasoning | Iterative refinement | High | Highest |
Key Insight
Start with naive RAG, measure with RAGAS, and add agentic layers only when evaluation shows they're needed. Self-RAG is worth the complexity for high-stakes applications (medical, legal, financial). CRAG is the best bang-for-buck upgrade for most production systems. Adaptive RAG pays off when your query distribution has high variance in complexity.
Query Routing & Planning
Multi-step retrieval with query decomposition
Not every query should hit the same retrieval pipeline. Query routing directs queries to the right retrieval strategy, while query decomposition breaks complex questions into focused sub-queries for better retrieval. Together, they form the “planning” layer of agentic RAG.
Routing Patterns
LLM Query Classification
Classify the incoming query into categories and route each category to a specialized retrieval pipeline. The simplest and most reliable routing approach.
How It Works
1. Define query categories: factual, analytical, comparison, how-to, opinion
2. Use an LLM or a lightweight classifier to categorize the incoming query
3. Route each category to its optimal retrieval strategy
4. Factual queries use vector search; analytical queries use SQL; comparisons use multi-query retrieval
Data Source Routing
When your knowledge spans multiple backends (vector store, SQL database, knowledge graph, API), route the query to the appropriate data source based on the type of information needed.
How It Works
1. Register the available data sources with their descriptions and capabilities
2. An LLM determines which data source(s) are needed for the query
3. Route to one or more backends in parallel
4. Merge results from multiple sources into a unified context
Semantic Routing
Use embeddings to route queries without an LLM call. Pre-compute embeddings for example queries in each route, then match incoming queries to the nearest route by cosine similarity.
How It Works
1. Define routes with 5-10 example utterances each
2. Pre-compute embeddings for all example utterances
3. When a query arrives, embed it and find the nearest route centroid
4. Route to the matched pipeline — no LLM call needed, sub-10ms routing
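A semantic router reduces to centroid matching. A minimal sketch, where `embed` is an assumed callable (any embedding model returning a vector per string) and the route names are illustrative:

```python
import math

def _cos(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticRouter:
    """Route queries to the nearest route centroid. Embeddings for the
    example utterances are computed once up front; routing a query costs
    one embedding call and a handful of similarity comparisons."""
    def __init__(self, embed, routes: dict[str, list[str]]):
        self.embed = embed
        self.centroids: dict[str, list[float]] = {}
        for name, utterances in routes.items():
            vecs = [embed(u) for u in utterances]
            dim = len(vecs[0])
            # Centroid = element-wise mean of the example embeddings
            self.centroids[name] = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

    def route(self, query: str) -> str:
        q = self.embed(query)
        return max(self.centroids, key=lambda name: _cos(q, self.centroids[name]))
```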
Query Decomposition Strategies
A single embedding cannot capture all dimensions of a complex, multi-faceted query. Decomposition breaks queries into focused sub-queries, each producing a targeted embedding that matches the right documents.
Sequential Decomposition
Break a complex query into a chain of sub-queries where each step depends on the previous result. Best for queries with logical dependencies.
Parallel Decomposition
Break a query into independent sub-queries that can be retrieved in parallel, then merge results. Best for comparison and multi-faceted questions.
Step-Back Prompting
Instead of directly retrieving for a specific query, first generate a more general (abstracted) query, retrieve for that, then use the broader context to answer the specific question.
HyDE (Hypothetical Document Embeddings)
Generate a hypothetical answer to the query, embed that answer instead of the query, and use it for retrieval. The hypothetical answer is closer in embedding space to the actual documents.
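The HyDE flow is a short indirection before search. A sketch assuming three injected callables, all hypothetical here: `generate(prompt)` returns LLM text, `embed(text)` returns a vector, and `search(vector, k)` queries your vector index.

```python
def hyde_retrieve(query: str, generate, embed, search, k: int = 5):
    """Hypothetical Document Embeddings: embed a generated answer
    instead of the raw query, then search with that vector."""
    prompt = f"Write a short passage that answers the question:\n{query}"
    hypothetical_answer = generate(prompt)
    # The hypothetical answer sits closer in embedding space to real
    # answer-bearing documents than the (often terse) query does.
    vector = embed(hypothetical_answer)
    return search(vector, k)
```

The trade-off: one extra LLM call per query, in exchange for query vectors that look like documents rather than questions.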
The Full Multi-Step Retrieval Pipeline
In a production agentic RAG system, query routing and decomposition are just two steps in a six-stage pipeline. Each stage transforms the query or its results to maximize answer quality.
Query Analysis
Classify complexity, identify intent, detect entities
Route Selection
Pick the right retrieval strategy and data sources
Query Transformation
Decompose, expand, or rephrase for better retrieval
Parallel Retrieval
Execute sub-queries across selected data sources
Result Fusion
Merge, deduplicate, rerank, and filter results
Generation
Generate answer with faithfulness self-check
Key Insight
Query routing and decomposition have the highest ROI for complex, multi-faceted questions. If your queries are simple factual lookups, they add unnecessary latency. Profile your query distribution first: if more than 30% of queries are multi-hop or comparative, invest in query planning. For the rest, a semantic router (no LLM call) is sufficient.
Evaluating RAG Systems
Faithfulness, relevance, and answer quality
You cannot improve what you cannot measure. RAG evaluation requires separating retrieval quality from generation quality — a wrong answer could stem from bad retrieval (wrong documents) or bad generation (hallucination despite good documents). The RAGAS framework provides four metrics that isolate each failure mode.
Why “Vibes-Based” Evaluation Fails
Manual spot-checking creates a dangerous illusion of quality. A change that improves 10 queries but silently breaks 50 others goes undetected. Without systematic evaluation, you cannot tell if a new chunking strategy, embedding model, or prompt change actually improved your system. Evaluation is not optional — it is the foundation of reliable RAG.
The Four RAGAS Metrics
RAGAS (Retrieval Augmented Generation Assessment) decomposes RAG quality into four orthogonal dimensions. Together, they tell you exactly where your system is failing.
Faithfulness
How It's Computed
Decompose the answer into atomic claims. For each claim, check if it can be inferred from the provided context. Score = (supported claims) / (total claims).
What It Catches
Hallucination — the model generating facts not present in any retrieved document.
Diagnostic Signal
Low faithfulness with high context recall = generation problem. The model has the right context but fabricates beyond it.
Answer Relevancy
How It's Computed
Generate N questions from the answer using an LLM. Compute the mean cosine similarity between these generated questions and the original question. Higher similarity = more relevant answer.
What It Catches
Off-topic answers — the model using retrieved context to generate an answer about something adjacent but not what was asked.
Diagnostic Signal
Low relevancy with high faithfulness = the retrieved context was relevant to a different aspect than what the user asked about.
Context Precision
How It's Computed
For each retrieved document, check if it is relevant to the ground truth answer. Weight by rank position — relevant documents ranked higher contribute more to the score.
What It Catches
Noisy retrieval — irrelevant documents ranked above relevant ones, wasting context window space.
Diagnostic Signal
Low precision = retrieval is returning too many irrelevant documents. Improve your reranking or relevance thresholds.
Context Recall
How It's Computed
Decompose the ground truth answer into claims. For each claim, check if it can be attributed to any retrieved document. Score = (attributed claims) / (total ground truth claims).
What It Catches
Missing retrieval — relevant documents exist in the index but were not retrieved for this query.
Diagnostic Signal
Low recall with high precision = you are retrieving well but not enough. Increase topK or use query expansion.
Diagnostic Matrix: Where Is the Problem?
| Symptom | Metrics | Fix |
|---|---|---|
| Hallucinated facts | Low faithfulness, high context recall | Improve generation prompt. Add citation requirements. Use a more capable model. |
| Off-topic answers | Low relevancy, high faithfulness | Improve query transformation or retrieval strategy. The model answers what it has, not what was asked. |
| Missing information | Low context recall, high precision | Increase topK. Use query expansion or decomposition. Check if documents exist in the index. |
| Noisy retrieval | Low context precision, high recall | Add reranking. Raise relevance thresholds. Improve chunking quality. |
| Total failure | Low across all metrics | Fundamental pipeline issue. Check embedding model, chunking strategy, and index freshness. |
The RAG Evaluation Pipeline
Build a Golden Dataset
Create 50-200 question-answer pairs with ground truth answers. Include easy, medium, and hard questions. Cover edge cases and common failure modes.
- Start with real user questions from production logs
- Have domain experts write ground truth answers
- Include questions with no answer in the corpus (to test abstention)
- Version your dataset and grow it over time
Run the RAG Pipeline
For each question in the golden dataset, run your full RAG pipeline and capture: the generated answer, retrieved context documents, and any intermediate outputs.
- Log the full pipeline trace (query, retrieved docs, reranked docs, answer)
- Capture latency and token usage per step
- Run in a reproducible environment (same model version, same index)
Score with RAGAS Metrics
Compute faithfulness, answer relevancy, context precision, and context recall for each question. Aggregate across the full dataset.
- Set target thresholds: faithfulness > 0.85, relevancy > 0.80
- Break down scores by question category to find systematic weaknesses
- Track metrics over time to detect regressions
Diagnose and Iterate
Use the metric breakdown to identify whether problems are in retrieval or generation. Fix the weakest link first.
- Low context precision/recall = fix retrieval (chunking, embeddings, search)
- Low faithfulness = fix generation (prompt, model, or add citation requirements)
- Low relevancy = fix query transformation or system prompt
Evaluation Frameworks
TruLens
Open-source RAG evaluation with feedback functions for groundedness, relevance, and toxicity. Integrates with LlamaIndex and LangChain.
DeepEval
Pytest-like framework for LLM evaluation. Supports RAG-specific metrics, unit tests for LLM outputs, and CI/CD integration.
Phoenix (Arize)
Observability platform with RAG-specific tracing, evaluation, and debugging. Visualize retrieval quality and identify failure patterns.
promptfoo
CLI-based eval framework. Define test cases in YAML, run against multiple RAG configurations, compare results side-by-side.
Key Insight
The single most impactful thing you can do for your RAG system is build a golden evaluation dataset of 50-100 questions with ground truth answers. Run RAGAS metrics after every change. Set minimum thresholds (faithfulness > 0.85) and block deployments that regress. This is the RAG equivalent of unit tests — and like unit tests, the earlier you start, the more pain you avoid.
Production RAG Architecture
Scaling, caching, and real-time indexing
The gap between a RAG demo and a production RAG system is operational engineering: caching, indexing pipelines, monitoring, cost optimization, and scaling. A system that works on 100 test queries must work reliably on 100,000 diverse production queries per day — with predictable latency, cost, and quality.
Cache answers for semantically similar queries. Common questions get asked hundreds of times daily in production — semantic caching catches paraphrases that exact-match caching misses.
How It Works
- 1.Embed the incoming query
- 2.Search the cache index for queries with similarity > 0.95
- 3.If a cache hit with valid TTL exists, return the cached answer
- 4.On cache miss, run the full RAG pipeline
- 5.Store the query embedding, answer, and sources in the cache with a TTL
Impact
60-80% cost reduction in production. Reduces average latency from 2-5s to <100ms for cache hits.
Consideration
Set appropriate TTL (1-24 hours). Invalidate cache when source documents change. Monitor cache hit rate.
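The five steps above fit in a small in-memory sketch. Assumptions: embeddings come from an injected function (your embedding API in production), similarity is cosine, and the cache is a flat array rather than a real ANN index:

```typescript
// Minimal in-memory semantic cache implementing the five steps above.
type Embed = (text: string) => number[];

interface CacheEntry {
  vector: number[];
  answer: string;
  expiresAt: number;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

class SemanticCache {
  private entries: CacheEntry[] = [];
  private embed: Embed;
  private threshold: number;
  private ttlMs: number;

  constructor(embed: Embed, threshold = 0.95, ttlMs = 3_600_000) {
    this.embed = embed;
    this.threshold = threshold; // similarity required for a hit
    this.ttlMs = ttlMs;         // default TTL: 1 hour
  }

  get(query: string): string | null {
    const v = this.embed(query);                                  // 1. embed query
    const now = Date.now();
    this.entries = this.entries.filter((e) => e.expiresAt > now); // 3. drop expired
    const hit = this.entries.find(                                // 2. similarity search
      (e) => cosine(v, e.vector) >= this.threshold,
    );
    return hit ? hit.answer : null;     // 4. null means: run the full pipeline
  }

  set(query: string, answer: string): void {
    this.entries.push({                 // 5. store embedding, answer, and TTL
      vector: this.embed(query),
      answer,
      expiresAt: Date.now() + this.ttlMs,
    });
  }
}
```

For cache invalidation on source changes, clear or filter `entries` when the indexing pipeline reports updated documents.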
Instead of re-indexing the entire corpus when documents change, detect changes and update only the affected vectors. Keeps the index fresh without full rebuilds.
How It Works
- 1.Hash each document's content at ingestion time
- 2.On update, compare content hashes to detect which documents changed
- 3.Delete old vectors for changed documents
- 4.Re-chunk, re-embed, and upsert only the changed documents
- 5.Handle deletions by removing vectors for deleted source documents
Impact
Index updates go from hours (full re-index) to minutes (incremental). Source freshness improves from daily to near real-time.
Consideration
Maintain a document-to-vector mapping for deletion. Use content hashing, not timestamps, to detect changes.
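Steps 1 and 2 can be sketched as a content-hash diff. Assumes the previous hash map is persisted alongside the index; `Doc` and the function names are illustrative:

```typescript
import { createHash } from "node:crypto";

// Content-hash diff for incremental indexing. `previous` is the
// docId -> hash map saved from the last indexing run.
interface Doc {
  id: string;
  text: string;
}

const contentHash = (text: string): string =>
  createHash("sha256").update(text).digest("hex");

function diffCorpus(
  previous: Map<string, string>,
  current: Doc[],
): { changed: Doc[]; deleted: string[] } {
  // Changed or new: hash differs from (or is absent in) the last run
  const changed = current.filter(
    (d) => previous.get(d.id) !== contentHash(d.text),
  );
  // Deleted: indexed previously but no longer in the source corpus
  const currentIds = new Set(current.map((d) => d.id));
  const deleted = [...previous.keys()].filter((id) => !currentIds.has(id));
  return { changed, deleted };
}
```

Only `changed` documents get re-chunked, re-embedded, and upserted; `deleted` ids drive vector removal through the document-to-vector mapping.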
Use multiple retrieval tiers: a fast, approximate first pass (HNSW) followed by an exact re-scoring pass. For extremely large indices, add a pre-filter stage using metadata.
How It Works
- 1.Tier 1: Metadata pre-filter (namespace, date range, document type) — milliseconds
- 2.Tier 2: ANN search (HNSW) on filtered subset — 10-50ms for millions of vectors
- 3.Tier 3: Cross-encoder reranking on top-20 results — 50-200ms
- 4.Tier 4: Contextual compression to extract relevant sentences — 100-500ms
Impact
Handles 100M+ vector indices with sub-second latency. Each tier narrows the search space for the next.
Consideration
Profile each tier's latency. Set timeouts. Use async/parallel where possible. The reranking tier is usually the bottleneck.
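A sketch of the tiers as one pipeline, with brute-force cosine standing in for the HNSW ANN search and an injected scoring function standing in for the cross-encoder (tier 4, compression, is omitted):

```typescript
// Tiered retrieval: metadata filter -> vector search -> rerank.
// In production, tier 2 is an ANN index query, not a full scan.
interface Chunk {
  text: string;
  vector: number[];
  meta: { docType: string; updated: string };
}
type Rerank = (query: string, chunk: Chunk) => number;

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

function tieredRetrieve(
  query: string,
  queryVector: number[],
  index: Chunk[],
  filter: (c: Chunk) => boolean,
  rerank: Rerank,
  annK = 20,  // survivors of the vector-search tier
  finalK = 5, // survivors of the reranking tier
): Chunk[] {
  // Tier 1: metadata pre-filter
  const filtered = index.filter(filter);
  // Tier 2: vector similarity (ANN stand-in), keep top annK
  const candidates = filtered
    .map((c) => ({ c, s: cosine(queryVector, c.vector) }))
    .sort((x, y) => y.s - x.s)
    .slice(0, annK)
    .map((x) => x.c);
  // Tier 3: cross-encoder-style rerank, keep top finalK
  return candidates
    .map((c) => ({ c, s: rerank(query, c) }))
    .sort((x, y) => y.s - x.s)
    .slice(0, finalK)
    .map((x) => x.c);
}
```

Each tier narrows the candidate set for the next, which is exactly why per-tier latency profiling matters.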
Production Indexing Architecture
A production RAG system is not just a vector database. It is a pipeline of components that must work together reliably.
Document Watcher
Monitors source systems (S3, databases, wikis, git repos) for changes. Emits events for created, updated, and deleted documents.
Chunking Pipeline
Receives raw documents, applies format-specific parsing, runs the chunking strategy, and enriches chunks with metadata.
Embedding Service
Batches chunks and generates embeddings. Rate-limited to respect API quotas. Supports multiple embedding models for A/B testing.
Vector Store
Stores vectors with metadata, supports ANN search, handles upserts and deletions. The primary retrieval backend.
Cache Layer
Semantic cache for query-answer pairs. Exact cache for frequent identical queries. Invalidated on source document changes.
Monitoring & Alerts
Tracks indexing lag, retrieval latency (p50/p95/p99), cache hit rate, embedding costs, and RAGAS metric trends.
Operational Metrics to Track
| Metric | Target | Why |
|---|---|---|
| Retrieval Latency (p50) | < 100ms | User-perceived speed. p50 represents the typical experience. |
| Retrieval Latency (p99) | < 500ms | Tail latency. Worst-case user experience. Set alerts here. |
| Cache Hit Rate | > 40% | Higher = more cost savings. Below 20% means caching isn't helping. |
| Index Freshness | < 15 min lag | Time between source document change and index update. |
| Faithfulness Score | > 0.85 | RAGAS faithfulness on production samples. Below 0.8 = hallucination risk. |
| Context Precision | > 0.75 | Are retrieved documents actually relevant? Tracks retrieval quality. |
| Embedding Cost / 1K queries | Budget-dependent | Track cost per query to detect inefficiencies or budget overruns. |
| Error Rate | < 0.1% | Failed retrievals, timeouts, or pipeline errors. |
Cost Optimization Strategies
Reduce embedding dimensions
50-92% storage savings. Use Matryoshka embeddings to reduce from 3072d to 1024d or 256d with minimal quality loss. Measure before committing.
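A sketch of the truncation itself; this is only valid for models trained with Matryoshka representation learning, so verify your embedding model supports it before committing:

```typescript
// Matryoshka-style dimension reduction: keep the leading dimensions,
// then re-normalize so cosine similarity remains well-behaved.
function truncateEmbedding(v: number[], dims: number): number[] {
  const head = v.slice(0, dims);
  const norm = Math.sqrt(head.reduce((s, x) => s + x * x, 0)) || 1;
  return head.map((x) => x / norm);
}
```

The truncated vectors must be used consistently: re-index with the same dimensionality you query with.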
Semantic caching
60-80% LLM cost savings. Cache answers for similar queries. The highest-ROI optimization for production RAG systems with repeated query patterns.
Batch embedding
30-50% embedding cost savings. Batch chunks during indexing instead of embedding one at a time. Most APIs offer batch discounts or higher throughput.
Tiered models
40-70% LLM cost savings. Use a small, fast model (GPT-4o-mini, Claude Haiku) for simple queries and route complex queries to larger models.
Contextual compression
30-60% token cost savings. Extract only relevant sentences from retrieved chunks before sending to the LLM. Reduces input tokens significantly.
Key Insight
The three highest-ROI production investments are: (1) semantic caching — reduces cost and latency for repeated queries, (2) incremental indexing — keeps your knowledge base fresh, and (3) RAGAS monitoring — catches quality regressions before users do. Start with these three before optimizing anything else.
Advanced Patterns
Graph RAG, multimodal RAG, conversational RAG
Beyond standard agentic RAG, advanced patterns tackle specialized challenges: retrieving over knowledge graphs, handling images and tables, maintaining multi-turn conversations, and building hierarchical document representations. These patterns address the long tail of RAG failures that basic pipelines cannot solve.
Graph RAG
Graph RAG (Microsoft Research, 2024) builds a knowledge graph from your documents and uses graph-based retrieval alongside vector search. It excels at global, thematic questions that require synthesizing information across many documents — a task where standard RAG fails.
How It Works
- 1.Extract entities and relationships from documents using an LLM
- 2.Build a knowledge graph connecting entities across the entire corpus
- 3.Detect communities (clusters of related entities) using graph algorithms
- 4.Generate summaries for each community at multiple levels of abstraction
- 5.At query time, traverse the graph to retrieve relevant community summaries
Strengths
- + Answers global questions ('What are the main themes across all documents?')
- + Captures relationships that vector search misses
- + Provides multi-hop reasoning paths
- + Community summaries reduce context window usage
Weaknesses
- - Expensive to build — requires many LLM calls for entity extraction
- - Graph construction is slow for large corpora
- - Requires maintenance as documents change
- - Overkill for simple factual QA
RAPTOR: Hierarchical Retrieval
RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) builds a hierarchical tree of document summaries. Leaf nodes are individual chunks, intermediate nodes are cluster summaries, and the root captures the full corpus theme.
| Level | Content | Granularity | Best For |
|---|---|---|---|
| Leaf | Individual document chunks (original text) | Most specific | Detailed factual questions |
| Cluster | Summaries of related chunks (5-10 chunks per cluster) | Medium | Topic-level questions |
| Section | Summaries of related clusters | Broad | Thematic questions |
| Root | Summary of the entire corpus | Broadest | Global overview questions |
Multimodal RAG
Multimodal RAG extends retrieval to images, tables, charts, and diagrams alongside text. Critical for domains where knowledge is encoded visually — technical documentation, medical imaging, financial reports with charts, and slide decks.
Approaches (from simple to sophisticated)
Text-Only Extraction
Convert images/tables to text descriptions, embed as text. Simple but lossy — complex diagrams lose information.
Multimodal Embeddings
Use models like CLIP or Jina CLIP to embed images and text into the same vector space. Query with text, retrieve images.
Vision LLM Summarization
Use GPT-4V or Claude to generate detailed descriptions of images/charts. Embed these rich descriptions alongside source images.
Native Multimodal RAG
Store images as-is and pass them directly to a multimodal LLM (GPT-4V, Claude) alongside text chunks in the context window.
Conversational RAG
Conversational RAG handles multi-turn conversations where follow-up questions reference previous context. The key challenge: 'Tell me more about that' requires resolving 'that' to the subject from the previous turn before retrieval.
Key Challenges
Coreference Resolution
'What about their pricing?' — 'their' refers to a company mentioned 3 turns ago. The retrieval query must resolve the reference before embedding.
Solution
Rewrite the query to be standalone: 'What is [Company X] pricing?' using the conversation history.
Context Accumulation
Over multiple turns, the user builds up a complex information need. The final query only makes sense in context of the full conversation.
Solution
Maintain a running conversation summary. Use it to enrich each query with accumulated context.
Topic Drift
The user gradually shifts topics. The retrieval system must detect when the topic changes and avoid retrieving context from the old topic.
Solution
Track topic boundaries in the conversation. Reset retrieval context when a new topic begins.
Conversational RAG Pipeline
- 1.Receive follow-up question in conversation context
- 2.Rewrite the question as a standalone query using conversation history
- 3.Retrieve using the standalone query (not the original follow-up)
- 4.Generate answer with both retrieved context and conversation history
- 5.Update conversation summary for the next turn
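Step 2 is the heart of the pipeline. A sketch with the LLM injected as a plain function so the orchestration is testable; production calls would be async, and the prompt wording is illustrative:

```typescript
// Standalone-query rewriting for conversational RAG (step 2 above).
type Llm = (prompt: string) => string;

interface Turn {
  role: "user" | "assistant";
  content: string;
}

function rewriteToStandalone(
  followUp: string,
  history: Turn[],
  llm: Llm,
): string {
  if (history.length === 0) return followUp; // first turn is already standalone
  const transcript = history
    .map((t) => `${t.role}: ${t.content}`)
    .join("\n");
  const prompt =
    "Given the conversation below, rewrite the final question so it can " +
    "be understood with no prior context. Resolve all pronouns and " +
    "references to earlier turns.\n\n" +
    `${transcript}\n\nFinal question: ${followUp}\n\nStandalone question:`;
  return llm(prompt).trim();
}
```

The rewritten query is what gets embedded and retrieved; the original follow-up and history still go to the generation step.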
Emerging Patterns
Late Chunking
Emerging: Embed the full document first using a long-context embedding model, then pool the token-level embeddings into chunk-level representations. Each chunk's embedding is contextualized by the full document.
Speculative RAG
Research: Generate multiple draft answers from different subsets of retrieved documents using a smaller model. A larger model then verifies and selects the best draft. Reduces hallucination while managing cost.
RAG + Fine-Tuning Hybrid
Production: Fine-tune a model to be a better RAG consumer: follow citations more strictly, abstain when context is insufficient, and maintain output format. The fine-tuning improves RAG behavior, not knowledge.
Agentic Indexing
Emerging: Use an LLM agent to actively manage the index: identify knowledge gaps, suggest new documents to ingest, detect outdated content, and generate synthetic QA pairs for under-covered topics.
Key Insight
Advanced patterns solve real problems, but they add complexity. Graph RAG is worth it when you need cross-document synthesis (enterprise knowledge bases, research corpora). Multimodal RAG is essential when your knowledge includes images, charts, or tables. Conversational RAG is required for any chatbot. In every case, start with the simplest pattern that meets your needs and upgrade based on evaluation metrics, not intuition.
Anti-Patterns & Failure Modes
The most common RAG mistakes and how to fix them
Knowing what not to do is as important as knowing what to do. These are the most common RAG failure modes, distilled from production systems, research papers, and the RAG community. Each pattern has been observed to cause significant quality degradation in real deployments.
Retrieved chunks are fragments of different documents mashed together without coherence, losing the logical flow of information.
Cause
Fixed-size character splitting that breaks mid-sentence, no overlap between chunks, no metadata preserving document structure or section hierarchy.
Symptom
Answers contain contradictory statements from different sources mixed together. The model stitches together unrelated fragments into plausible-sounding but incorrect answers.
Fix
Use recursive or semantic chunking that respects document boundaries. Add 10-20% overlap. Preserve parent document metadata and section headers in each chunk. Consider document-level summaries alongside chunk-level retrieval.
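A minimal sliding-window chunker with the recommended overlap, using words as a stand-in for tokens to keep the sketch dependency-free:

```typescript
// Sliding-window chunking with overlap. Swap the word split for a real
// tokenizer in practice; the 100/500 defaults give ~20% overlap.
function chunkWithOverlap(
  text: string,
  chunkSize = 500, // words per chunk
  overlap = 100,   // words shared with the previous chunk
): string[] {
  if (overlap >= chunkSize) throw new Error("overlap must be < chunkSize");
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // last window reached the end
  }
  return chunks;
}
```

Recursive and semantic chunkers refine this further by preferring to break at paragraph and sentence boundaries rather than fixed offsets.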
The embedding model's training domain doesn't match your document domain, causing poor semantic similarity scores for relevant documents.
Cause
Using a general-purpose embedding model (trained on web text) for specialized domains like medical, legal, or financial documents. The model's vector space doesn't capture domain-specific semantic relationships.
Symptom
Retrieval returns semantically adjacent but factually wrong documents. Medical queries about 'hypertension treatment' retrieve docs about 'stress management' because the general model conflates the concepts.
Fix
Benchmark multiple embedding models on YOUR data before committing. Use domain-specific models when available (e.g., PubMedBERT for medical). Fine-tune embeddings on domain-specific pairs. Test with MTEB or custom eval sets.
The model generates confident answers that appear grounded in retrieved context but actually fabricate claims not present in any source document.
Cause
No faithfulness checking between the generated answer and retrieved sources. The model fills gaps in retrieved context with plausible-sounding but fabricated information, especially when context is partially relevant.
Symptom
Answers contain specific numbers, dates, or claims that sound authoritative but don't appear in any retrieved document. Users trust these answers because they're in a 'RAG system' that should be grounded.
Fix
Implement faithfulness scoring (RAGAS). Decompose answers into atomic claims and verify each against source documents. Instruct the model to say 'I don't have enough information' when context is insufficient. Add citation requirements.
The vector index contains outdated, duplicate, or irrelevant documents that dilute retrieval quality and increase costs.
Cause
No document lifecycle management. Old versions of documents coexist with new versions. Duplicate content from multiple ingestion runs. No garbage collection for deleted source documents.
Symptom
Retrieval returns outdated information alongside current data. Answers reference deprecated policies, old product features, or superseded documentation. Index costs grow linearly while quality degrades.
Fix
Implement document versioning with metadata filters. Use content hashing to prevent duplicate ingestion. Build a deletion pipeline that removes vectors when source documents are updated or removed. Schedule periodic index audits.
Sending user queries directly to the vector search without any transformation, assuming the user's phrasing will match document phrasing.
Cause
No query preprocessing, expansion, or decomposition. Users ask questions in natural language ('why is my bill so high?') while documents use formal language ('billing adjustment procedures').
Symptom
Poor retrieval recall — relevant documents exist in the index but aren't retrieved because the user's language doesn't semantically match the document language. Users report 'the system doesn't know things it should.'
Fix
Implement query transformation: HyDE (generate a hypothetical document, embed that instead), query expansion (add synonyms/related terms), query decomposition for complex questions, or step-back prompting to generalize specific queries.
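HyDE in particular is simple to wire up: generate a hypothetical answer document, then embed that instead of the raw query. A sketch with the LLM and embedder injected as plain functions (production calls would be async; the prompt wording is illustrative):

```typescript
// HyDE sketch: embed a hypothetical answer document, not the raw query.
type Llm = (prompt: string) => string;
type Embed = (text: string) => number[];

function hydeQueryVector(query: string, llm: Llm, embed: Embed): number[] {
  const hypothetical = llm(
    "Write a short passage that would answer this question, phrased " +
      `as it might appear in formal documentation:\n\n${query}`,
  );
  // Embedding the hypothetical document moves the query into "document
  // space", where it matches formal phrasing better than user phrasing.
  return embed(hypothetical);
}
```

For the billing example above, the hypothetical passage would read like the 'billing adjustment procedures' doc, so its embedding lands near the right documents even though the user's question never used those words.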
Retrieving too many chunks and stuffing them all into the context, exceeding the model's effective attention span even if within token limits.
Cause
Setting topK too high (20+) without reranking or relevance filtering. Assuming more context is better. Not understanding the 'lost in the middle' phenomenon where models ignore information in the middle of long contexts.
Symptom
Model ignores key information buried among irrelevant chunks. Answer quality actually degrades as you add more retrieved documents. Performance is worse with 20 chunks than with 5 well-chosen ones.
Fix
Retrieve broadly (top-20), then rerank to top-5. Set minimum similarity thresholds. Use contextual compression to extract only the relevant sentences from each chunk. Test answer quality at different topK values to find the optimum.
Interactive Code Examples
Naive vs. production RAG patterns side by side
See RAG patterns in action. Each example shows a naive approach and its production-grade fix. Toggle between them to understand the difference.
The difference between dumping documents and structured retrieval
// BAD: Dump all documents into context
async function askQuestion(question: string) {
  const allDocs = await db.collection("docs").find({}).toArray();
  const response = await llm.generate({
    system: "Answer the question using these docs.",
    messages: [
      {
        role: "user",
        content: `Docs: ${allDocs.map((d) => d.text).join("\n")}
Question: ${question}`,
      },
    ],
  });
  return response;
}

Why this fails
Dumping all documents into the context wastes tokens, causes context rot, and often exceeds window limits. The model can't distinguish relevant from irrelevant content, leading to hallucination and degraded answers.
All Examples Quick Reference
Naive RAG vs. Structured RAG
The difference between dumping documents and structured retrieval
Chunking: Fixed vs. Semantic
How you split documents determines retrieval quality
Dense-Only vs. Hybrid Search
Why vector search alone misses keyword-dependent queries
One-Shot RAG vs. Self-RAG
Letting the agent decide when and what to retrieve
No Evaluation vs. RAGAS Evaluation
How to measure if your RAG system actually works
Single Query vs. Query Decomposition
Complex questions need to be broken into sub-queries
No Cache vs. Semantic Caching
Avoid redundant embedding and retrieval calls
Best Practices Checklist
Production-ready guidelines for every RAG stage
Production-ready guidelines distilled from LlamaIndex, LangChain, Pinecone, Weaviate, and the broader RAG research community. These are the patterns that separate demo-quality RAG from systems that work reliably at scale.
Match chunk size to your use case
QA tasks work best with 256-512 token chunks. Summarization needs 1024-2048. Test chunk sizes on your actual queries — there is no universal optimum.
Always use overlap between chunks
10-20% overlap (e.g., 100 tokens for 500-token chunks) prevents information loss at chunk boundaries. This is the single highest-ROI chunking improvement.
Preserve metadata in every chunk
Include source document title, section header, page number, last updated date, and document type. Metadata enables filtering and improves answer attribution.
Consider parent-child chunk hierarchies
Index small chunks for precise retrieval but return their parent (larger) chunk for context. This gives you the best of both worlds: precise matching with sufficient context.
Use hybrid search (dense + sparse) by default
Hybrid search catches both semantic and keyword matches. Research consistently shows it outperforms either approach alone, especially for mixed query types.
Always add a reranking step
First-stage retrieval (vector search) is fast but imprecise. A cross-encoder reranker (Cohere, ColBERT, BGE) re-scores results for the final top-k selection. This typically improves precision by 15-30%.
Set minimum relevance thresholds
Don't inject documents below a similarity threshold (e.g., 0.7). Irrelevant context is worse than no context — it actively misleads the model.
Implement Maximal Marginal Relevance (MMR)
MMR balances relevance with diversity in retrieved results. Without it, you get five near-identical chunks about the same subtopic while missing other relevant information.
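The MMR selection loop itself is short. A sketch assuming candidates arrive with precomputed relevance scores and vectors:

```typescript
// Maximal Marginal Relevance: iteratively pick the candidate that is
// relevant to the query but dissimilar to what is already selected.
// lambda = 1 is pure relevance; lambda = 0 is pure diversity.
interface Scored {
  id: string;
  vector: number[];
  relevance: number; // query-document similarity from first-stage retrieval
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

function mmr(candidates: Scored[], k: number, lambda = 0.7): Scored[] {
  const selected: Scored[] = [];
  const pool = [...candidates];
  while (selected.length < k && pool.length > 0) {
    let bestIdx = 0;
    let bestScore = -Infinity;
    for (let i = 0; i < pool.length; i++) {
      // Penalty: similarity to the most similar already-selected doc
      const maxSim = selected.length
        ? Math.max(...selected.map((s) => cosine(pool[i].vector, s.vector)))
        : 0;
      const score = lambda * pool[i].relevance - (1 - lambda) * maxSim;
      if (score > bestScore) {
        bestScore = score;
        bestIdx = i;
      }
    }
    selected.push(pool.splice(bestIdx, 1)[0]);
  }
  return selected;
}
```

With lambda around 0.7, a near-duplicate of an already-selected chunk loses to a less relevant but novel one, which is exactly the behavior this practice calls for.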
Require source citations in every answer
Instruct the model to cite which source each claim comes from (by number or title). This enables verification and naturally reduces hallucination because the model must ground each claim.
Implement faithfulness checking
Score generated answers against retrieved context using RAGAS faithfulness metric. Decompose answers into claims and verify each against sources. Flag answers with faithfulness below 0.8.
Tell the model when to say 'I don't know'
Explicitly instruct: 'If the provided sources don't contain enough information to answer, say so rather than guessing.' This simple instruction dramatically reduces hallucination.
Separate retrieval evaluation from generation evaluation
A wrong answer can stem from bad retrieval OR bad generation. Evaluate context precision/recall independently from answer faithfulness/relevancy to diagnose which stage is failing.
Implement semantic caching
Cache answers for semantically similar queries. In production, common questions get asked hundreds of times daily. Semantic caching can reduce RAG pipeline costs by 60-80%.
Build an incremental indexing pipeline
Don't re-index your entire corpus for every update. Use content hashing to detect changes, update only modified documents, and handle deletions. This keeps your index fresh without rebuilding from scratch.
Monitor retrieval quality in production
Track metrics like average relevance score, cache hit rate, retrieval latency (p50/p95/p99), and user feedback signals. Set alerts for relevance score drops.
Version your embeddings
When you change embedding models, you must re-embed your entire corpus. Version your indexes so you can roll back. Never mix embeddings from different models in the same index.
Start with naive RAG, add agentic layers as needed
Self-RAG, CRAG, and adaptive RAG add complexity and latency. Start simple. Add agentic retrieval only when evaluation shows naive RAG is insufficient for your queries.
Use query routing for heterogeneous data sources
If you have SQL databases, vector stores, and knowledge graphs, route queries to the right backend. An LLM classifier or keyword-based router can determine the best retrieval strategy per query.
Implement corrective RAG for high-stakes applications
For medical, legal, or financial RAG, add a document relevance check after retrieval and before generation. If retrieved docs aren't relevant enough, fall back to web search or escalate to a human.
Set max retrieval iterations for agentic patterns
Agentic RAG patterns that retry retrieval (Self-RAG, CRAG) need a circuit breaker. Set a maximum of 3 retrieval iterations to prevent infinite loops and runaway costs.
The Guiding Principle
RAG quality is a compounding function of every pipeline stage. Good chunking makes embeddings more meaningful. Good embeddings make retrieval more precise. Good retrieval makes reranking more effective. Good reranking makes generation more faithful. Investing in the first stage (chunking) has the highest ROI because it multiplies through every downstream stage.
— Adapted from LlamaIndex, Pinecone, and Weaviate best practices
Resources & Further Reading
Papers, docs, and guides
Essential research papers, documentation, and guides for building production-grade RAG systems. These are the sources referenced throughout this academy.
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
The foundational Self-RAG paper by Asai et al. Introduces reflection tokens that let the model decide when to retrieve and verify its own outputs.
Corrective Retrieval Augmented Generation (CRAG)
Introduces a retrieval evaluator that scores document relevance and triggers corrective actions — knowledge refinement or web search fallback.
RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
Clusters and summarizes documents into a tree hierarchy, enabling retrieval at different levels of abstraction for both detailed and thematic queries.
RAGAS: Automated Evaluation of Retrieval Augmented Generation
Defines the four core RAG evaluation metrics: faithfulness, answer relevancy, context precision, and context recall. The standard framework for RAG evaluation.
LlamaIndex RAG Documentation
Comprehensive documentation covering RAG pipeline components, chunking strategies, query engines, and agentic RAG patterns with LlamaIndex.
LangChain RAG Tutorial
Step-by-step RAG tutorial covering document loading, splitting, embedding, retrieval, and generation with LangChain.
Pinecone RAG Guide
Production-focused RAG guide from Pinecone covering architecture patterns, scaling considerations, and optimization strategies.
Weaviate Hybrid Search
Deep dive into hybrid search combining dense vectors with BM25 sparse retrieval, including Reciprocal Rank Fusion algorithms.
Graph RAG: Unlocking LLM Discovery on Narrative Private Data
Microsoft's Graph RAG approach that builds a knowledge graph from documents and uses community summaries for global question answering.
Chunking Strategies for LLM Applications
Practical guide comparing fixed-size, recursive, semantic, and document-aware chunking strategies with benchmarks.
Suggested Reading Order
Start with the LangChain RAG Tutorial for hands-on basics. Read the RAGAS paper to understand evaluation. Then dive into Self-RAG and CRAG for agentic patterns. Finally, explore Graph RAG for advanced cross-document synthesis.