Agentic
RAG
Academy
Go beyond naive RAG. Master agentic retrieval patterns — from chunking and embeddings to self-correcting retrieval, query planning, and production-grade architecture.
RAG Fundamentals
Retrieval-Augmented Generation from first principles
Retrieval-Augmented Generation (RAG) is a technique that gives LLMs access to external knowledge at inference time. Instead of relying solely on what the model memorized during training, RAG retrieves relevant documents and injects them into the context window before generation.
The term was coined by Patrick Lewis et al. at Meta AI in 2020. The core insight: LLMs are powerful reasoning engines, but their parametric knowledge is static, incomplete, and prone to hallucination. RAG decouples knowledge storage (the retrieval index) from knowledge reasoning (the LLM), making both independently improvable.
Why Naive RAG Fails
Most teams start with “naive RAG” — embed documents, retrieve top-k, stuff into prompt, generate. This works for demos but fails in production for predictable reasons:
Poor chunking
Splits mid-sentence, destroys context, creates incoherent fragments
No relevance filtering
Returns all top-k results even when none are relevant, injecting noise
Missing reranking
First-stage vector search is fast but imprecise — wrong docs rank high
No faithfulness check
Model hallucinates beyond the retrieved context with no verification
The RAG Pipeline
Every RAG system follows three stages. The quality of each stage compounds — poor indexing guarantees poor retrieval, which guarantees poor generation.
Indexing
Split documents into chunks, generate embeddings for each chunk, and store them in a vector database with metadata.
- Load documents (PDFs, HTML, Markdown, databases)
- Split into semantically meaningful chunks
- Generate vector embeddings for each chunk
- Store vectors + metadata in a vector database
Retrieval
When a query arrives, embed it, search the vector database for similar chunks, and return the top-k most relevant results.
- Embed the user query into the same vector space
- Search the vector index for nearest neighbors
- Apply filters (metadata, relevance threshold)
- Optionally rerank results for precision
Generation
Inject the retrieved chunks into the LLM's context window alongside the user query, and generate a grounded response.
- Format retrieved chunks with source attribution
- Construct a prompt with context + query
- Generate a response grounded in the sources
- Optionally verify faithfulness of the answer
RAG vs. Fine-Tuning: When to Use Each
| Aspect | RAG | Fine-Tuning |
|---|---|---|
| Knowledge updates | Instant — update the index | Requires retraining the model |
| Source attribution | Natural — can cite sources | Not possible — knowledge is in weights |
| Cost | Per-query retrieval + generation | High upfront training, cheap inference |
| Hallucination | Reduced (but not eliminated) | Common — model confabulates |
| Best for | Factual QA, documentation, support | Style, format, behavior changes |
Rule of thumb: Use RAG when the model needs to access specific, updatable facts. Use fine-tuning when you need to change the model's behavior, style, or output format. In many production systems, you use both — fine-tune for behavior, RAG for knowledge.
Key Insight
RAG doesn't eliminate hallucination — it reduces it by grounding the model in retrieved evidence. The gap between a demo RAG system and a production RAG system is the difference between “it works on my 10 test questions” and “it works reliably on 10,000 diverse user queries.” Closing that gap requires systematic chunking, retrieval, evaluation, and iteration.
“RAG is not just a technique — it's a paradigm shift. Instead of cramming knowledge into model weights, you give the model a library card.”
Jerry Liu
Co-founder & CEO, LlamaIndex
“The biggest mistake teams make with RAG is treating retrieval as a solved problem. Your RAG system is only as good as the retrieval step.”
Harrison Chase
Co-founder & CEO, LangChain
“Vector search alone is not enough. The future of RAG is hybrid search combined with learned reranking — that's where the real accuracy gains come from.”
Bob van Luijt
Co-founder & CEO, Weaviate
Chunking Strategies
Semantic, recursive, agentic chunking
Chunking is the process of splitting documents into smaller pieces for indexing and retrieval. It is the most under-appreciated step in the RAG pipeline — and the one with the highest ROI when done right. Poor chunking creates “chunk soup” — incoherent fragments that poison retrieval quality.
Chunk Overlap: The Single Highest-ROI Improvement
Without overlap, information at chunk boundaries is split between two chunks. Neither chunk has the complete thought, so retrieval misses it. A 10-20% overlap (e.g., 100-200 tokens for 1000-token chunks) ensures boundary information appears in at least one complete chunk.
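The idea is simple enough to show in a few lines. Below is a minimal sketch of fixed-size chunking with overlap, using character counts for clarity; token-based sizing works the same way if you substitute a tokenizer's counts. The size and overlap defaults are illustrative, not recommendations.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Split `text` into chunks of `chunk_size` characters, each sharing
    `overlap` characters with the previous chunk so boundary information
    appears complete in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # avoid emitting a trailing chunk that is pure overlap
    return chunks
```

Each chunk's first `overlap` characters repeat the tail of the previous chunk, so a sentence that straddles a boundary is retrievable from at least one side.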
Chunking Strategies Compared
Fixed-Size Chunking
Split text into chunks of a fixed character or token count. The simplest approach, but the lowest quality.
Pros
- Fast and deterministic
- Easy to implement
- Predictable chunk sizes
Cons
- Splits mid-sentence and mid-paragraph
- Destroys semantic coherence
- No awareness of document structure
Recursive Character Splitting
Split on a hierarchy of separators (paragraphs > sentences > words), trying the largest separator first and falling back to smaller ones.
Pros
- Respects paragraph and sentence boundaries
- Configurable separator hierarchy
- Good balance of quality and simplicity
Cons
- Still uses a fixed target chunk size
- May not respect semantic boundaries perfectly
- Overlap can duplicate content
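The fallback logic above can be sketched as a short recursive function. This is a simplified illustration, not a specific library's implementation; the separator hierarchy and size limit are assumed defaults.

```python
def recursive_split(text: str, max_size: int = 500,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the largest separator first; recurse with smaller
    separators for any piece that still exceeds max_size."""
    if len(text) <= max_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard character split.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]
    sep, rest = separators[0], separators[1:]
    chunks, buffer = [], ""
    for piece in text.split(sep):
        candidate = buffer + sep + piece if buffer else piece
        if len(candidate) <= max_size:
            buffer = candidate  # keep packing pieces into the current chunk
        else:
            if buffer:
                chunks.append(buffer)
            buffer = ""
            if len(piece) > max_size:
                # Piece itself is too big: recurse with the next separator.
                chunks.extend(recursive_split(piece, max_size, rest))
            else:
                buffer = piece
    if buffer:
        chunks.append(buffer)
    return chunks
```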
Semantic Chunking
Use embeddings to find natural breakpoints. Embed each sentence, then split where the cosine similarity between consecutive sentences drops below a threshold.
Pros
- Chunks align with topic boundaries
- Variable chunk sizes that match content
- No arbitrary size limits
Cons
- Requires embedding every sentence (slow for large corpora)
- Needs a similarity threshold to tune
- More complex to implement
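As a sketch, the breakpoint logic looks like this. The `embed` argument is an assumed callable (any sentence-embedding model that returns a vector per string); the 0.7 threshold is a placeholder you would tune on your corpus.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.7) -> list[str]:
    """Group consecutive sentences; start a new chunk whenever similarity
    between neighboring sentence embeddings drops below `threshold`."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append(" ".join(current))  # similarity dropped: topic boundary
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```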
Agentic Chunking
Use an LLM to determine chunk boundaries. The model reads through the document and decides where semantically complete units begin and end.
Pros
- Highest semantic coherence
- Handles complex document structures
- Can add chunk-level summaries
Cons
- Expensive — requires LLM calls per document
- Slow for large corpora
- Non-deterministic chunking
Document Structure-Aware Chunking
Use document format-specific parsers (Markdown headers, HTML tags, PDF sections, code AST) to split at structural boundaries.
Pros
- Respects actual document structure
- Headers become natural chunk boundaries
- Preserves hierarchical context
Cons
- Requires format-specific parsers
- Sections may exceed target chunk size
- Doesn't work for unstructured text
Chunk Size vs. Retrieval Quality
Smaller chunks are more precise (match specific facts) but have lower recall (miss surrounding context). Larger chunks provide more context but are less precise. There is no universal optimum — it depends on your use case.
| Chunk Size | Precision | Recall | Best For |
|---|---|---|---|
| 128 tokens | High | Low | Exact fact lookup |
| 256 tokens | High | Medium | QA, chatbots |
| 512 tokens | Medium | Medium | General purpose |
| 1024 tokens | Medium | High | Summarization |
| 2048 tokens | Low | High | Document-level context |
Metadata Preservation
Every chunk should carry metadata: source document title, section header, page number, chunk index, and last updated date. Without metadata, you can't filter by recency, can't attribute sources, and can't implement document versioning. The parent-child pattern — index small chunks for precise retrieval but return the parent chunk for context — gives you the best of both worlds.
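The parent-child pattern can be sketched with two small functions. The in-memory dicts stand in for a vector store's payload/metadata; all names here are illustrative, not a specific database's API.

```python
def build_parent_child_index(parents: dict[str, str], child_size: int = 200):
    """Index small child chunks (what gets embedded) while recording
    each child's parent so retrieval hits can be widened for context."""
    children = {}          # child_id -> child text to embed
    child_to_parent = {}   # child_id -> parent_id (stored as metadata)
    for parent_id, text in parents.items():
        for i in range(0, len(text), child_size):
            child_id = f"{parent_id}::chunk{i // child_size}"
            children[child_id] = text[i:i + child_size]
            child_to_parent[child_id] = parent_id
    return children, child_to_parent

def resolve_to_parents(hit_ids: list[str], child_to_parent: dict[str, str],
                       parents: dict[str, str]) -> list[str]:
    """Map retrieval hits back to parent text, deduplicating hits that
    share a parent while preserving rank order."""
    seen, context = set(), []
    for child_id in hit_ids:
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:
            seen.add(parent_id)
            context.append(parents[parent_id])
    return context
```

Retrieval matches against the small children; generation receives the larger parents.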
Embeddings & Vector Databases
Choosing models and storage for your use case
Embeddings are the bridge between human-readable text and machine-searchable vectors. An embedding model converts a text chunk into a dense vector of floating-point numbers that captures its semantic meaning. Vector databases store and index these vectors for fast similarity search at scale.
Your choice of embedding model determines the quality of your vector space — and therefore the quality of retrieval. Your choice of vector database determines the operational characteristics: latency, scale, cost, and features.
Never mix embedding models in the same index. If you embed documents with Model A and query with Model B, the vector spaces won't align and similarity scores become meaningless. When you change embedding models, you must re-embed your entire corpus. Version your indices to enable rollback.
Embedding Models Compared
OpenAI text-embedding-3-large
~$0.13 per 1M tokens
Strengths
Best general-purpose model. Matryoshka support allows dimension reduction without re-embedding.
Trade-offs
Proprietary, requires API calls, cost scales with volume.
Cohere embed-v3
~$0.10 per 1M tokens
Strengths
Strong multilingual support. Built-in input types (search_document, search_query) improve retrieval.
Trade-offs
Proprietary. Slightly behind OpenAI on English-only benchmarks.
BGE-large-en-v1.5 (BAAI)
Free (self-hosted)
Strengths
Open-source. Competitive with proprietary models. Can be self-hosted for zero marginal cost.
Trade-offs
English-only. Requires GPU for fast inference. Slightly lower quality than commercial options.
Jina embeddings-v3
~$0.02 per 1M tokens
Strengths
Supports 8K token input. Task-specific adapters for retrieval, classification, separation.
Trade-offs
Newer model, less battle-tested in production. Requires API or self-hosting.
Nomic embed-text-v1.5
Free (self-hosted)
Strengths
Fully open-source with open training data. Long context (8K tokens). Matryoshka support.
Trade-offs
Lower dimensions may reduce precision for very large indices.
Choosing Embedding Dimensions
Higher dimensions capture more nuance but cost more to store and search. Modern models with Matryoshka embeddings (OpenAI, Nomic) let you truncate vectors to smaller dimensions after generation without re-embedding. This enables a key optimization:
Start with full dimensions, measure quality, then reduce until you find the smallest dimension that meets your accuracy requirements.
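In practice, truncation is just slicing off the leading dimensions and re-normalizing so cosine similarity still behaves. A minimal sketch (the exact supported dimensions depend on the model provider):

```python
import math

def truncate_embedding(vec: list[float], dim: int) -> list[float]:
    """Truncate a Matryoshka embedding to its first `dim` dimensions and
    re-normalize to unit length so similarity scores remain comparable."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head
```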
Vector Databases Compared
Pinecone
Strengths
Fully managed, serverless option, excellent scaling, hybrid search support.
Best For
Teams that want zero infrastructure management. Production workloads at scale.
Considerations
Vendor lock-in. No self-hosted option. Cost scales with stored vectors.
Weaviate
Strengths
Native hybrid search (dense + BM25), modular architecture, GraphQL API, multi-tenancy.
Best For
Hybrid search use cases. Teams wanting open-source with managed cloud option.
Considerations
Higher memory usage than some alternatives. Learning curve for module system.
Chroma
Strengths
Simplest API. In-memory for development, persistent for production. Python-native.
Best For
Prototyping and development. Small to medium datasets. Python-first teams.
Considerations
Less mature for very large-scale production. Fewer enterprise features.
Qdrant
Strengths
Rust-based (fast). Rich filtering. Payload storage. gRPC + REST APIs.
Best For
Performance-critical applications. Complex filtering requirements.
Considerations
Smaller community than Pinecone/Weaviate. Newer managed cloud offering.
pgvector (PostgreSQL)
Strengths
Uses existing Postgres infrastructure. ACID transactions. Familiar SQL interface.
Best For
Teams already on PostgreSQL who want vector search without a new database.
Considerations
Slower than purpose-built vector DBs at scale. Limited to 2000 dimensions. HNSW index tuning required.
Distance Metrics
Distance metrics define how similarity is measured between vectors. The choice of metric affects both accuracy and performance.
Cosine Similarity
Measures the angle between two vectors, ignoring magnitude. Most common for text embeddings because document length doesn't affect similarity.
Euclidean Distance (L2)
Measures the straight-line distance between two points in vector space. Sensitive to magnitude — longer documents have larger vectors.
Dot Product (Inner Product)
Measures both angle and magnitude. Higher values indicate greater similarity. Fastest to compute but affected by vector magnitude.
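The three metrics are a few lines of arithmetic each. A plain-Python sketch (vector databases implement these over optimized indices, not loops):

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    """Inner product: sensitive to both angle and magnitude."""
    return sum(x * y for x, y in zip(a, b))

def euclidean(a: list[float], b: list[float]) -> float:
    """L2 distance: straight-line distance, sensitive to magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Angle only: scaling a vector does not change the score."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0
```

Note that for unit-normalized vectors, dot product and cosine similarity coincide, which is why many systems normalize embeddings once at indexing time and then use the cheaper dot product.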
Key Insight
The most common mistake teams make is choosing an embedding model based on generic benchmarks (MTEB leaderboard) without testing on their own data. A model that ranks #1 on general benchmarks may rank #5 on your domain-specific queries. Always benchmark 2-3 models on a representative sample of your actual queries before committing.
Retrieval Strategies
Hybrid search, reranking, query expansion
Retrieval is the most impactful stage of the RAG pipeline. If you retrieve the wrong documents, no amount of prompt engineering will fix the generated answer. There are three fundamental retrieval approaches, and the best production systems combine them.
Dense Retrieval (Vector Search)
Encode queries and documents into dense vectors using embedding models, then find nearest neighbors in vector space. The core of modern RAG systems.
How It Works
1. Embed the query into the same vector space as the documents
2. Use Approximate Nearest Neighbor (ANN) algorithms (HNSW, IVF) for fast search
3. Return the top-k documents ranked by cosine similarity or dot product
Strengths
Captures semantic meaning — 'car' matches 'automobile'. Handles paraphrases and conceptual queries well.
Weaknesses
Poor at exact keyword matching ('error code E-4012'). Requires quality embedding models. ANN search is approximate, not exact.
Sparse Retrieval (BM25)
Traditional keyword-based retrieval using term frequency statistics. BM25 is the modern standard — it weighs term frequency, document length, and inverse document frequency.
How It Works
1. Tokenize the query and documents into terms
2. Score each document based on term overlap (the BM25 formula)
3. Rank by score — exact keyword matches rank highest
Strengths
Excellent for exact terms: error codes, product IDs, names, acronyms. Fast, deterministic, explainable. No embedding model needed.
Weaknesses
No semantic understanding — 'car' doesn't match 'automobile'. Misses paraphrases and conceptual queries entirely.
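For intuition, here is a compact BM25 scorer in plain Python. It is a teaching sketch over pre-tokenized documents; production systems compute the same formula against an inverted index (e.g., Elasticsearch/Lucene), and the `k1`/`b` defaults are the commonly used values, not tuned ones.

```python
import math
from collections import Counter

def bm25_scores(query_terms: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized document against the query using BM25:
    term frequency, document length, and inverse document frequency."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    freqs = [Counter(d) for d in docs]
    scores = [0.0] * N
    for term in query_terms:
        df = sum(1 for f in freqs if term in f)  # docs containing the term
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        for i, f in enumerate(freqs):
            tf = f[term]
            dl = len(docs[i])
            # Saturating tf, normalized by document length relative to average
            scores[i] += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return scores
```

Note how an exact token like an error code dominates the score: rare terms get high IDF, which is exactly where dense retrieval is weakest.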
Hybrid Search
Combine dense and sparse retrieval to get the best of both worlds: semantic understanding from vectors AND exact keyword matching from BM25.
How It Works
1. Run dense retrieval (vector search) to get ranked list A
2. Run sparse retrieval (BM25) to get ranked list B
3. Fuse the two lists using Reciprocal Rank Fusion (RRF) or weighted scoring
4. Return the merged, fused results
Strengths
Consistently outperforms either approach alone in benchmarks. Catches both semantic and keyword matches. Handles diverse query types.
Weaknesses
Requires maintaining two indices (vector + inverted). Slightly higher latency. Need to tune fusion weights.
Reciprocal Rank Fusion (RRF)
RRF merges two ranked lists into a single list by scoring each document based on its rank in both lists. It's the standard fusion algorithm for hybrid search because it's simple, effective, and doesn't require score calibration between the two retrieval methods.
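RRF is only a few lines: each document scores the sum of 1/(k + rank) over every list it appears in, where k (conventionally 60) damps the influence of top ranks.

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc IDs. Documents appearing high in multiple
    lists accumulate the largest scores; no score calibration needed."""
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked #1 by dense search and #2 by BM25 outranks a document that appears in only one list, which is the behavior you want from fusion.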
Reranking: The Precision Pass
First-stage retrieval (vector search, BM25) is optimized for speed — scanning millions of documents in milliseconds. But speed comes at the cost of precision. A reranker is a cross-encoder model that reads the query and each document together, producing a much more accurate relevance score. The pattern: retrieve broadly (top-20), rerank for precision (top-5).
Cohere Rerank
APICommercial cross-encoder reranker. Supports 100+ languages. Simple API: pass query + documents, get reranked scores.
~25% precision improvement over vector search alone
ColBERT (v2)
Open-sourceLate interaction model — encodes query and document tokens separately, then computes fine-grained similarity. Faster than traditional cross-encoders.
Near cross-encoder quality at 100x the speed
BGE Reranker (BAAI)
Open-sourceOpen-source cross-encoder reranker. Multiple sizes (small, base, large) for speed/quality tradeoff. Can be self-hosted.
Competitive with commercial options on English benchmarks
FlashRank
Open-sourceUltra-lightweight reranker designed for low-latency applications. Under 100MB model size. Runs on CPU.
Lower quality but <10ms latency. Good for real-time applications.
Advanced Retrieval Techniques
Maximal Marginal Relevance (MMR)
Balances relevance with diversity. Without MMR, your top-5 results might be five near-identical chunks about the same subtopic, missing other relevant information.
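The MMR selection loop, sketched over precomputed similarities (a real system would compute these from embeddings; the 0.7 lambda is an assumed default that trades relevance against diversity):

```python
def mmr(query_sim: list[float], doc_sims: list[list[float]],
        lambda_param: float = 0.7, top_n: int = 5) -> list[int]:
    """Greedy Maximal Marginal Relevance.
    query_sim[i]   : similarity of doc i to the query.
    doc_sims[i][j] : similarity between docs i and j.
    Returns indices of selected docs, balancing relevance and novelty."""
    candidates = list(range(len(query_sim)))
    selected: list[int] = []
    while candidates and len(selected) < top_n:
        def mmr_score(i: int) -> float:
            # Penalize similarity to anything already selected
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lambda_param * query_sim[i] - (1 - lambda_param) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```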
Contextual Compression
After retrieval, extract only the relevant sentences from each chunk. A 500-token chunk might contain only 50 tokens of information relevant to the query.
Multi-Query Retrieval
Generate multiple reformulations of the user's query, retrieve for each, and merge results. Captures different aspects of the query's intent.
Parent Document Retrieval
Index small chunks for precise matching, but return the larger parent chunk (or full document section) for generation context.
Key Insight
The retrieval pipeline should be: hybrid search (cast a wide net) → reranking (sort by true relevance) → MMR diversity (avoid redundancy) → relevance threshold (drop irrelevant results). Each layer tightens precision. Skipping a layer is among the most common causes of poor RAG quality in production.
Agentic RAG Patterns
Self-RAG, corrective RAG, adaptive retrieval
Agentic RAG moves beyond the fixed retrieve-then-generate pipeline. Instead of blindly retrieving for every query, agentic patterns let the system decide when to retrieve, what to retrieve, and whether the retrieved context is good enough to generate a faithful answer. These patterns are the difference between demo-quality and production-quality RAG.
The Evolution of RAG
Naive RAG
Retrieve, stuff, generate
Advanced RAG
Hybrid search, reranking, filtering
Agentic RAG
Self-correcting, multi-step, adaptive
Self-RAG
Self-RAG trains the model to emit special reflection tokens that control the retrieval-generation loop. The model decides IF it needs retrieval, evaluates document relevance, generates with grounding, and self-assesses whether the answer is supported by the sources.
How It Works
1. Given a query, the model emits a [RETRIEVE] or [NO_RETRIEVE] token to decide if external knowledge is needed
2. If retrieving, the model evaluates each document with [RELEVANT] or [IRRELEVANT] tokens
3. The model generates a response using only the relevant documents
4. A [SUPPORTED] or [NOT_SUPPORTED] self-assessment token verifies grounding
5. If not supported, the pipeline retries with refined retrieval
Key Benefit
30-50% reduction in hallucination compared to naive RAG. The model learns when NOT to retrieve, avoiding unnecessary latency.
Trade-off
Requires fine-tuning the model to emit reflection tokens, or prompt engineering to simulate them. Adds 2-3 LLM calls per query.
Corrective RAG (CRAG)
CRAG adds a lightweight retrieval evaluator between the retrieval and generation stages. It scores document relevance and triggers one of three actions: use the documents (correct), refine the knowledge (ambiguous), or fall back to web search (incorrect).
How It Works
1. Retrieve documents using the standard RAG pipeline
2. A retrieval evaluator scores each document's relevance to the query
3. If confidence is HIGH: use the documents directly for generation
4. If confidence is MEDIUM: extract key sentences via knowledge refinement
5. If confidence is LOW: discard the documents and fall back to web search
Key Benefit
Prevents the model from generating answers based on irrelevant context. The web search fallback handles knowledge gaps gracefully.
Trade-off
Adds latency for the evaluation step. Web search fallback requires internet access and introduces external dependency.
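The corrective branch is essentially a three-way switch on evaluator confidence. A control-flow sketch, where `evaluate`, `refine`, and `web_search` are assumed callables (not a real library API) and the thresholds are illustrative:

```python
def corrective_rag(query: str, documents: list[str], evaluate, refine, web_search,
                   high: float = 0.7, low: float = 0.3):
    """Route on the evaluator's confidence:
    CORRECT -> use docs, AMBIGUOUS -> refine, INCORRECT -> web fallback."""
    scores = [evaluate(query, doc) for doc in documents]
    confidence = max(scores, default=0.0)
    if confidence >= high:
        return documents                              # CORRECT: use as-is
    if confidence >= low:
        # AMBIGUOUS: keep only refined key sentences from plausible docs
        return [refine(query, doc) for doc, s in zip(documents, scores) if s >= low]
    return web_search(query)                          # INCORRECT: fall back
```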
Adaptive RAG
Adaptive RAG classifies incoming queries by complexity and routes them to different retrieval strategies. Simple factual questions skip retrieval entirely, moderate questions use single-step RAG, and complex questions trigger multi-step agentic retrieval.
How It Works
1. A query complexity classifier categorizes the question (simple / moderate / complex)
2. Simple queries (e.g., 'What year was X founded?') go directly to the LLM — no retrieval needed
3. Moderate queries use standard single-step RAG with reranking
4. Complex queries trigger iterative multi-step retrieval with query decomposition
5. The system dynamically allocates compute based on query difficulty
Key Benefit
Reduces average latency by 40-60% by avoiding unnecessary retrieval for simple questions. Allocates more compute to questions that need it.
Trade-off
Requires training or prompt-engineering a reliable query classifier. Misclassification can send complex queries down the simple path.
Agentic RAG (Retrieval as a Tool)
The agent treats retrieval as a tool it can call. Instead of a fixed pipeline, the LLM decides when to search, what to search for, and when it has enough information to answer. The retriever is one tool among many.
How It Works
1. The agent has access to tools: vector_search, web_search, sql_query, knowledge_graph, etc.
2. Given a query, the agent reasons about what information it needs
3. It calls the appropriate retrieval tool with a refined search query
4. It evaluates the results and decides if more retrieval is needed
5. It can chain multiple retrieval calls, combining results before answering
Key Benefit
Maximum flexibility — the agent can combine multiple data sources, refine queries iteratively, and handle novel question types without pipeline changes.
Trade-off
More expensive (multiple LLM calls). Harder to debug and evaluate. Requires good tool descriptions and reliable function calling.
Pattern Comparison
| Pattern | Retrieval | Evaluation | Correction | Latency | Quality |
|---|---|---|---|---|---|
| Naive RAG | Always, single-step | None | None | Low | Low-Medium |
| Self-RAG | Conditional (model decides) | Relevance + faithfulness tokens | Retry with rephrased query | Medium-High | High |
| CRAG | Always, single-step | External evaluator | Web search fallback | Medium | High |
| Adaptive RAG | Complexity-based routing | Query classifier | Route to correct pipeline | Low-High (varies) | High |
| Agentic RAG | Multi-step, tool-based | Agent reasoning | Iterative refinement | High | Highest |
Key Insight
Start with naive RAG, measure with RAGAS, and add agentic layers only when evaluation shows they're needed. Self-RAG is worth the complexity for high-stakes applications (medical, legal, financial). CRAG is the best bang-for-buck upgrade for most production systems. Adaptive RAG pays off when your query distribution has high variance in complexity.
Query Routing & Planning
Multi-step retrieval with query decomposition
Not every query should hit the same retrieval pipeline. Query routing directs queries to the right retrieval strategy, while query decomposition breaks complex questions into focused sub-queries for better retrieval. Together, they form the “planning” layer of agentic RAG.
Routing Patterns
LLM Query Classification
Classify the incoming query into categories and route each category to a specialized retrieval pipeline. The simplest and most reliable routing approach.
How It Works
1. Define query categories: factual, analytical, comparison, how-to, opinion
2. Use an LLM or a lightweight classifier to categorize the incoming query
3. Route each category to its optimal retrieval strategy
4. Factual queries use vector search; analytical queries use SQL; comparisons use multi-query retrieval
Data Source Routing
When your knowledge spans multiple backends (vector store, SQL database, knowledge graph, API), route the query to the appropriate data source based on the type of information needed.
How It Works
1. Register the available data sources with their descriptions and capabilities
2. An LLM determines which data source(s) are needed for the query
3. Route to one or more backends in parallel
4. Merge results from multiple sources into a unified context
Semantic Routing
Use embeddings to route queries without an LLM call. Pre-compute embeddings for example queries in each route, then match incoming queries to the nearest route by cosine similarity.
How It Works
1. Define routes with 5-10 example utterances each
2. Pre-compute embeddings for all example utterances
3. When a query arrives, embed it and find the nearest route centroid
4. Route to the matched pipeline — no LLM call needed, sub-10ms routing
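A semantic router reduces to centroid matching. A minimal sketch, where `embed` is an assumed callable (any embedding model returning a vector per string) and the route names are illustrative:

```python
import math

def _cos(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticRouter:
    """Route queries to the nearest route centroid. Embeddings for the
    example utterances are computed once up front; routing a query costs
    one embedding call and a handful of similarity comparisons."""
    def __init__(self, embed, routes: dict[str, list[str]]):
        self.embed = embed
        self.centroids: dict[str, list[float]] = {}
        for name, utterances in routes.items():
            vecs = [embed(u) for u in utterances]
            dim = len(vecs[0])
            # Centroid = element-wise mean of the example embeddings
            self.centroids[name] = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

    def route(self, query: str) -> str:
        q = self.embed(query)
        return max(self.centroids, key=lambda name: _cos(q, self.centroids[name]))
```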
Query Decomposition Strategies
A single embedding cannot capture all dimensions of a complex, multi-faceted query. Decomposition breaks queries into focused sub-queries, each producing a targeted embedding that matches the right documents.
Sequential Decomposition
Break a complex query into a chain of sub-queries where each step depends on the previous result. Best for queries with logical dependencies.
Parallel Decomposition
Break a query into independent sub-queries that can be retrieved in parallel, then merge results. Best for comparison and multi-faceted questions.
Step-Back Prompting
Instead of directly retrieving for a specific query, first generate a more general (abstracted) query, retrieve for that, then use the broader context to answer the specific question.
HyDE (Hypothetical Document Embeddings)
Generate a hypothetical answer to the query, embed that answer instead of the query, and use it for retrieval. The hypothetical answer is closer in embedding space to the actual documents.
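The HyDE flow is a short indirection before search. A sketch assuming three injected callables, all hypothetical here: `generate(prompt)` returns LLM text, `embed(text)` returns a vector, and `search(vector, k)` queries your vector index.

```python
def hyde_retrieve(query: str, generate, embed, search, k: int = 5):
    """Hypothetical Document Embeddings: embed a generated answer
    instead of the raw query, then search with that vector."""
    prompt = f"Write a short passage that answers the question:\n{query}"
    hypothetical_answer = generate(prompt)
    # The hypothetical answer sits closer in embedding space to real
    # answer-bearing documents than the (often terse) query does.
    vector = embed(hypothetical_answer)
    return search(vector, k)
```

The trade-off: one extra LLM call per query, in exchange for query vectors that look like documents rather than questions.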
The Full Multi-Step Retrieval Pipeline
In a production agentic RAG system, query routing and decomposition are just two steps in a six-stage pipeline. Each stage transforms the query or its results to maximize answer quality.
Query Analysis
Classify complexity, identify intent, detect entities
Route Selection
Pick the right retrieval strategy and data sources
Query Transformation
Decompose, expand, or rephrase for better retrieval
Parallel Retrieval
Execute sub-queries across selected data sources
Result Fusion
Merge, deduplicate, rerank, and filter results
Generation
Generate answer with faithfulness self-check
Key Insight
Query routing and decomposition have the highest ROI for complex, multi-faceted questions. If your queries are simple factual lookups, they add unnecessary latency. Profile your query distribution first: if more than 30% of queries are multi-hop or comparative, invest in query planning. For the rest, a semantic router (no LLM call) is sufficient.
Evaluating RAG Systems
Faithfulness, relevance, and answer quality
You cannot improve what you cannot measure. RAG evaluation requires separating retrieval quality from generation quality — a wrong answer could stem from bad retrieval (wrong documents) or bad generation (hallucination despite good documents). The RAGAS framework provides four metrics that isolate each failure mode.
Why “Vibes-Based” Evaluation Fails
Manual spot-checking creates a dangerous illusion of quality. A change that improves 10 queries but silently breaks 50 others goes undetected. Without systematic evaluation, you cannot tell if a new chunking strategy, embedding model, or prompt change actually improved your system. Evaluation is not optional — it is the foundation of reliable RAG.
The Four RAGAS Metrics
RAGAS (Retrieval Augmented Generation Assessment) decomposes RAG quality into four orthogonal dimensions. Together, they tell you exactly where your system is failing.
Faithfulness
How It's Computed
Decompose the answer into atomic claims. For each claim, check if it can be inferred from the provided context. Score = (supported claims) / (total claims).
What It Catches
Hallucination — the model generating facts not present in any retrieved document.
Diagnostic Signal
Low faithfulness with high context recall = generation problem. The model has the right context but fabricates beyond it.
Answer Relevancy
How It's Computed
Generate N questions from the answer using an LLM. Compute the mean cosine similarity between these generated questions and the original question. Higher similarity = more relevant answer.
What It Catches
Off-topic answers — the model using retrieved context to generate an answer about something adjacent but not what was asked.
Diagnostic Signal
Low relevancy with high faithfulness = the retrieved context was relevant to a different aspect than what the user asked about.
Context Precision
How It's Computed
For each retrieved document, check if it is relevant to the ground truth answer. Weight by rank position — relevant documents ranked higher contribute more to the score.
What It Catches
Noisy retrieval — irrelevant documents ranked above relevant ones, wasting context window space.
Diagnostic Signal
Low precision = retrieval is returning too many irrelevant documents. Improve your reranking or relevance thresholds.
Context Recall
How It's Computed
Decompose the ground truth answer into claims. For each claim, check if it can be attributed to any retrieved document. Score = (attributed claims) / (total ground truth claims).
What It Catches
Missing retrieval — relevant documents exist in the index but were not retrieved for this query.
Diagnostic Signal
Low recall with high precision = you are retrieving well but not enough. Increase topK or use query expansion.
Diagnostic Matrix: Where Is the Problem?
| Symptom | Metrics | Fix |
|---|---|---|
| Hallucinated facts | Low faithfulness, high context recall | Improve generation prompt. Add citation requirements. Use a more capable model. |
| Off-topic answers | Low relevancy, high faithfulness | Improve query transformation or retrieval strategy. The model answers what it has, not what was asked. |
| Missing information | Low context recall, high precision | Increase topK. Use query expansion or decomposition. Check if documents exist in the index. |
| Noisy retrieval | Low context precision, high recall | Add reranking. Raise relevance thresholds. Improve chunking quality. |
| Total failure | Low across all metrics | Fundamental pipeline issue. Check embedding model, chunking strategy, and index freshness. |
The RAG Evaluation Pipeline
Build a Golden Dataset
Create 50-200 question-answer pairs with ground truth answers. Include easy, medium, and hard questions. Cover edge cases and common failure modes.
- Start with real user questions from production logs
- Have domain experts write ground truth answers
- Include questions with no answer in the corpus (to test abstention)
- Version your dataset and grow it over time
Run the RAG Pipeline
For each question in the golden dataset, run your full RAG pipeline and capture: the generated answer, retrieved context documents, and any intermediate outputs.
- Log the full pipeline trace (query, retrieved docs, reranked docs, answer)
- Capture latency and token usage per step
- Run in a reproducible environment (same model version, same index)
Score with RAGAS Metrics
Compute faithfulness, answer relevancy, context precision, and context recall for each question. Aggregate across the full dataset.
- Set target thresholds: faithfulness > 0.85, relevancy > 0.80
- Break down scores by question category to find systematic weaknesses
- Track metrics over time to detect regressions
Diagnose and Iterate
Use the metric breakdown to identify whether problems are in retrieval or generation. Fix the weakest link first.
- Low context precision/recall = fix retrieval (chunking, embeddings, search)
- Low faithfulness = fix generation (prompt, model, or add citation requirements)
- Low relevancy = fix query transformation or system prompt
Evaluation Frameworks
TruLens
Open-source RAG evaluation with feedback functions for groundedness, relevance, and toxicity. Integrates with LlamaIndex and LangChain.
DeepEval
Pytest-like framework for LLM evaluation. Supports RAG-specific metrics, unit tests for LLM outputs, and CI/CD integration.
Phoenix (Arize)
Observability platform with RAG-specific tracing, evaluation, and debugging. Visualize retrieval quality and identify failure patterns.
promptfoo
CLI-based eval framework. Define test cases in YAML, run against multiple RAG configurations, compare results side-by-side.
Key Insight
The single most impactful thing you can do for your RAG system is build a golden evaluation dataset of 50-100 questions with ground truth answers. Run RAGAS metrics after every change. Set minimum thresholds (faithfulness > 0.85) and block deployments that regress. This is the RAG equivalent of unit tests — and like unit tests, the earlier you start, the more pain you avoid.
Production RAG Architecture
Scaling, caching, and real-time indexing
The gap between a RAG demo and a production RAG system is operational engineering: caching, indexing pipelines, monitoring, cost optimization, and scaling. A system that works on 100 test queries must work reliably on 100,000 diverse production queries per day — with predictable latency, cost, and quality.
Cache answers for semantically similar queries. Common questions get asked hundreds of times daily in production — semantic caching catches paraphrases that exact-match caching misses.
How It Works
- 1.Embed the incoming query
- 2.Search the cache index for queries with similarity > 0.95
- 3.If a cache hit with valid TTL exists, return the cached answer
- 4.On cache miss, run the full RAG pipeline
- 5.Store the query embedding, answer, and sources in the cache with a TTL
Impact
60-80% cost reduction in production. Reduces average latency from 2-5s to <100ms for cache hits.
Consideration
Set appropriate TTL (1-24 hours). Invalidate cache when source documents change. Monitor cache hit rate.
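The five steps above fit in a small in-memory sketch. Assumptions: embeddings come from an injected function (your embedding API in production), similarity is cosine, and the cache is a flat array rather than a real ANN index:

```typescript
// Minimal in-memory semantic cache implementing the five steps above.
type Embed = (text: string) => number[];

interface CacheEntry {
  vector: number[];
  answer: string;
  expiresAt: number;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

class SemanticCache {
  private entries: CacheEntry[] = [];
  private embed: Embed;
  private threshold: number;
  private ttlMs: number;

  constructor(embed: Embed, threshold = 0.95, ttlMs = 3_600_000) {
    this.embed = embed;
    this.threshold = threshold; // similarity required for a hit
    this.ttlMs = ttlMs;         // default TTL: 1 hour
  }

  get(query: string): string | null {
    const v = this.embed(query);                                  // 1. embed query
    const now = Date.now();
    this.entries = this.entries.filter((e) => e.expiresAt > now); // 3. drop expired
    const hit = this.entries.find(                                // 2. similarity search
      (e) => cosine(v, e.vector) >= this.threshold,
    );
    return hit ? hit.answer : null;     // 4. null means: run the full pipeline
  }

  set(query: string, answer: string): void {
    this.entries.push({                 // 5. store embedding, answer, and TTL
      vector: this.embed(query),
      answer,
      expiresAt: Date.now() + this.ttlMs,
    });
  }
}
```

For cache invalidation on source changes, clear or filter `entries` when the indexing pipeline reports updated documents.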
Instead of re-indexing the entire corpus when documents change, detect changes and update only the affected vectors. Keeps the index fresh without full rebuilds.
How It Works
- 1.Hash each document's content at ingestion time
- 2.On update, compare content hashes to detect which documents changed
- 3.Delete old vectors for changed documents
- 4.Re-chunk, re-embed, and upsert only the changed documents
- 5.Handle deletions by removing vectors for deleted source documents
Impact
Index updates go from hours (full re-index) to minutes (incremental). Source freshness improves from daily to near real-time.
Consideration
Maintain a document-to-vector mapping for deletion. Use content hashing, not timestamps, to detect changes.
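Steps 1 and 2 can be sketched as a content-hash diff. Assumes the previous hash map is persisted alongside the index; `Doc` and the function names are illustrative:

```typescript
import { createHash } from "node:crypto";

// Content-hash diff for incremental indexing. `previous` is the
// docId -> hash map saved from the last indexing run.
interface Doc {
  id: string;
  text: string;
}

const contentHash = (text: string): string =>
  createHash("sha256").update(text).digest("hex");

function diffCorpus(
  previous: Map<string, string>,
  current: Doc[],
): { changed: Doc[]; deleted: string[] } {
  // Changed or new: hash differs from (or is absent in) the last run
  const changed = current.filter(
    (d) => previous.get(d.id) !== contentHash(d.text),
  );
  // Deleted: indexed previously but no longer in the source corpus
  const currentIds = new Set(current.map((d) => d.id));
  const deleted = [...previous.keys()].filter((id) => !currentIds.has(id));
  return { changed, deleted };
}
```

Only `changed` documents get re-chunked, re-embedded, and upserted; `deleted` ids drive vector removal through the document-to-vector mapping.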
Use multiple retrieval tiers: a fast, approximate first pass (HNSW) followed by an exact re-scoring pass. For extremely large indices, add a pre-filter stage using metadata.
How It Works
- 1.Tier 1: Metadata pre-filter (namespace, date range, document type) — milliseconds
- 2.Tier 2: ANN search (HNSW) on filtered subset — 10-50ms for millions of vectors
- 3.Tier 3: Cross-encoder reranking on top-20 results — 50-200ms
- 4.Tier 4: Contextual compression to extract relevant sentences — 100-500ms
Impact
Handles 100M+ vector indices with sub-second latency. Each tier narrows the search space for the next.
Consideration
Profile each tier's latency. Set timeouts. Use async/parallel where possible. The reranking tier is usually the bottleneck.
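A sketch of the tiers as one pipeline, with brute-force cosine standing in for the HNSW ANN search and an injected scoring function standing in for the cross-encoder (tier 4, compression, is omitted):

```typescript
// Tiered retrieval: metadata filter -> vector search -> rerank.
// In production, tier 2 is an ANN index query, not a full scan.
interface Chunk {
  text: string;
  vector: number[];
  meta: { docType: string; updated: string };
}
type Rerank = (query: string, chunk: Chunk) => number;

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

function tieredRetrieve(
  query: string,
  queryVector: number[],
  index: Chunk[],
  filter: (c: Chunk) => boolean,
  rerank: Rerank,
  annK = 20,  // survivors of the vector-search tier
  finalK = 5, // survivors of the reranking tier
): Chunk[] {
  // Tier 1: metadata pre-filter
  const filtered = index.filter(filter);
  // Tier 2: vector similarity (ANN stand-in), keep top annK
  const candidates = filtered
    .map((c) => ({ c, s: cosine(queryVector, c.vector) }))
    .sort((x, y) => y.s - x.s)
    .slice(0, annK)
    .map((x) => x.c);
  // Tier 3: cross-encoder-style rerank, keep top finalK
  return candidates
    .map((c) => ({ c, s: rerank(query, c) }))
    .sort((x, y) => y.s - x.s)
    .slice(0, finalK)
    .map((x) => x.c);
}
```

Each tier narrows the candidate set for the next, which is exactly why per-tier latency profiling matters.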
Production Indexing Architecture
A production RAG system is not just a vector database. It is a pipeline of components that must work together reliably.
Document Watcher
Monitors source systems (S3, databases, wikis, git repos) for changes. Emits events for created, updated, and deleted documents.
Chunking Pipeline
Receives raw documents, applies format-specific parsing, runs the chunking strategy, and enriches chunks with metadata.
Embedding Service
Batches chunks and generates embeddings. Rate-limited to respect API quotas. Supports multiple embedding models for A/B testing.
Vector Store
Stores vectors with metadata, supports ANN search, handles upserts and deletions. The primary retrieval backend.
Cache Layer
Semantic cache for query-answer pairs. Exact cache for frequent identical queries. Invalidated on source document changes.
Monitoring & Alerts
Tracks indexing lag, retrieval latency (p50/p95/p99), cache hit rate, embedding costs, and RAGAS metric trends.
Operational Metrics to Track
| Metric | Target | Why |
|---|---|---|
| Retrieval Latency (p50) | < 100ms | User-perceived speed. p50 represents the typical experience. |
| Retrieval Latency (p99) | < 500ms | Tail latency. Worst-case user experience. Set alerts here. |
| Cache Hit Rate | > 40% | Higher = more cost savings. Below 20% means caching isn't helping. |
| Index Freshness | < 15 min lag | Time between source document change and index update. |
| Faithfulness Score | > 0.85 | RAGAS faithfulness on production samples. Below 0.8 = hallucination risk. |
| Context Precision | > 0.75 | Are retrieved documents actually relevant? Tracks retrieval quality. |
| Embedding Cost / 1K queries | Budget-dependent | Track cost per query to detect inefficiencies or budget overruns. |
| Error Rate | < 0.1% | Failed retrievals, timeouts, or pipeline errors. |
Cost Optimization Strategies
Reduce embedding dimensions
50-92% storage savings. Use Matryoshka embeddings to reduce from 3072d to 1024d or 256d with minimal quality loss. Measure before committing.
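A sketch of the truncation itself; this is only valid for models trained with Matryoshka representation learning, so verify your embedding model supports it before committing:

```typescript
// Matryoshka-style dimension reduction: keep the leading dimensions,
// then re-normalize so cosine similarity remains well-behaved.
function truncateEmbedding(v: number[], dims: number): number[] {
  const head = v.slice(0, dims);
  const norm = Math.sqrt(head.reduce((s, x) => s + x * x, 0)) || 1;
  return head.map((x) => x / norm);
}
```

The truncated vectors must be used consistently: re-index with the same dimensionality you query with.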
Semantic caching
60-80% LLM cost savings. Cache answers for similar queries. The highest-ROI optimization for production RAG systems with repeated query patterns.
Batch embedding
30-50% embedding cost savings. Batch chunks during indexing instead of embedding one at a time. Most APIs offer batch discounts or higher throughput.
Tiered models
40-70% LLM cost savings. Use a small, fast model (GPT-4o-mini, Claude Haiku) for simple queries and route complex queries to larger models.
Contextual compression
30-60% token cost savings. Extract only relevant sentences from retrieved chunks before sending to the LLM. Reduces input tokens significantly.
Key Insight
The three highest-ROI production investments are: (1) semantic caching — reduces cost and latency for repeated queries, (2) incremental indexing — keeps your knowledge base fresh, and (3) RAGAS monitoring — catches quality regressions before users do. Start with these three before optimizing anything else.
Advanced Patterns
Graph RAG, multimodal RAG, conversational RAG
Beyond standard agentic RAG, advanced patterns tackle specialized challenges: retrieving over knowledge graphs, handling images and tables, maintaining multi-turn conversations, and building hierarchical document representations. These patterns address the long tail of RAG failures that basic pipelines cannot solve.
Graph RAG
Graph RAG (Microsoft Research, 2024) builds a knowledge graph from your documents and uses graph-based retrieval alongside vector search. It excels at global, thematic questions that require synthesizing information across many documents — a task where standard RAG fails.
How It Works
- 1.Extract entities and relationships from documents using an LLM
- 2.Build a knowledge graph connecting entities across the entire corpus
- 3.Detect communities (clusters of related entities) using graph algorithms
- 4.Generate summaries for each community at multiple levels of abstraction
- 5.At query time, traverse the graph to retrieve relevant community summaries
Strengths
- + Answers global questions ('What are the main themes across all documents?')
- + Captures relationships that vector search misses
- + Provides multi-hop reasoning paths
- + Community summaries reduce context window usage
Weaknesses
- - Expensive to build — requires many LLM calls for entity extraction
- - Graph construction is slow for large corpora
- - Requires maintenance as documents change
- - Overkill for simple factual QA
RAPTOR: Hierarchical Retrieval
RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) builds a hierarchical tree of document summaries. Leaf nodes are individual chunks, intermediate nodes are cluster summaries, and the root captures the full corpus theme.
| Level | Content | Granularity | Best For |
|---|---|---|---|
| Leaf | Individual document chunks (original text) | Most specific | Detailed factual questions |
| Cluster | Summaries of related chunks (5-10 chunks per cluster) | Medium | Topic-level questions |
| Section | Summaries of related clusters | Broad | Thematic questions |
| Root | Summary of the entire corpus | Broadest | Global overview questions |
Multimodal RAG
Multimodal RAG extends retrieval to images, tables, charts, and diagrams alongside text. Critical for domains where knowledge is encoded visually — technical documentation, medical imaging, financial reports with charts, and slide decks.
Approaches (from simple to sophisticated)
Text-Only Extraction
Convert images/tables to text descriptions, embed as text. Simple but lossy — complex diagrams lose information.
Multimodal Embeddings
Use models like CLIP or Jina CLIP to embed images and text into the same vector space. Query with text, retrieve images.
Vision LLM Summarization
Use GPT-4V or Claude to generate detailed descriptions of images/charts. Embed these rich descriptions alongside source images.
Native Multimodal RAG
Store images as-is and pass them directly to a multimodal LLM (GPT-4V, Claude) alongside text chunks in the context window.
Conversational RAG
Conversational RAG handles multi-turn conversations where follow-up questions reference previous context. The key challenge: 'Tell me more about that' requires resolving 'that' to the subject from the previous turn before retrieval.
Key Challenges
Coreference Resolution
'What about their pricing?' — 'their' refers to a company mentioned 3 turns ago. The retrieval query must resolve the reference before embedding.
Solution
Rewrite the query to be standalone: 'What is [Company X] pricing?' using the conversation history.
Context Accumulation
Over multiple turns, the user builds up a complex information need. The final query only makes sense in context of the full conversation.
Solution
Maintain a running conversation summary. Use it to enrich each query with accumulated context.
Topic Drift
The user gradually shifts topics. The retrieval system must detect when the topic changes and avoid retrieving context from the old topic.
Solution
Track topic boundaries in the conversation. Reset retrieval context when a new topic begins.
Conversational RAG Pipeline
- 1.Receive follow-up question in conversation context
- 2.Rewrite the question as a standalone query using conversation history
- 3.Retrieve using the standalone query (not the original follow-up)
- 4.Generate answer with both retrieved context and conversation history
- 5.Update conversation summary for the next turn
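Step 2 is the heart of the pipeline. A sketch with the LLM injected as a plain function so the orchestration is testable; production calls would be async, and the prompt wording is illustrative:

```typescript
// Standalone-query rewriting for conversational RAG (step 2 above).
type Llm = (prompt: string) => string;

interface Turn {
  role: "user" | "assistant";
  content: string;
}

function rewriteToStandalone(
  followUp: string,
  history: Turn[],
  llm: Llm,
): string {
  if (history.length === 0) return followUp; // first turn is already standalone
  const transcript = history
    .map((t) => `${t.role}: ${t.content}`)
    .join("\n");
  const prompt =
    "Given the conversation below, rewrite the final question so it can " +
    "be understood with no prior context. Resolve all pronouns and " +
    "references to earlier turns.\n\n" +
    `${transcript}\n\nFinal question: ${followUp}\n\nStandalone question:`;
  return llm(prompt).trim();
}
```

The rewritten query is what gets embedded and retrieved; the original follow-up and history still go to the generation step.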
Emerging Patterns
Late Chunking
Emerging: Embed the full document first using a long-context embedding model, then pool the token-level embeddings into chunk-level representations. Each chunk's embedding is contextualized by the full document.
Speculative RAG
Research: Generate multiple draft answers from different subsets of retrieved documents using a smaller model. A larger model then verifies and selects the best draft. Reduces hallucination while managing cost.
RAG + Fine-Tuning Hybrid
Production: Fine-tune a model to be a better RAG consumer: follow citations more strictly, abstain when context is insufficient, and maintain output format. The fine-tuning improves RAG behavior, not knowledge.
Agentic Indexing
Emerging: Use an LLM agent to actively manage the index: identify knowledge gaps, suggest new documents to ingest, detect outdated content, and generate synthetic QA pairs for under-covered topics.
Key Insight
Advanced patterns solve real problems, but they add complexity. Graph RAG is worth it when you need cross-document synthesis (enterprise knowledge bases, research corpora). Multimodal RAG is essential when your knowledge includes images, charts, or tables. Conversational RAG is required for any chatbot. In every case, start with the simplest pattern that meets your needs and upgrade based on evaluation metrics, not intuition.
Anti-Patterns & Failure Modes
The most common RAG mistakes and how to fix them
Knowing what not to do is as important as knowing what to do. These are the most common RAG failure modes, distilled from production systems, research papers, and the RAG community. Each pattern has been observed to cause significant quality degradation in real deployments.
Retrieved chunks are fragments of different documents mashed together without coherence, losing the logical flow of information.
Cause
Fixed-size character splitting that breaks mid-sentence, no overlap between chunks, no metadata preserving document structure or section hierarchy.
Symptom
Answers contain contradictory statements from different sources mixed together. The model stitches together unrelated fragments into plausible-sounding but incorrect answers.
Fix
Use recursive or semantic chunking that respects document boundaries. Add 10-20% overlap. Preserve parent document metadata and section headers in each chunk. Consider document-level summaries alongside chunk-level retrieval.
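A minimal sliding-window chunker with the recommended overlap, using words as a stand-in for tokens to keep the sketch dependency-free:

```typescript
// Sliding-window chunking with overlap. Swap the word split for a real
// tokenizer in practice; the 100/500 defaults give ~20% overlap.
function chunkWithOverlap(
  text: string,
  chunkSize = 500, // words per chunk
  overlap = 100,   // words shared with the previous chunk
): string[] {
  if (overlap >= chunkSize) throw new Error("overlap must be < chunkSize");
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // last window reached the end
  }
  return chunks;
}
```

Recursive and semantic chunkers refine this further by preferring to break at paragraph and sentence boundaries rather than fixed offsets.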
The embedding model's training domain doesn't match your document domain, causing poor semantic similarity scores for relevant documents.
Cause
Using a general-purpose embedding model (trained on web text) for specialized domains like medical, legal, or financial documents. The model's vector space doesn't capture domain-specific semantic relationships.
Symptom
Retrieval returns semantically adjacent but factually wrong documents. Medical queries about 'hypertension treatment' retrieve docs about 'stress management' because the general model conflates the concepts.
Fix
Benchmark multiple embedding models on YOUR data before committing. Use domain-specific models when available (e.g., PubMedBERT for medical). Fine-tune embeddings on domain-specific pairs. Test with MTEB or custom eval sets.
The model generates confident answers that appear grounded in retrieved context but actually fabricate claims not present in any source document.
Cause
No faithfulness checking between the generated answer and retrieved sources. The model fills gaps in retrieved context with plausible-sounding but fabricated information, especially when context is partially relevant.
Symptom
Answers contain specific numbers, dates, or claims that sound authoritative but don't appear in any retrieved document. Users trust these answers because they're in a 'RAG system' that should be grounded.
Fix
Implement faithfulness scoring (RAGAS). Decompose answers into atomic claims and verify each against source documents. Instruct the model to say 'I don't have enough information' when context is insufficient. Add citation requirements.
The vector index contains outdated, duplicate, or irrelevant documents that dilute retrieval quality and increase costs.
Cause
No document lifecycle management. Old versions of documents coexist with new versions. Duplicate content from multiple ingestion runs. No garbage collection for deleted source documents.
Symptom
Retrieval returns outdated information alongside current data. Answers reference deprecated policies, old product features, or superseded documentation. Index costs grow linearly while quality degrades.
Fix
Implement document versioning with metadata filters. Use content hashing to prevent duplicate ingestion. Build a deletion pipeline that removes vectors when source documents are updated or removed. Schedule periodic index audits.
Sending user queries directly to the vector search without any transformation, assuming the user's phrasing will match document phrasing.
Cause
No query preprocessing, expansion, or decomposition. Users ask questions in natural language ('why is my bill so high?') while documents use formal language ('billing adjustment procedures').
Symptom
Poor retrieval recall — relevant documents exist in the index but aren't retrieved because the user's language doesn't semantically match the document language. Users report 'the system doesn't know things it should.'
Fix
Implement query transformation: HyDE (generate a hypothetical document, embed that instead), query expansion (add synonyms/related terms), query decomposition for complex questions, or step-back prompting to generalize specific queries.
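HyDE in particular is simple to wire up: generate a hypothetical answer document, then embed that instead of the raw query. A sketch with the LLM and embedder injected as plain functions (production calls would be async; the prompt wording is illustrative):

```typescript
// HyDE sketch: embed a hypothetical answer document, not the raw query.
type Llm = (prompt: string) => string;
type Embed = (text: string) => number[];

function hydeQueryVector(query: string, llm: Llm, embed: Embed): number[] {
  const hypothetical = llm(
    "Write a short passage that would answer this question, phrased " +
      `as it might appear in formal documentation:\n\n${query}`,
  );
  // Embedding the hypothetical document moves the query into "document
  // space", where it matches formal phrasing better than user phrasing.
  return embed(hypothetical);
}
```

For the billing example above, the hypothetical passage would read like the 'billing adjustment procedures' doc, so its embedding lands near the right documents even though the user's question never used those words.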
Retrieving too many chunks and stuffing them all into the context, exceeding the model's effective attention span even if within token limits.
Cause
Setting topK too high (20+) without reranking or relevance filtering. Assuming more context is better. Not understanding the 'lost in the middle' phenomenon where models ignore information in the middle of long contexts.
Symptom
Model ignores key information buried among irrelevant chunks. Answer quality actually degrades as you add more retrieved documents. Performance is worse with 20 chunks than with 5 well-chosen ones.
Fix
Retrieve broadly (top-20), then rerank to top-5. Set minimum similarity thresholds. Use contextual compression to extract only the relevant sentences from each chunk. Test answer quality at different topK values to find the optimum.
Interactive Code Examples
Naive vs. production RAG patterns side by side
See RAG patterns in action. Each example shows a naive approach and its production-grade fix. Toggle between them to understand the difference.
The difference between dumping documents and structured retrieval
// BAD: Dump all documents into context
async function askQuestion(question: string) {
  const allDocs = await db.collection("docs").find({}).toArray();
  const response = await llm.generate({
    system: "Answer the question using these docs.",
    messages: [
      {
        role: "user",
        content: `Docs: ${allDocs.map((d) => d.text).join("\n")}
Question: ${question}`,
      },
    ],
  });
  return response;
}

Why this fails
Dumping all documents into the context wastes tokens, causes context rot, and often exceeds window limits. The model can't distinguish relevant from irrelevant content, leading to hallucination and degraded answers.
All Examples Quick Reference
Naive RAG vs. Structured RAG
The difference between dumping documents and structured retrieval
Chunking: Fixed vs. Semantic
How you split documents determines retrieval quality
Dense-Only vs. Hybrid Search
Why vector search alone misses keyword-dependent queries
One-Shot RAG vs. Self-RAG
Letting the agent decide when and what to retrieve
No Evaluation vs. RAGAS Evaluation
How to measure if your RAG system actually works
Single Query vs. Query Decomposition
Complex questions need to be broken into sub-queries
No Cache vs. Semantic Caching
Avoid redundant embedding and retrieval calls
Best Practices Checklist
Production-ready guidelines for every RAG stage
Production-ready guidelines distilled from LlamaIndex, LangChain, Pinecone, Weaviate, and the broader RAG research community. These are the patterns that separate demo-quality RAG from systems that work reliably at scale.
Match chunk size to your use case
QA tasks work best with 256-512 token chunks. Summarization needs 1024-2048. Test chunk sizes on your actual queries — there is no universal optimum.
Always use overlap between chunks
10-20% overlap (e.g., 100 tokens for 500-token chunks) prevents information loss at chunk boundaries. This is the single highest-ROI chunking improvement.
Preserve metadata in every chunk
Include source document title, section header, page number, last updated date, and document type. Metadata enables filtering and improves answer attribution.
Consider parent-child chunk hierarchies
Index small chunks for precise retrieval but return their parent (larger) chunk for context. This gives you the best of both worlds: precise matching with sufficient context.
Use hybrid search (dense + sparse) by default
Hybrid search catches both semantic and keyword matches. Research consistently shows it outperforms either approach alone, especially for mixed query types.
Always add a reranking step
First-stage retrieval (vector search) is fast but imprecise. A cross-encoder reranker (Cohere, ColBERT, BGE) re-scores results for the final top-k selection. This typically improves precision by 15-30%.
Set minimum relevance thresholds
Don't inject documents below a similarity threshold (e.g., 0.7). Irrelevant context is worse than no context — it actively misleads the model.
Implement Maximal Marginal Relevance (MMR)
MMR balances relevance with diversity in retrieved results. Without it, you get five near-identical chunks about the same subtopic while missing other relevant information.
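The MMR selection loop itself is short. A sketch assuming candidates arrive with precomputed relevance scores and vectors:

```typescript
// Maximal Marginal Relevance: iteratively pick the candidate that is
// relevant to the query but dissimilar to what is already selected.
// lambda = 1 is pure relevance; lambda = 0 is pure diversity.
interface Scored {
  id: string;
  vector: number[];
  relevance: number; // query-document similarity from first-stage retrieval
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

function mmr(candidates: Scored[], k: number, lambda = 0.7): Scored[] {
  const selected: Scored[] = [];
  const pool = [...candidates];
  while (selected.length < k && pool.length > 0) {
    let bestIdx = 0;
    let bestScore = -Infinity;
    for (let i = 0; i < pool.length; i++) {
      // Penalty: similarity to the most similar already-selected doc
      const maxSim = selected.length
        ? Math.max(...selected.map((s) => cosine(pool[i].vector, s.vector)))
        : 0;
      const score = lambda * pool[i].relevance - (1 - lambda) * maxSim;
      if (score > bestScore) {
        bestScore = score;
        bestIdx = i;
      }
    }
    selected.push(pool.splice(bestIdx, 1)[0]);
  }
  return selected;
}
```

With lambda around 0.7, a near-duplicate of an already-selected chunk loses to a less relevant but novel one, which is exactly the behavior this practice calls for.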
Require source citations in every answer
Instruct the model to cite which source each claim comes from (by number or title). This enables verification and naturally reduces hallucination because the model must ground each claim.
Implement faithfulness checking
Score generated answers against retrieved context using RAGAS faithfulness metric. Decompose answers into claims and verify each against sources. Flag answers with faithfulness below 0.8.
Tell the model when to say 'I don't know'
Explicitly instruct: 'If the provided sources don't contain enough information to answer, say so rather than guessing.' This simple instruction dramatically reduces hallucination.
Separate retrieval evaluation from generation evaluation
A wrong answer can stem from bad retrieval OR bad generation. Evaluate context precision/recall independently from answer faithfulness/relevancy to diagnose which stage is failing.
Implement semantic caching
Cache answers for semantically similar queries. In production, common questions get asked hundreds of times daily. Semantic caching can reduce RAG pipeline costs by 60-80%.
Build an incremental indexing pipeline
Don't re-index your entire corpus for every update. Use content hashing to detect changes, update only modified documents, and handle deletions. This keeps your index fresh without rebuilding from scratch.
Monitor retrieval quality in production
Track metrics like average relevance score, cache hit rate, retrieval latency (p50/p95/p99), and user feedback signals. Set alerts for relevance score drops.
Version your embeddings
When you change embedding models, you must re-embed your entire corpus. Version your indexes so you can roll back. Never mix embeddings from different models in the same index.
Start with naive RAG, add agentic layers as needed
Self-RAG, CRAG, and adaptive RAG add complexity and latency. Start simple. Add agentic retrieval only when evaluation shows naive RAG is insufficient for your queries.
Use query routing for heterogeneous data sources
If you have SQL databases, vector stores, and knowledge graphs, route queries to the right backend. An LLM classifier or keyword-based router can determine the best retrieval strategy per query.
Implement corrective RAG for high-stakes applications
For medical, legal, or financial RAG, add a document relevance check after retrieval and before generation. If retrieved docs aren't relevant enough, fall back to web search or escalate to a human.
Set max retrieval iterations for agentic patterns
Agentic RAG patterns that retry retrieval (Self-RAG, CRAG) need a circuit breaker. Set a maximum of 3 retrieval iterations to prevent infinite loops and runaway costs.
The Guiding Principle
RAG quality is a compounding function of every pipeline stage. Good chunking makes embeddings more meaningful. Good embeddings make retrieval more precise. Good retrieval makes reranking more effective. Good reranking makes generation more faithful. Investing in the first stage (chunking) has the highest ROI because it multiplies through every downstream stage.
— Adapted from LlamaIndex, Pinecone, and Weaviate best practices
Resources & Further Reading
Papers, docs, and guides
Essential research papers, documentation, and guides for building production-grade RAG systems. These are the sources referenced throughout this academy.
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
The foundational Self-RAG paper by Asai et al. Introduces reflection tokens that let the model decide when to retrieve and verify its own outputs.
Corrective Retrieval Augmented Generation (CRAG)
Introduces a retrieval evaluator that scores document relevance and triggers corrective actions — knowledge refinement or web search fallback.
RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
Clusters and summarizes documents into a tree hierarchy, enabling retrieval at different levels of abstraction for both detailed and thematic queries.
RAGAS: Automated Evaluation of Retrieval Augmented Generation
Defines the four core RAG evaluation metrics: faithfulness, answer relevancy, context precision, and context recall. The standard framework for RAG evaluation.
LlamaIndex RAG Documentation
Comprehensive documentation covering RAG pipeline components, chunking strategies, query engines, and agentic RAG patterns with LlamaIndex.
LangChain RAG Tutorial
Step-by-step RAG tutorial covering document loading, splitting, embedding, retrieval, and generation with LangChain.
Pinecone RAG Guide
Production-focused RAG guide from Pinecone covering architecture patterns, scaling considerations, and optimization strategies.
Weaviate Hybrid Search
Deep dive into hybrid search combining dense vectors with BM25 sparse retrieval, including Reciprocal Rank Fusion algorithms.
Graph RAG: Unlocking LLM Discovery on Narrative Private Data
Microsoft's Graph RAG approach that builds a knowledge graph from documents and uses community summaries for global question answering.
Chunking Strategies for LLM Applications
Practical guide comparing fixed-size, recursive, semantic, and document-aware chunking strategies with benchmarks.
Suggested Reading Order
Start with the LangChain RAG Tutorial for hands-on basics. Read the RAGAS paper to understand evaluation. Then dive into Self-RAG and CRAG for agentic patterns. Finally, explore Graph RAG for advanced cross-document synthesis.