Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is an architecture that enhances LLMs by retrieving relevant information from external knowledge bases and incorporating it into the generation process. RAG addresses fundamental LLM limitations: knowledge cutoff dates, hallucination, lack of access to private/domain-specific data, and the inability to cite sources.

Rather than relying solely on knowledge encoded in model weights during pre-training, RAG systems dynamically fetch relevant context at inference time, grounding model responses in verifiable, up-to-date information. This chapter covers the complete RAG pipeline—from document ingestion and embedding to retrieval strategies, advanced patterns, and evaluation.


1. Why RAG?

The Problem with Parametric Knowledge

LLMs store knowledge in their parameters (weights) during pre-training. This "parametric knowledge" has fundamental limitations:

  • Knowledge Cutoff: Models only know what was in their training data. A model trained in 2024 doesn't know about events in 2025.
  • Hallucination: When asked about topics outside their training data (or at the boundary of their knowledge), models confabulate plausible-sounding but incorrect answers.
  • No Private Data: Models cannot access proprietary databases, internal documents, or user-specific information.
  • No Verifiability: You cannot trace a parametric answer back to a source document.
  • Expensive Updates: Updating parametric knowledge requires retraining or fine-tuning (costly and slow).

RAG as a Solution

RAG combines the strengths of retrieval systems (precise, up-to-date, verifiable) with generative models (fluent, reasoning-capable, flexible):

| Aspect | Pure LLM | Pure Retrieval | RAG |
|---|---|---|---|
| Fluency | Excellent | N/A (returns documents) | Excellent |
| Accuracy | Varies (hallucination risk) | High (if relevant doc exists) | High (grounded in sources) |
| Up-to-date | No (knowledge cutoff) | Yes (live index) | Yes |
| Private Data | No | Yes | Yes |
| Citable | No | Yes | Yes |
| Reasoning | Yes | No | Yes |

2. RAG Architecture Overview

Core Workflow

                    ┌──────────────────────────────────────────────┐
                    │              INDEXING (Offline)              │
                    │                                              │
                    │  Documents → Chunking → Embedding → VectorDB │
                    └──────────────────────────────────────────────┘

                    ┌──────────────────────────────────────────────┐
                    │              QUERYING (Online)               │
                    │                                              │
                    │  User Query → Embed → Retrieve → Augment     │
                    │             → Generate Response              │
                    └──────────────────────────────────────────────┘

Phase 1: Indexing (Offline)

  1. Load documents from various sources (PDFs, web pages, databases, APIs).
  2. Chunk documents into manageable pieces.
  3. Embed each chunk into a dense vector using an embedding model.
  4. Store vectors + metadata in a vector database.

Phase 2: Querying (Online)

  1. Embed the user query using the same embedding model.
  2. Retrieve the top-K most similar chunks from the vector database.
  3. Augment the LLM prompt with retrieved context.
  4. Generate a response grounded in the retrieved information.

Pseudocode (Complete RAG System)

class RAGSystem:
    embedding_model: EmbeddingModel
    vector_store: VectorDatabase
    llm: LanguageModel
    chunker: DocumentChunker
    reranker: Reranker = None

    // ---- INDEXING ----

    function ingest(documents: list[Document])
        for doc in documents:
            // 1. Chunk the document
            chunks = self.chunker.chunk(doc.text, metadata=doc.metadata)

            // 2. Generate embeddings
            embeddings = self.embedding_model.encode_batch([c.text for c in chunks])

            // 3. Store in vector database
            self.vector_store.upsert(
                ids=[c.id for c in chunks],
                vectors=embeddings,
                metadata=[c.metadata for c in chunks],
                texts=[c.text for c in chunks]
            )

    // ---- QUERYING ----

    function query(user_query: str, top_k: int = 5) -> RAGResponse
        // 1. Embed the query
        query_vector = self.embedding_model.encode(user_query)

        // 2. Retrieve relevant chunks
        results = self.vector_store.search(query_vector, top_k=top_k * 3)

        // 3. Optional: re-rank for precision
        if self.reranker:
            results = self.reranker.rerank(user_query, results, top_k=top_k)
        else:
            results = results[:top_k]

        // 4. Build augmented prompt
        context = self.format_context(results)
        prompt = self.build_prompt(user_query, context)

        // 5. Generate response
        response = self.llm.generate(prompt)

        return RAGResponse(
            answer=response,
            sources=results,
            confidence=self.estimate_confidence(results)
        )

    function build_prompt(query: str, context: str) -> str
        return f"""Use the following context to answer the question.
If the context doesn't contain the answer, say "I don't have enough information."
Always cite your sources.

Context:
{context}

Question: {query}

Answer:"""

3. Document Ingestion Pipeline

Document Loading

RAG systems must handle diverse document formats:

| Format | Challenges | Tools |
|---|---|---|
| PDF | Layout extraction, tables, images, scanned docs | PyPDF2, pdfplumber, Unstructured, Adobe Extract API |
| HTML/Web | Boilerplate removal, dynamic content | Beautiful Soup, Trafilatura, Playwright |
| Markdown | Structure preservation, code blocks | Standard parsers |
| Word/PowerPoint | Complex formatting, embedded objects | python-docx, python-pptx, Unstructured |
| Code | Syntax awareness, function boundaries | Tree-sitter, language-specific parsers |
| Databases | Schema understanding, query generation | SQLAlchemy, custom connectors |
| APIs | Rate limiting, pagination, auth | Custom connectors |

Chunking Strategies

Chunking is one of the most impactful decisions in RAG. Poor chunking leads to poor retrieval.

Fixed-Size Chunking

Split text into chunks of a fixed number of tokens/characters:

class FixedSizeChunker:
    chunk_size: int = 512     // tokens
    chunk_overlap: int = 50   // tokens overlap between chunks

    function chunk(text: str) -> list[Chunk]
        tokens = tokenize(text)
        chunks = []
        start = 0
        while start < len(tokens):
            end = min(start + self.chunk_size, len(tokens))
            chunk_text = detokenize(tokens[start:end])
            chunks.append(Chunk(text=chunk_text, start=start, end=end))
            start += self.chunk_size - self.chunk_overlap
        return chunks

Pros: Simple, predictable chunk sizes, easy to implement. Cons: May split sentences, paragraphs, or semantic units mid-thought.

Guidelines:

  • Chunk size: 256-1024 tokens. Smaller chunks = more precise retrieval but less context; larger chunks = more context but more noise.
  • Overlap: 10-20% of chunk size. Preserves context at boundaries.

Semantic Chunking

Split at natural boundaries (sentences, paragraphs, sections):

class SemanticChunker:
    max_chunk_size: int = 1024
    min_chunk_size: int = 100

    function chunk(text: str) -> list[Chunk]
        // Split into sentences
        sentences = split_sentences(text)

        chunks = []
        current_chunk = []
        current_size = 0

        for sentence in sentences:
            sentence_size = count_tokens(sentence)
            if current_size + sentence_size > self.max_chunk_size and current_size > self.min_chunk_size:
                chunks.append(Chunk(text=" ".join(current_chunk)))
                current_chunk = []
                current_size = 0
            current_chunk.append(sentence)
            current_size += sentence_size

        if current_chunk:
            chunks.append(Chunk(text=" ".join(current_chunk)))

        return chunks

Pros: Preserves semantic units, better retrieval quality. Cons: Variable chunk sizes, more complex implementation.

Recursive Chunking

Hierarchical splitting: try large splits first (sections), then fall back to smaller splits (paragraphs, sentences):

class RecursiveChunker:
    separators: list[str] = ["\n\n", "\n", ". ", " "]
    max_chunk_size: int = 512

    function chunk(text: str, separator_idx: int = 0) -> list[Chunk]
        if count_tokens(text) <= self.max_chunk_size:
            return [Chunk(text=text)]

        if separator_idx >= len(self.separators):
            return [Chunk(text=text[:self.max_chunk_size])]

        separator = self.separators[separator_idx]
        splits = text.split(separator)

        chunks = []
        current = ""
        for split in splits:
            if count_tokens(current + separator + split) > self.max_chunk_size:
                if current:
                    chunks.extend(self.chunk(current, separator_idx + 1))
                current = split
            else:
                current = current + separator + split if current else split

        if current:
            chunks.extend(self.chunk(current, separator_idx + 1))

        return chunks

Pros: Respects document structure, handles varied document types. Cons: More complex, may produce very uneven chunk sizes.

Agentic / Smart Chunking

Use an LLM to determine optimal chunk boundaries:

  1. Pass the document to an LLM with instructions to identify natural topic boundaries.
  2. The LLM returns split points based on semantic understanding.
  3. More expensive but highest quality chunking.

Code-Aware Chunking

For code repositories, split by structural units:

  • Functions/Methods: Each function as a separate chunk.
  • Classes: Chunk by class with method-level granularity.
  • Files: Small files as single chunks.
  • AST-based: Use abstract syntax tree to find semantic boundaries.

Tools: Tree-sitter parsers, language-specific AST tools.
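
As a concrete illustration, Python's stdlib `ast` module can recover top-level function and class boundaries — a minimal, single-language sketch (real pipelines typically use Tree-sitter for multi-language support; `chunk_python_source` is an illustrative name):

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based line spans of the node (Python 3.8+)
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks
```

Module-level statements between definitions are dropped here; a production chunker would also capture those (and decorators) as separate chunks.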

Chunking Comparison

| Strategy | Semantic Quality | Implementation | Cost | Best For |
|---|---|---|---|---|
| Fixed-Size | Low | Simple | Low | Homogeneous text |
| Semantic (Sentence) | Medium | Moderate | Low | Articles, documentation |
| Recursive | Medium-High | Moderate | Low | Mixed-format documents |
| Document Structure | High | Complex | Low | Well-structured documents (Markdown, HTML) |
| LLM-based | Highest | Moderate | High | High-value, complex documents |
| Code-Aware | High (for code) | Complex | Low | Source code repositories |

4. Embedding Models

Embeddings convert text into dense vector representations where semantic similarity is captured by geometric proximity (cosine similarity, dot product).

How Embedding Models Work

Embedding models are typically encoder-only transformers (BERT-family) trained with contrastive learning objectives:

  1. Contrastive Learning: Train the model so that semantically similar texts have similar embeddings and dissimilar texts have different embeddings.
  2. Training Data: Pairs or triplets of (query, positive_passage, negative_passage).
  3. Loss Function: InfoNCE loss, triplet loss, or multiple negatives ranking loss.

| Model | Dimensions | Max Tokens | MTEB Score | Provider | Notes |
|---|---|---|---|---|---|
| text-embedding-3-large | 3072 | 8191 | ~64.6 | OpenAI | Best commercial, Matryoshka support |
| text-embedding-3-small | 1536 | 8191 | ~62.3 | OpenAI | Good balance of cost/quality |
| voyage-3 | 1024 | 32000 | ~67.1 | Voyage AI | Best for code and technical text |
| BGE-large-en-v1.5 | 1024 | 512 | ~64.2 | BAAI | Best open-source (English) |
| E5-mistral-7b-instruct | 4096 | 32768 | ~66.6 | Microsoft | Instruction-tuned, very high quality |
| all-MiniLM-L6-v2 | 384 | 256 | ~56.3 | SentenceTransformers | Fast, lightweight |
| Cohere embed-v3 | 1024 | 512 | ~64.5 | Cohere | Compression support, multilingual |
| nomic-embed-text-v1.5 | 768 | 8192 | ~62.3 | Nomic | Open-source, long context |

Matryoshka Embeddings

A technique where embeddings are trained so that truncated prefixes (e.g., first 256 dims of a 1536-dim vector) are still useful embeddings:

  • Full 3072 dims: Maximum quality for critical applications.
  • 1536 dims: 95%+ quality, half the storage.
  • 512 dims: 90%+ quality, 1/6 the storage.
  • 256 dims: 85%+ quality, fast approximate search.

Enables adaptive quality/cost trade-offs without retraining.
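
Client-side, using a Matryoshka embedding reduces to truncating the vector and re-normalizing — a NumPy sketch (assumes the provider's model was actually trained with a Matryoshka objective, so prefixes stay meaningful):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize to unit length."""
    truncated = vec[:dims]
    return truncated / np.linalg.norm(truncated)

# Example: shrink a 3072-dim embedding to 256 dims for fast approximate search
full = np.random.default_rng(0).standard_normal(3072)
small = truncate_embedding(full, 256)
```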

Embedding Best Practices

  1. Match the embedding model to your data: Use code-specialized embeddings for code, multilingual for international content.
  2. Instruction-tuned embeddings: Models like E5 support different instructions for queries vs. documents, improving retrieval.
  3. Normalize embeddings: For cosine similarity, ensure vectors are L2-normalized.
  4. Test on your data: MTEB scores are useful but don't substitute for evaluation on your specific domain.
  5. Consider dimensions: Higher dimensions capture more nuance but cost more storage and compute.

5. Vector Databases

Vector databases are specialized systems for storing and querying high-dimensional vectors efficiently. They are the backbone of RAG systems, enabling fast approximate nearest neighbor (ANN) search.

| Database | Architecture | Key Features | Best For |
|---|---|---|---|
| Pinecone | Managed cloud | Auto-scaling, metadata filtering, serverless | Production, ease of use |
| Weaviate | Open-source + cloud | GraphQL API, hybrid search, modules | Complex queries, graph relationships |
| Qdrant | Open-source + cloud | Rich filtering, payload storage, gRPC | High performance, complex filtering |
| Chroma | Open-source | Simple Python API, local-first | Development, prototyping |
| FAISS | Library (Meta) | GPU acceleration, multiple index types | Research, large-scale, in-memory |
| Milvus | Open-source | Distributed, cloud-native (Zilliz Cloud) | Enterprise scale |
| pgvector | PostgreSQL extension | SQL integration, ACID compliance | Existing PostgreSQL users |
| LanceDB | Open-source | Serverless, disk-based, multi-modal | Cost-effective, serverless |

Vector Index Types

The key to fast vector search is the index structure. Different index types trade off between speed, accuracy, and memory:

HNSW (Hierarchical Navigable Small World)

The most popular index type. Builds a multi-layer graph where:

  • Bottom layer contains all vectors, connected to their nearest neighbors.
  • Upper layers are progressively sparser, enabling fast navigation.
  • Search starts at the top layer and navigates down to find nearest neighbors.

// HNSW Search (simplified)
function hnsw_search(query: Vector, top_k: int) -> list[Result]
    // Start at top layer with entry point
    current = entry_point
    for layer in range(top_layer, 0, -1):
        // Greedily navigate to nearest neighbor in this layer
        current = greedy_search(query, current, layer)
    // Exhaustive search in bottom layer from current position
    return beam_search(query, current, layer=0, top_k=top_k)

Parameters:

  • M: Number of connections per node (higher = better recall, more memory). Typical: 16-64.
  • ef_construction: Search width during index building (higher = better quality, slower build). Typical: 128-512.
  • ef_search: Search width during querying (higher = better recall, slower query). Typical: 64-256.

Complexity: O(log n) search time, O(n × M) memory.

IVF (Inverted File Index)

Clusters vectors into buckets using k-means, then searches only the nearest clusters:

  1. Build: Run k-means to create nlist cluster centroids.
  2. Search: Find nearest nprobe centroids, then search vectors within those clusters.

Parameters:

  • nlist: Number of clusters (typical: sqrt(n) to 4×sqrt(n)).
  • nprobe: Number of clusters to search (higher = better recall, slower).

Complexity: O(nprobe × n/nlist) per query.
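
The build and search steps above can be sketched in NumPy — a toy implementation for illustration only (real systems such as FAISS's IVF indexes add quantization and optimized distance kernels):

```python
import numpy as np

class IVFIndex:
    """Minimal IVF sketch: k-means clustering, then search only nprobe clusters."""

    def __init__(self, nlist: int, seed: int = 0):
        self.nlist = nlist
        self.rng = np.random.default_rng(seed)

    def build(self, vectors: np.ndarray, iters: int = 10) -> None:
        self.vectors = vectors
        # Initialize centroids from random data points, then run plain k-means
        init = self.rng.choice(len(vectors), self.nlist, replace=False)
        self.centroids = vectors[init].copy()
        for _ in range(iters):
            dists = np.linalg.norm(vectors[:, None] - self.centroids[None], axis=2)
            assign = dists.argmin(axis=1)
            for c in range(self.nlist):
                members = vectors[assign == c]
                if len(members):
                    self.centroids[c] = members.mean(axis=0)
        # Final assignment against the converged centroids
        dists = np.linalg.norm(vectors[:, None] - self.centroids[None], axis=2)
        self.assign = dists.argmin(axis=1)

    def search(self, query: np.ndarray, top_k: int, nprobe: int = 2) -> np.ndarray:
        # 1. Find the nprobe nearest cluster centroids
        probe = np.argsort(np.linalg.norm(self.centroids - query, axis=1))[:nprobe]
        # 2. Exhaustively search only vectors assigned to those clusters
        candidates = np.flatnonzero(np.isin(self.assign, probe))
        dists = np.linalg.norm(self.vectors[candidates] - query, axis=1)
        return candidates[np.argsort(dists)[:top_k]]
```

Note the recall/latency knob: raising `nprobe` scans more clusters, trading speed for recall exactly as described above.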

LSH (Locality Sensitive Hashing)

Hash vectors such that similar vectors hash to the same bucket:

  • Uses random hyperplane projections.
  • Fast but lower recall than HNSW/IVF.
  • Good for very large-scale, approximate search.

Similarity Metrics

| Metric | Formula | Properties | When to Use |
|---|---|---|---|
| Cosine Similarity | cos(θ) = (A·B) / (‖A‖ × ‖B‖) | Scale-invariant, range [-1, 1] | Most common, especially for text embeddings |
| Dot Product | A·B | Captures both angle and magnitude | When magnitude matters (popularity, relevance scores) |
| Euclidean (L2) | ‖A − B‖ | Distance metric (lower = more similar) | When absolute position matters |

Note: For L2-normalized vectors (unit vectors), cosine similarity and dot product give identical rankings.
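
This note is easy to verify numerically — for unit vectors the norms in the cosine denominator are 1, so both metrics produce the same ordering:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.standard_normal((100, 64))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # L2-normalize each row
query = rng.standard_normal(64)
query /= np.linalg.norm(query)

dot_scores = docs @ query
cosine_scores = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

# Identical rankings: cosine similarity of unit vectors is just the dot product
assert (np.argsort(dot_scores) == np.argsort(cosine_scores)).all()
```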

Metadata Filtering

Vector databases support filtering by metadata (dates, categories, sources) before or after vector search:

// Search for similar vectors with metadata filter
results = vector_db.search(
    query_vector=embed("What is RAG?"),
    top_k=10,
    filter={
        "source": {"$in": ["documentation", "tutorials"]},
        "date": {"$gte": "2024-01-01"},
        "language": "en"
    }
)

  • Pre-filtering: Apply the metadata filter first, then run vector search on the filtered set. Efficient when the filter is selective.
  • Post-filtering: Run vector search first, then filter the results. Better recall during search, but may return fewer than top-K results.

Hybrid Search

Combine vector (semantic) search with keyword (lexical) search for better coverage:

class HybridSearch:
    vector_store: VectorDatabase
    bm25_index: BM25Index

    function search(query: str, top_k: int = 10, alpha: float = 0.7) -> list[Result]
        query_vector = embed(query)

        // Semantic search
        vector_results = self.vector_store.search(query_vector, top_k=top_k * 3)

        // Keyword search (BM25)
        keyword_results = self.bm25_index.search(query, top_k=top_k * 3)

        // Combine with Reciprocal Rank Fusion (RRF)
        combined = reciprocal_rank_fusion(vector_results, keyword_results, k=60)

        return combined[:top_k]

function reciprocal_rank_fusion(
    *result_lists: list[Result],
    k: int = 60
) -> list[Result]
    scores = {}
    results_by_id = {}
    for result_list in result_lists:
        for rank, result in enumerate(result_list):
            results_by_id[result.id] = result
            scores[result.id] = scores.get(result.id, 0) + 1.0 / (k + rank + 1)
    ranked_ids = sorted(scores, key=lambda id: scores[id], reverse=True)
    return [results_by_id[id] for id in ranked_ids]

Why hybrid?

  • Vector search finds semantically similar content ("car" matches "automobile").
  • Keyword search finds exact matches ("API-KEY-12345" won't have a semantic match).
  • Combining both covers more cases than either alone.

Pseudocode (Vector Database Operations)

class VectorDatabase:
    index: HNSWIndex
    metadata_store: dict[str, dict]
    text_store: dict[str, str]

    function upsert(ids: list[str], vectors: list[Tensor], metadata: list[dict], texts: list[str])
        for id, vector, meta, text in zip(ids, vectors, metadata, texts):
            self.index.add(vector, id=id)
            self.metadata_store[id] = meta
            self.text_store[id] = text

    function search(
        query_vector: Tensor,
        top_k: int = 10,
        filter: dict = None
    ) -> list[SearchResult]
        // Get candidates from vector index
        if filter:
            candidate_ids = self.apply_metadata_filter(filter)
            scores, ids = self.index.search(query_vector, top_k, candidates=candidate_ids)
        else:
            scores, ids = self.index.search(query_vector, top_k)

        results = []
        for score, id in zip(scores, ids):
            results.append(SearchResult(
                id=id,
                score=score,
                text=self.text_store[id],
                metadata=self.metadata_store[id]
            ))
        return results

    function delete(ids: list[str])
        for id in ids:
            self.index.remove(id)
            del self.metadata_store[id]
            del self.text_store[id]

6. Retrieval Strategies

Basic Similarity Search

The simplest retrieval: embed the query, find the top-K nearest vectors.

  • Pros: Fast, simple, works well for straightforward queries.
  • Cons: May miss relevant documents with different wording. Single query may not capture user intent.

Re-ranking

After initial retrieval, re-rank results using a more expensive but accurate model:

Cross-Encoder Re-ranking

A cross-encoder takes the (query, document) pair as input and outputs a relevance score. Unlike bi-encoders (used for initial retrieval), cross-encoders can model fine-grained interactions between query and document.

class CrossEncoderReranker:
    model: CrossEncoder  // e.g., ms-marco-MiniLM-L-12-v2

    function rerank(query: str, results: list[SearchResult], top_k: int) -> list[SearchResult]
        // Score each (query, document) pair
        pairs = [(query, result.text) for result in results]
        scores = self.model.predict(pairs)

        // Sort by cross-encoder score
        scored_results = zip(results, scores)
        sorted_results = sorted(scored_results, key=lambda x: x[1], reverse=True)

        return [result for result, score in sorted_results[:top_k]]

Why re-rank?: Bi-encoders (used for initial retrieval) encode query and document independently — they can't model token-level interactions. Cross-encoders see both together, enabling much higher precision. The trade-off is speed: cross-encoders are ~100× slower, so they're only applied to the top-N candidates.

LLM Re-ranking

Use an LLM to assess relevance:

prompt = f"""Rate the relevance of this document to the query on a scale of 1-10.

Query: {query}
Document: {document}

Relevance score (1-10):"""

More expensive but can capture nuanced relevance that embedding similarity misses.

Query Processing Techniques

Query Expansion

Generate query variations to improve recall:

class QueryExpander:
    llm: LanguageModel

    function expand(query: str, num_variations: int = 3) -> list[str]
        prompt = f"""Generate {num_variations} alternative phrasings of this search query
that might match relevant documents. Include synonyms and related concepts.

Original query: {query}

Alternative queries:"""

        variations = self.llm.generate(prompt)
        return [query] + parse_variations(variations)

Query Rewriting

Transform the user query into a better search query:

function rewrite_query(user_query: str, chat_history: list[dict]) -> str
    prompt = f"""Given the conversation history and the latest user question,
formulate a standalone search query that captures the full intent.

Chat history:
{format_history(chat_history)}

Latest question: {user_query}

Standalone search query:"""

    return llm.generate(prompt)

This is critical for conversational RAG where the user's query references previous messages (e.g., "Tell me more about that" — "that" must be resolved using chat history).

HyDE (Hypothetical Document Embeddings)

Instead of embedding the query directly, generate a hypothetical answer and embed that:

  1. User asks: "What causes aurora borealis?"
  2. LLM generates a hypothetical answer (may be imperfect).
  3. Embed the hypothetical answer (which is in the same "space" as actual documents).
  4. Search with this embedding.

Why it works: Queries and documents have different linguistic structures. A hypothetical answer is more similar to actual documents than a short query is.
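
An end-to-end sketch of the HyDE steps, with a stubbed LLM and a toy bag-of-words embedding standing in for real models (`VOCAB`, `fake_llm`, and `hyde_search` are illustrative names, not a real API):

```python
import numpy as np

VOCAB = ["aurora", "solar", "wind", "particles", "magnetic", "field",
         "atmosphere", "recipe", "pasta", "tomato"]

def embed(text: str) -> np.ndarray:
    """Toy bag-of-words embedding over a fixed vocabulary (stand-in for a
    real embedding model)."""
    words = text.lower().split()
    vec = np.array([float(words.count(w)) for w in VOCAB])
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def fake_llm(prompt: str) -> str:
    """Stub: returns a canned hypothetical answer. A real system would call a
    generative model here."""
    return ("aurora displays occur when solar wind particles collide with the "
            "atmosphere guided by the magnetic field")

def hyde_search(query: str, corpus: list[str]) -> str:
    # 1. Generate a hypothetical answer to the query
    hypothetical = fake_llm(f"Write a short passage answering: {query}")
    # 2. Embed the hypothetical answer instead of the raw query
    qvec = embed(hypothetical)
    # 3. Retrieve the document most similar to the hypothetical answer
    scores = [embed(doc) @ qvec for doc in corpus]
    return corpus[int(np.argmax(scores))]
```

The key line is step 2: the search vector comes from the generated passage, which shares vocabulary and structure with real documents, not from the short query.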

Multi-Query Retrieval

Generate multiple search queries, retrieve for each, then merge results:

function multi_query_retrieve(user_query: str, top_k: int) -> list[SearchResult]
    // Generate multiple perspectives
    queries = generate_query_variations(user_query)  // e.g., 3-5 queries

    // Retrieve for each
    all_results = []
    for query in queries:
        results = vector_store.search(embed(query), top_k=top_k)
        all_results.extend(results)

    // Deduplicate and merge (RRF or score aggregation)
    merged = reciprocal_rank_fusion(all_results)
    return merged[:top_k]

Retrieval Strategy Comparison

| Strategy | Recall | Precision | Latency | Cost | Best For |
|---|---|---|---|---|---|
| Basic similarity | Medium | Medium | Low | Low | Simple queries |
| Hybrid (vector + BM25) | High | Medium | Low | Low | General use |
| + Re-ranking | High | High | Medium | Medium | Precision-critical |
| + Query expansion | Very High | Medium | Medium | Medium | Recall-critical |
| + HyDE | High | Medium-High | High | High | Short/vague queries |
| + Multi-query | Very High | High | High | High | Complex queries |

7. Advanced RAG Patterns

Naive RAG

The simplest implementation:

Query → Embed → Retrieve Top-K → Stuff into Prompt → Generate

Limitations: No query processing, no re-ranking, no handling of conflicting sources, no iterative retrieval.

Advanced RAG

Adds pre-retrieval and post-retrieval processing:

Query → Rewrite → Expand → Retrieve → Re-rank → Filter → Augment Prompt → Generate

Specific techniques:

  • Parent-Child Chunking: Store small chunks for retrieval precision, but include the parent (larger) chunk in the LLM context for completeness.
  • Sentence Window Retrieval: Retrieve at the sentence level for precision, but expand to include surrounding sentences for context.
  • Metadata Filtering: Pre-filter by date, source, document type before vector search.
  • Contextual Compression: Use an LLM to extract only the relevant portions from retrieved chunks, reducing noise.
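
The sentence-window technique above fits in a few lines — a minimal sketch where `hit_index` is the position of the retrieved sentence in its source document:

```python
def expand_window(sentences: list[str], hit_index: int, window: int = 2) -> str:
    """Sentence-window retrieval: the index matches a single sentence for
    precision, but the LLM receives `window` neighbors on each side for
    context."""
    start = max(0, hit_index - window)
    end = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[start:end])
```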

Self-RAG (Self-Reflective RAG)

The model decides dynamically whether to retrieve, and evaluates the quality of retrieved information:

class SelfRAG:
    llm: LanguageModel
    retriever: Retriever

    function query(question: str) -> str
        // Step 1: Decide if retrieval is needed
        needs_retrieval = self.llm.generate(
            f"Do you need to search for information to answer: '{question}'? Yes/No"
        )

        if needs_retrieval == "Yes":
            // Step 2: Retrieve
            documents = self.retriever.search(question)

            // Step 3: Evaluate relevance of each document
            relevant_docs = []
            for doc in documents:
                is_relevant = self.llm.generate(
                    f"Is this document relevant to '{question}'?\nDocument: {doc}\nRelevant? Yes/No"
                )
                if is_relevant == "Yes":
                    relevant_docs.append(doc)

            // Step 4: Generate with relevant documents
            response = self.llm.generate(
                f"Context: {relevant_docs}\nQuestion: {question}\nAnswer:"
            )

            // Step 5: Verify factual consistency
            is_supported = self.llm.generate(
                f"Is this answer supported by the context?\nContext: {relevant_docs}\nAnswer: {response}\nSupported? Yes/No"
            )

            if is_supported == "No":
                response = self.llm.generate(
                    f"Revise this answer to be consistent with the context:\n{response}"
                )
        else:
            response = self.llm.generate(f"Answer: {question}")

        return response

Corrective RAG (CRAG)

Adds a correction step when retrieved documents are insufficient:

  1. Retrieve documents for the query.
  2. Evaluate retrieval quality using a lightweight evaluator.
  3. If quality is high: Use retrieved docs directly.
  4. If quality is medium: Refine the query and retrieve again.
  5. If quality is low: Fall back to web search or other knowledge sources.
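
The quality-based dispatch in steps 3-5 can be sketched as a simple threshold function (the 0.7/0.3 cutoffs are hypothetical values; CRAG produces the score with a trained lightweight evaluator):

```python
def crag_action(retrieval_score: float,
                high: float = 0.7, low: float = 0.3) -> str:
    """Map a retrieval-quality score to the CRAG correction step."""
    if retrieval_score >= high:
        return "use_retrieved_docs"        # quality high: answer from docs
    if retrieval_score >= low:
        return "refine_query_and_retry"    # quality medium: retrieve again
    return "fallback_to_web_search"        # quality low: other sources
```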

Graph RAG

Combines vector retrieval with knowledge graphs:

  1. Build a knowledge graph from documents (extract entities and relationships using an LLM).
  2. Community detection: Group related entities into communities.
  3. Community summaries: Generate summaries for each community.
  4. Retrieval: Combine vector search with graph traversal:
    • Vector search finds relevant chunks.
    • Graph traversal finds related entities and their connections.
    • Community summaries provide high-level context.

Benefits: Better at answering questions that require connecting information across multiple documents (multi-hop reasoning).

Pseudocode (Graph RAG):

class GraphRAG:
    llm: LanguageModel
    vector_store: VectorDatabase
    knowledge_graph: KnowledgeGraph  // Entities + relationships
    community_summaries: dict[str, str]

    function query(question: str) -> str
        // 1. Vector retrieval
        vector_results = self.vector_store.search(embed(question), top_k=10)

        // 2. Entity extraction from query
        entities = extract_entities(question)

        // 3. Graph traversal: find related entities and connecting paths
        paths = self.knowledge_graph.find_paths(entities)  // paths among query entities
        graph_context = []
        for entity in entities:
            neighbors = self.knowledge_graph.get_neighbors(entity, depth=2)
            graph_context.append(format_graph_info(neighbors, paths))

        // 4. Get relevant community summaries
        communities = self.knowledge_graph.get_communities(entities)
        summaries = [self.community_summaries[c] for c in communities]

        // 5. Combine all context
        context = combine(vector_results, graph_context, summaries)
        return self.llm.generate(build_prompt(question, context))

Agentic RAG

An AI agent orchestrates the RAG process, making dynamic decisions:

  • Which knowledge bases to search.
  • Whether to perform additional retrieval based on initial results.
  • How to combine information from multiple sources.
  • When to ask for clarification vs. attempt an answer.

class AgenticRAG:
    agent: ReActAgent
    knowledge_bases: dict[str, VectorDatabase]  // Multiple knowledge bases

    function query(question: str) -> str
        tools = [
            Tool("search_docs", "Search documentation", self.search_docs),
            Tool("search_code", "Search code repository", self.search_code),
            Tool("search_web", "Search the web", self.search_web),
            Tool("ask_clarification", "Ask user for clarification", self.ask_user),
        ]

        return self.agent.execute(
            goal=f"Answer this question accurately: {question}",
            tools=tools,
            max_iterations=5
        )

RAG Pattern Comparison

| Pattern | Complexity | Quality | Latency | Best For |
|---|---|---|---|---|
| Naive RAG | Low | Medium | Low | Prototyping, simple use cases |
| Advanced RAG | Medium | High | Medium | Production use cases |
| Self-RAG | High | Very High | High | Accuracy-critical applications |
| CRAG | Medium | High | Medium | Variable document quality |
| Graph RAG | High | Very High | High | Multi-hop reasoning, complex domains |
| Agentic RAG | Very High | Highest | High | Complex, multi-source queries |

8. RAG Evaluation

Evaluating RAG systems requires assessing both retrieval quality and generation quality.

Retrieval Metrics

| Metric | Formula | What It Measures |
|---|---|---|
| Precision@K | Relevant retrieved / K | Fraction of top-K results that are relevant |
| Recall@K | Relevant retrieved / Total relevant | Fraction of all relevant docs found in top-K |
| MRR (Mean Reciprocal Rank) | 1 / rank of first relevant result | How quickly the first relevant result appears |
| NDCG@K | Normalized DCG | Quality of ranking (accounts for position and relevance grade) |
| Hit Rate | Queries with ≥1 relevant result / Total queries | Percentage of queries with any relevant result |
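
The first three metrics are a few lines each — a sketch over retrieved document ids (MRR is the mean of `reciprocal_rank` across a query set):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved ids that are relevant."""
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant ids that appear in the top-k."""
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant result (0 if none was retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```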

Generation Metrics (RAGAS Framework)

RAGAS (Retrieval Augmented Generation Assessment) provides automated metrics:

| Metric | What It Measures | How |
|---|---|---|
| Faithfulness | Is the answer supported by the context? | LLM checks each claim against retrieved docs |
| Answer Relevance | Does the answer address the question? | Generate questions from the answer, compare to original |
| Context Relevance | Is the retrieved context relevant? | LLM rates each retrieved chunk's relevance |
| Context Recall | Does the context contain the ground truth? | Compare retrieved context against reference answer |
| Answer Correctness | Is the answer factually correct? | Compare against ground truth (when available) |

Evaluation Pipeline

class RAGEvaluator:
    evaluator_llm: LanguageModel  // Separate LLM for evaluation

    function evaluate(
        questions: list[str],
        ground_truths: list[str],
        rag_system: RAGSystem
    ) -> dict
        results = {"faithfulness": [], "relevance": [], "correctness": []}

        for question, truth in zip(questions, ground_truths):
            response = rag_system.query(question)

            // Faithfulness: is the answer grounded in retrieved context?
            faithfulness = self.check_faithfulness(
                response.answer, response.sources
            )
            results["faithfulness"].append(faithfulness)

            // Relevance: does the answer address the question?
            relevance = self.check_relevance(question, response.answer)
            results["relevance"].append(relevance)

            // Correctness: does the answer match ground truth?
            correctness = self.check_correctness(response.answer, truth)
            results["correctness"].append(correctness)

        return {k: mean(v) for k, v in results.items()}

End-to-End Evaluation Best Practices

  1. Create a golden dataset: Manually curate question-answer pairs from your actual documents. At least 50-100 pairs for meaningful evaluation.
  2. Test retrieval and generation separately: Poor generation might be caused by poor retrieval. Isolate the issue.
  3. Use LLM-as-judge: Use a strong LLM (GPT-4, Claude) to evaluate response quality at scale. Calibrate against human judgments.
  4. Monitor in production: Track user feedback (thumbs up/down), implicit signals (follow-up questions, session abandonment), and explicit corrections.
  5. A/B test changes: Any change to chunking, embedding model, retrieval strategy, or prompt should be A/B tested.

9. RAG vs. Fine-tuning

| Aspect | RAG | Fine-tuning | RAG + Fine-tuning |
|---|---|---|---|
| Data Requirements | Documents (no labels) | Labeled examples | Both |
| Cost | Low (embedding + inference) | High (training) | High |
| Update Frequency | Real-time (add/remove docs) | Requires retraining | Mixed |
| Knowledge Type | Factual, domain-specific | Behavioral, stylistic | Both |
| Hallucination | Lower (grounded) | Varies | Lowest |
| Best For | Knowledge bases, Q&A, docs | Task behavior, tone, format | Mission-critical applications |

When to use RAG: You have a knowledge base that changes frequently, need verifiable/citable answers, or want to keep the base model general.

When to fine-tune: You need specific output format, tone, or behavior; the task is specialized (medical, legal); or you need the model to "know" domain-specific patterns.

Best approach: Often combined — fine-tune for task behavior and output format, use RAG for factual knowledge.


10. Production Best Practices

  1. Chunking matters most: Invest time in finding the right chunking strategy for your documents. Test multiple approaches and evaluate.
  2. Embedding model selection: Test 2-3 embedding models on your actual data before committing. Domain-specific models often outperform general ones.
  3. Hybrid search by default: Combine vector and keyword search. It's almost always better than either alone.
  4. Re-rank for precision: A cross-encoder re-ranker significantly improves precision with modest latency cost.
  5. Metadata is powerful: Rich metadata enables filtering that dramatically improves relevance (date ranges, document types, access levels).
  6. Monitor retrieval quality: Track precision@K, recall@K, and user feedback continuously. Retrieval quality degrades as your corpus grows and changes.
  7. Handle "I don't know": Instruct the LLM to say when it lacks sufficient context rather than guessing. Measure and optimize the rate of correct "I don't know" responses.
  8. Citation and attribution: Always include source references in generated answers. This builds trust and enables verification.
  9. Evaluate end-to-end: Don't just evaluate retrieval or generation in isolation. The system quality is what matters to users.
  10. Plan for scale: Choose a vector database that handles your expected growth. Consider sharding, replication, and index rebuild strategies.
  11. Cache wisely: Cache embedding computations, frequently-asked queries, and LLM responses. Semantic caching (similar queries → cached answer) can dramatically reduce costs.
  12. Security: Implement access controls on documents (not all users should see all content), sanitize retrieved content, and guard against prompt injection through retrieved text.