Retrieval-Augmented Generation (RAG)¶
Retrieval-Augmented Generation (RAG) is an architecture that enhances LLMs by retrieving relevant information from external knowledge bases and incorporating it into the generation process. RAG addresses fundamental LLM limitations: knowledge cutoff dates, hallucination, lack of access to private/domain-specific data, and the inability to cite sources.
Rather than relying solely on knowledge encoded in model weights during pre-training, RAG systems dynamically fetch relevant context at inference time, grounding model responses in verifiable, up-to-date information. This chapter covers the complete RAG pipeline—from document ingestion and embedding to retrieval strategies, advanced patterns, and evaluation.
1. Why RAG?¶
The Problem with Parametric Knowledge¶
LLMs store knowledge in their parameters (weights) during pre-training. This "parametric knowledge" has fundamental limitations:
- Knowledge Cutoff: Models only know what was in their training data. A model trained in 2024 doesn't know about events in 2025.
- Hallucination: When asked about topics outside their training data (or at the boundary of their knowledge), models confabulate plausible-sounding but incorrect answers.
- No Private Data: Models cannot access proprietary databases, internal documents, or user-specific information.
- No Verifiability: You cannot trace a parametric answer back to a source document.
- Expensive Updates: Updating parametric knowledge requires retraining or fine-tuning (costly and slow).
RAG as a Solution¶
RAG combines the strengths of retrieval systems (precise, up-to-date, verifiable) with generative models (fluent, reasoning-capable, flexible):
| Aspect | Pure LLM | Pure Retrieval | RAG |
|---|---|---|---|
| Fluency | Excellent | N/A (returns documents) | Excellent |
| Accuracy | Varies (hallucination risk) | High (if relevant doc exists) | High (grounded in sources) |
| Up-to-date | No (knowledge cutoff) | Yes (live index) | Yes |
| Private Data | No | Yes | Yes |
| Citable | No | Yes | Yes |
| Reasoning | Yes | No | Yes |
2. RAG Architecture Overview¶
Core Workflow¶
┌─────────────────────────────────────────────┐
│ INDEXING (Offline) │
│ │
│ Documents → Chunking → Embedding → VectorDB │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ QUERYING (Online) │
│ │
│ User Query → Embed → Retrieve → Augment │
│ → Generate Response │
└─────────────────────────────────────────────┘
Phase 1: Indexing (Offline)
1. Load documents from various sources (PDFs, web pages, databases, APIs).
2. Chunk documents into manageable pieces.
3. Embed each chunk into a dense vector using an embedding model.
4. Store vectors + metadata in a vector database.
Phase 2: Querying (Online)
1. Embed the user query using the same embedding model.
2. Retrieve the top-K most similar chunks from the vector database.
3. Augment the LLM prompt with retrieved context.
4. Generate a response grounded in the retrieved information.
Pseudocode (Complete RAG System)¶
class RAGSystem:
embedding_model: EmbeddingModel
vector_store: VectorDatabase
llm: LanguageModel
chunker: DocumentChunker
reranker: Reranker = None
// ---- INDEXING ----
function ingest(documents: list[Document])
for doc in documents:
// 1. Chunk the document
chunks = self.chunker.chunk(doc.text, metadata=doc.metadata)
// 2. Generate embeddings
embeddings = self.embedding_model.encode_batch([c.text for c in chunks])
// 3. Store in vector database
self.vector_store.upsert(
ids=[c.id for c in chunks],
vectors=embeddings,
metadata=[c.metadata for c in chunks],
texts=[c.text for c in chunks]
)
// ---- QUERYING ----
function query(user_query: str, top_k: int = 5) -> RAGResponse
// 1. Embed the query
query_vector = self.embedding_model.encode(user_query)
// 2. Retrieve relevant chunks
results = self.vector_store.search(query_vector, top_k=top_k * 3)
// 3. Optional: re-rank for precision
if self.reranker:
results = self.reranker.rerank(user_query, results, top_k=top_k)
else:
results = results[:top_k]
// 4. Build augmented prompt
context = self.format_context(results)
prompt = self.build_prompt(user_query, context)
// 5. Generate response
response = self.llm.generate(prompt)
return RAGResponse(
answer=response,
sources=results,
confidence=self.estimate_confidence(results)
)
function build_prompt(query: str, context: str) -> str
return f"""Use the following context to answer the question.
If the context doesn't contain the answer, say "I don't have enough information."
Always cite your sources.
Context:
{context}
Question: {query}
Answer:"""
3. Document Ingestion Pipeline¶
Document Loading¶
RAG systems must handle diverse document formats:
| Format | Challenges | Tools |
|---|---|---|
| PDF | Layout extraction, tables, images, scanned docs | PyPDF2, pdfplumber, Unstructured, Adobe Extract API |
| HTML/Web | Boilerplate removal, dynamic content | Beautiful Soup, Trafilatura, Playwright |
| Markdown | Structure preservation, code blocks | Standard parsers |
| Word/PowerPoint | Complex formatting, embedded objects | python-docx, python-pptx, Unstructured |
| Code | Syntax awareness, function boundaries | Tree-sitter, language-specific parsers |
| Databases | Schema understanding, query generation | SQLAlchemy, custom connectors |
| APIs | Rate limiting, pagination, auth | Custom connectors |
Chunking Strategies¶
Chunking is one of the most impactful decisions in RAG. Poor chunking leads to poor retrieval.
Fixed-Size Chunking¶
Split text into chunks of a fixed number of tokens/characters:
class FixedSizeChunker:
chunk_size: int = 512 // tokens
chunk_overlap: int = 50 // tokens overlap between chunks
function chunk(text: str) -> list[Chunk]
tokens = tokenize(text)
chunks = []
start = 0
while start < len(tokens):
end = min(start + self.chunk_size, len(tokens))
chunk_text = detokenize(tokens[start:end])
chunks.append(Chunk(text=chunk_text, start=start, end=end))
if end == len(tokens): break // avoid emitting a redundant overlapping tail chunk
start += self.chunk_size - self.chunk_overlap
return chunks
Pros: Simple, predictable chunk sizes, easy to implement. Cons: May split sentences, paragraphs, or semantic units mid-thought.
Guidelines:
- Chunk size: 256-1024 tokens. Smaller chunks = more precise retrieval but less context. Larger chunks = more context but more noise.
- Overlap: 10-20% of chunk size. Preserves context at boundaries.
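A runnable sketch of the fixed-size strategy. For self-containment it approximates tokens as whitespace-delimited words (a real pipeline would use the embedding model's tokenizer), and the input text is synthetic:

```python
# Fixed-size chunking with overlap; "tokens" are whitespace words here.
def fixed_size_chunks(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    tokens = text.split()
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(" ".join(tokens[start:end]))
        if end == len(tokens):
            break  # stop: advancing again would emit a duplicate tail chunk
        start += chunk_size - overlap
    return chunks

text = " ".join(f"w{i}" for i in range(20))
chunks = fixed_size_chunks(text, chunk_size=8, overlap=2)
print(chunks)  # 3 chunks; consecutive chunks share the 2 overlap words
```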
Semantic Chunking¶
Split at natural boundaries (sentences, paragraphs, sections):
class SemanticChunker:
max_chunk_size: int = 1024
min_chunk_size: int = 100
function chunk(text: str) -> list[Chunk]
// Split into sentences
sentences = split_sentences(text)
chunks = []
current_chunk = []
current_size = 0
for sentence in sentences:
sentence_size = count_tokens(sentence)
if current_size + sentence_size > self.max_chunk_size and current_size > self.min_chunk_size:
chunks.append(Chunk(text=" ".join(current_chunk)))
current_chunk = []
current_size = 0
current_chunk.append(sentence)
current_size += sentence_size
if current_chunk:
chunks.append(Chunk(text=" ".join(current_chunk)))
return chunks
Pros: Preserves semantic units, better retrieval quality. Cons: Variable chunk sizes, more complex implementation.
Recursive Chunking¶
Hierarchical splitting: try large splits first (sections), then fall back to smaller splits (paragraphs, sentences):
class RecursiveChunker:
separators: list[str] = ["\n\n", "\n", ". ", " "]
max_chunk_size: int = 512
function chunk(text: str, separator_idx: int = 0) -> list[Chunk]
if count_tokens(text) <= self.max_chunk_size:
return [Chunk(text=text)]
if separator_idx >= len(self.separators):
// No separator left: hard-split into successive max-size slices instead of dropping the tail
return [Chunk(text=text[i : i + self.max_chunk_size]) for i in range(0, len(text), self.max_chunk_size)]
separator = self.separators[separator_idx]
splits = text.split(separator)
chunks = []
current = ""
for split in splits:
if count_tokens(current + separator + split) > self.max_chunk_size:
if current:
chunks.extend(self.chunk(current, separator_idx + 1))
current = split
else:
current = current + separator + split if current else split
if current:
chunks.extend(self.chunk(current, separator_idx + 1))
return chunks
Pros: Respects document structure, handles varied document types. Cons: More complex, may produce very uneven chunk sizes.
Agentic / Smart Chunking¶
Use an LLM to determine optimal chunk boundaries:
- Pass the document to an LLM with instructions to identify natural topic boundaries.
- The LLM returns split points based on semantic understanding.
- More expensive but highest quality chunking.
Code-Aware Chunking¶
For code repositories, split by structural units:
- Functions/Methods: Each function as a separate chunk.
- Classes: Chunk by class with method-level granularity.
- Files: Small files as single chunks.
- AST-based: Use abstract syntax tree to find semantic boundaries.
Tools: Tree-sitter parsers, language-specific AST tools.
Chunking Comparison¶
| Strategy | Semantic Quality | Implementation | Cost | Best For |
|---|---|---|---|---|
| Fixed-Size | Low | Simple | Low | Homogeneous text |
| Semantic (Sentence) | Medium | Moderate | Low | Articles, documentation |
| Recursive | Medium-High | Moderate | Low | Mixed-format documents |
| Document Structure | High | Complex | Low | Well-structured documents (Markdown, HTML) |
| LLM-based | Highest | Moderate | High | High-value, complex documents |
| Code-Aware | High (for code) | Complex | Low | Source code repositories |
4. Embedding Models¶
Embeddings convert text into dense vector representations where semantic similarity is captured by geometric proximity (cosine similarity, dot product).
How Embedding Models Work¶
Embedding models are typically encoder-only transformers (BERT-family) trained with contrastive learning objectives:
- Contrastive Learning: Train the model so that semantically similar texts have similar embeddings and dissimilar texts have different embeddings.
- Training Data: Pairs or triplets of (query, positive_passage, negative_passage).
- Loss Function: InfoNCE loss, triplet loss, or multiple negatives ranking loss.
Popular Embedding Models¶
| Model | Dimensions | Max Tokens | MTEB Score | Provider | Notes |
|---|---|---|---|---|---|
| text-embedding-3-large | 3072 | 8191 | ~64.6 | OpenAI | Best commercial, Matryoshka support |
| text-embedding-3-small | 1536 | 8191 | ~62.3 | OpenAI | Good balance of cost/quality |
| voyage-3 | 1024 | 32000 | ~67.1 | Voyage AI | Best for code and technical text |
| BGE-large-en-v1.5 | 1024 | 512 | ~64.2 | BAAI | Best open-source (English) |
| E5-mistral-7b-instruct | 4096 | 32768 | ~66.6 | Microsoft | Instruction-tuned, very high quality |
| all-MiniLM-L6-v2 | 384 | 256 | ~56.3 | SentenceTransformers | Fast, lightweight |
| Cohere embed-v3 | 1024 | 512 | ~64.5 | Cohere | Compression support, multilingual |
| nomic-embed-text-v1.5 | 768 | 8192 | ~62.3 | Nomic | Open-source, long context |
Matryoshka Embeddings¶
A technique where embeddings are trained so that truncated prefixes (e.g., first 256 dims of a 1536-dim vector) are still useful embeddings:
- Full 3072 dims: Maximum quality for critical applications.
- 1536 dims: 95%+ quality, half the storage.
- 512 dims: 90%+ quality, 1/6 the storage.
- 256 dims: 85%+ quality, fast approximate search.
Enables adaptive quality/cost trade-offs without retraining.
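The mechanics are simple to sketch: keep the first k dimensions, then re-normalize before comparing. The vectors below are toy values, not real Matryoshka embeddings; with a Matryoshka-trained model, rankings under the truncated vectors closely track those under the full vectors.

```python
import math

def truncate_and_normalize(vec: list[float], k: int) -> list[float]:
    # Keep the first k dimensions and re-normalize to unit length.
    head = vec[:k]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))  # inputs assumed unit-norm

doc = truncate_and_normalize([0.5, 0.5, 0.1, 0.0], k=2)
query = truncate_and_normalize([0.4, 0.6, 0.0, 0.2], k=2)
print(cosine(doc, query))
```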
Embedding Best Practices¶
- Match the embedding model to your data: Use code-specialized embeddings for code, multilingual for international content.
- Instruction-tuned embeddings: Models like E5 support different instructions for queries vs. documents, improving retrieval.
- Normalize embeddings: For cosine similarity, ensure vectors are L2-normalized.
- Test on your data: MTEB scores are useful but don't substitute for evaluation on your specific domain.
- Consider dimensions: Higher dimensions capture more nuance but cost more storage and compute.
5. Vector Databases¶
Vector databases are specialized systems for storing and querying high-dimensional vectors efficiently. They are the backbone of RAG systems, enabling fast approximate nearest neighbor (ANN) search.
Popular Vector Databases¶
| Database | Architecture | Key Features | Best For |
|---|---|---|---|
| Pinecone | Managed cloud | Auto-scaling, metadata filtering, serverless | Production, ease of use |
| Weaviate | Open-source + cloud | GraphQL API, hybrid search, modules | Complex queries, graph relationships |
| Qdrant | Open-source + cloud | Rich filtering, payload storage, gRPC | High performance, complex filtering |
| Chroma | Open-source | Simple Python API, local-first | Development, prototyping |
| FAISS | Library (Meta) | GPU acceleration, multiple index types | Research, large-scale, in-memory |
| Milvus | Open-source | Distributed, cloud-native (Zilliz Cloud) | Enterprise scale |
| pgvector | PostgreSQL extension | SQL integration, ACID compliance | Existing PostgreSQL users |
| LanceDB | Open-source | Serverless, disk-based, multi-modal | Cost-effective, serverless |
Vector Index Types¶
The key to fast vector search is the index structure. Different index types trade off between speed, accuracy, and memory:
HNSW (Hierarchical Navigable Small World)¶
The most popular index type. Builds a multi-layer graph where:
- Bottom layer contains all vectors, connected to their nearest neighbors.
- Upper layers are progressively sparser, enabling fast navigation.
- Search starts at the top layer and navigates down to find nearest neighbors.
// HNSW Search (simplified)
function hnsw_search(query: Vector, top_k: int) -> list[Result]
// Start at top layer with entry point
current = entry_point
for layer in range(top_layer, 0, -1):
// Greedily navigate to nearest neighbor in this layer
current = greedy_search(query, current, layer)
// Wider beam search in the bottom layer from the current entry point
return beam_search(query, current, layer=0, top_k=top_k)
Parameters:
- M: Number of connections per node (higher = better recall, more memory). Typical: 16-64.
- ef_construction: Search width during index building (higher = better quality, slower build). Typical: 128-512.
- ef_search: Search width during querying (higher = better recall, slower query). Typical: 64-256.
Complexity: O(log n) search time, O(n × M) memory.
IVF (Inverted File Index)¶
Clusters vectors into buckets using k-means, then searches only the nearest clusters:
- Build: Run k-means to create nlist cluster centroids.
- Search: Find the nearest nprobe centroids, then search only the vectors within those clusters.
Parameters:
- nlist: Number of clusters (typical: sqrt(n) to 4×sqrt(n)).
- nprobe: Number of clusters to search (higher = better recall, slower).
Complexity: O(nprobe × n/nlist) per query.
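A minimal sketch of the IVF search path. To keep it self-contained, the centroids are hand-picked rather than learned with k-means, and the toy vectors are 2-dimensional:

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hand-picked centroids standing in for the k-means build step.
centroids = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
vectors = {"a": (0.5, 0.2), "b": (9.8, 10.1), "c": (0.1, 9.7), "d": (1.0, 0.9)}

# Build: assign each vector to its nearest centroid (the inverted lists).
lists: dict[int, list[str]] = {i: [] for i in range(len(centroids))}
for vid, vec in vectors.items():
    nearest = min(range(len(centroids)), key=lambda i: l2(vec, centroids[i]))
    lists[nearest].append(vid)

def ivf_search(query, nprobe=1, top_k=2):
    # Probe only the nprobe clusters whose centroids are nearest the query,
    # then scan just the vectors assigned to those clusters.
    probed = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))[:nprobe]
    candidates = [vid for i in probed for vid in lists[i]]
    return sorted(candidates, key=lambda vid: l2(query, vectors[vid]))[:top_k]

print(ivf_search((0.4, 0.4), nprobe=1))
```

Raising nprobe widens the search toward an exact scan, trading latency for recall.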
LSH (Locality Sensitive Hashing)¶
Hash vectors such that similar vectors hash to the same bucket:
- Uses random hyperplane projections.
- Fast but lower recall than HNSW/IVF.
- Good for very large-scale, approximate search.
Similarity Metrics¶
| Metric | Formula | Properties | When to Use |
|---|---|---|---|
| Cosine Similarity | cos(θ) = (A·B) / (\|\|A\|\| × \|\|B\|\|) | Scale-invariant, range [-1, 1] | Most common, especially for text embeddings |
| Dot Product | A·B | Captures both angle and magnitude | When magnitude matters (popularity, relevance scores) |
| Euclidean (L2) | \|\|A - B\|\| | Distance metric (lower = more similar) | When absolute position matters |
Note: For L2-normalized vectors (unit vectors), cosine similarity and dot product give identical rankings.
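That equivalence is easy to verify: for unit vectors, dot product and cosine similarity are the same number, so they produce the same ranking. A quick check with toy vectors:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query = normalize([1.0, 2.0, 3.0])
docs = [normalize(v) for v in ([3.0, 2.0, 1.0], [1.0, 2.0, 2.9], [-1.0, 0.0, 1.0])]

# Rank documents by each metric; the orderings are identical.
by_dot = sorted(range(len(docs)), key=lambda i: dot(query, docs[i]), reverse=True)
by_cos = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)
print(by_dot, by_cos)
```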
Metadata Filtering¶
Vector databases support filtering by metadata (dates, categories, sources) before or after vector search:
// Search for similar vectors with metadata filter
results = vector_db.search(
query_vector=embed("What is RAG?"),
top_k=10,
filter={
"source": {"$in": ["documentation", "tutorials"]},
"date": {"$gte": "2024-01-01"},
"language": "en"
}
)
- Pre-filtering: Apply the metadata filter first, then run vector search on the filtered set. Efficient when the filter is selective.
- Post-filtering: Run vector search first, then filter the results. Better recall but may return fewer than top_k results.
Hybrid Search¶
Combine vector (semantic) search with keyword (lexical) search for better coverage:
class HybridSearch:
vector_store: VectorDatabase
bm25_index: BM25Index
function search(query: str, top_k: int = 10, alpha: float = 0.7) -> list[Result]
query_vector = embed(query)
// Semantic search
vector_results = self.vector_store.search(query_vector, top_k=top_k * 3)
// Keyword search (BM25)
keyword_results = self.bm25_index.search(query, top_k=top_k * 3)
// Combine with Reciprocal Rank Fusion (RRF)
combined = reciprocal_rank_fusion(vector_results, keyword_results, k=60)
return combined[:top_k]
function reciprocal_rank_fusion(*result_lists: list[Result], k: int = 60) -> list[Result]
scores = {}
by_id = {}
for result_list in result_lists:
for rank, result in enumerate(result_list):
by_id[result.id] = result
scores[result.id] = scores.get(result.id, 0) + 1.0 / (k + rank + 1)
ranked_ids = sorted(scores, key=scores.get, reverse=True)
return [by_id[id] for id in ranked_ids]
Why hybrid?
- Vector search finds semantically similar content ("car" matches "automobile").
- Keyword search finds exact matches ("API-KEY-12345" won't have a semantic match).
- Combining both covers more cases than either alone.
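The RRF fusion step itself is small enough to show runnable. This sketch fuses two ranked lists of document IDs (toy values) using the same k = 60 convention as above:

```python
def reciprocal_rank_fusion(*ranked_lists: list[str], k: int = 60) -> list[str]:
    # Each document scores 1/(k + rank + 1) per list it appears in;
    # documents ranked highly in multiple lists accumulate the most.
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7"]   # semantic ranking
keyword_hits = ["d1", "d9", "d3"]  # BM25 ranking
fused = reciprocal_rank_fusion(vector_hits, keyword_hits)
print(fused)
```

Note how d1, ranked well by both retrievers, outranks d3, which tops only one list.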
Pseudocode (Vector Database Operations)¶
class VectorDatabase:
index: HNSWIndex
metadata_store: dict[str, dict]
text_store: dict[str, str]
function upsert(ids: list[str], vectors: list[Tensor], metadata: list[dict], texts: list[str])
for id, vector, meta, text in zip(ids, vectors, metadata, texts):
self.index.add(vector, id=id)
self.metadata_store[id] = meta
self.text_store[id] = text
function search(
query_vector: Tensor,
top_k: int = 10,
filter: dict = None
) -> list[SearchResult]
// Get candidates from vector index
if filter:
candidate_ids = self.apply_metadata_filter(filter)
scores, ids = self.index.search(query_vector, top_k, candidates=candidate_ids)
else:
scores, ids = self.index.search(query_vector, top_k)
results = []
for score, id in zip(scores, ids):
results.append(SearchResult(
id=id,
score=score,
text=self.text_store[id],
metadata=self.metadata_store[id]
))
return results
function delete(ids: list[str])
for id in ids:
self.index.remove(id)
del self.metadata_store[id]
del self.text_store[id]
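For prototyping or small corpora, the same interface can be backed by exact brute-force search instead of an ANN index. A minimal sketch with toy 2-dimensional vectors (a real store would hold embedding-model outputs and swap in HNSW/IVF for sub-linear search):

```python
import math

class MiniVectorStore:
    def __init__(self):
        self.vectors: dict[str, list[float]] = {}
        self.texts: dict[str, str] = {}

    def upsert(self, id: str, vector: list[float], text: str):
        # Store L2-normalized vectors so dot product == cosine similarity.
        norm = math.sqrt(sum(x * x for x in vector)) or 1.0
        self.vectors[id] = [x / norm for x in vector]
        self.texts[id] = text

    def search(self, query: list[float], top_k: int = 3) -> list[tuple[str, float]]:
        norm = math.sqrt(sum(x * x for x in query)) or 1.0
        q = [x / norm for x in query]
        scored = [
            (id, sum(a * b for a, b in zip(q, v)))  # exact scan over all vectors
            for id, v in self.vectors.items()
        ]
        return sorted(scored, key=lambda t: t[1], reverse=True)[:top_k]

store = MiniVectorStore()
store.upsert("a", [1.0, 0.0], "doc about apples")
store.upsert("b", [0.0, 1.0], "doc about bananas")
store.upsert("c", [0.9, 0.1], "another apple doc")
print(store.search([1.0, 0.2], top_k=2))
```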
6. Retrieval Strategies¶
Basic Similarity Search¶
The simplest retrieval: embed the query, find the top-K nearest vectors.
- Pros: Fast, simple, works well for straightforward queries.
- Cons: May miss relevant documents with different wording. Single query may not capture user intent.
Re-ranking¶
After initial retrieval, re-rank results using a more expensive but accurate model:
Cross-Encoder Re-ranking¶
A cross-encoder takes the (query, document) pair as input and outputs a relevance score. Unlike bi-encoders (used for initial retrieval), cross-encoders can model fine-grained interactions between query and document.
class CrossEncoderReranker:
model: CrossEncoder // e.g., ms-marco-MiniLM-L-12-v2
function rerank(query: str, results: list[SearchResult], top_k: int) -> list[SearchResult]
// Score each (query, document) pair
pairs = [(query, result.text) for result in results]
scores = self.model.predict(pairs)
// Sort by cross-encoder score
scored_results = zip(results, scores)
sorted_results = sorted(scored_results, key=lambda x: x[1], reverse=True)
return [result for result, score in sorted_results[:top_k]]
Why re-rank?: Bi-encoders (used for initial retrieval) encode query and document independently — they can't model token-level interactions. Cross-encoders see both together, enabling much higher precision. The trade-off is speed: cross-encoders are ~100× slower, so they're only applied to the top-N candidates.
LLM Re-ranking¶
Use an LLM to assess relevance:
prompt = f"""Rate the relevance of this document to the query on a scale of 1-10.
Query: {query}
Document: {document}
Relevance score (1-10):"""
More expensive but can capture nuanced relevance that embedding similarity misses.
Query Processing Techniques¶
Query Expansion¶
Generate query variations to improve recall:
class QueryExpander:
llm: LanguageModel
function expand(query: str, num_variations: int = 3) -> list[str]
prompt = f"""Generate {num_variations} alternative phrasings of this search query
that might match relevant documents. Include synonyms and related concepts.
Original query: {query}
Alternative queries:"""
variations = self.llm.generate(prompt)
return [query] + parse_variations(variations)
Query Rewriting¶
Transform the user query into a better search query:
function rewrite_query(user_query: str, chat_history: list[dict]) -> str
prompt = f"""Given the conversation history and the latest user question,
formulate a standalone search query that captures the full intent.
Chat history:
{format_history(chat_history)}
Latest question: {user_query}
Standalone search query:"""
return llm.generate(prompt)
This is critical for conversational RAG where the user's query references previous messages (e.g., "Tell me more about that" — "that" must be resolved using chat history).
HyDE (Hypothetical Document Embeddings)¶
Instead of embedding the query directly, generate a hypothetical answer and embed that:
- User asks: "What causes aurora borealis?"
- LLM generates a hypothetical answer (may be imperfect).
- Embed the hypothetical answer (which is in the same "space" as actual documents).
- Search with this embedding.
Why it works: Queries and documents have different linguistic structures. A hypothetical answer is more similar to actual documents than a short query is.
Multi-Query Retrieval¶
Generate multiple search queries, retrieve for each, then merge results:
function multi_query_retrieve(user_query: str, top_k: int) -> list[SearchResult]
// Generate multiple perspectives
queries = generate_query_variations(user_query) // e.g., 3-5 queries
// Retrieve for each
all_results = []
for query in queries:
results = vector_store.search(embed(query), top_k=top_k)
all_results.extend(results)
// Deduplicate and merge (RRF or score aggregation)
merged = reciprocal_rank_fusion(all_results)
return merged[:top_k]
Retrieval Strategy Comparison¶
| Strategy | Recall | Precision | Latency | Cost | Best For |
|---|---|---|---|---|---|
| Basic similarity | Medium | Medium | Low | Low | Simple queries |
| Hybrid (vector + BM25) | High | Medium | Low | Low | General use |
| + Re-ranking | High | High | Medium | Medium | Precision-critical |
| + Query expansion | Very High | Medium | Medium | Medium | Recall-critical |
| + HyDE | High | Medium-High | High | High | Short/vague queries |
| + Multi-query | Very High | High | High | High | Complex queries |
7. Advanced RAG Patterns¶
Naive RAG¶
The simplest implementation:
Query → Embed → Retrieve Top-K → Stuff into Prompt → Generate
Limitations: No query processing, no re-ranking, no handling of conflicting sources, no iterative retrieval.
Advanced RAG¶
Adds pre-retrieval and post-retrieval processing:
Query → Rewrite → Expand → Retrieve → Re-rank → Filter → Augment Prompt → Generate
Specific techniques:
- Parent-Child Chunking: Store small chunks for retrieval precision, but include the parent (larger) chunk in the LLM context for completeness.
- Sentence Window Retrieval: Retrieve at the sentence level for precision, but expand to include surrounding sentences for context.
- Metadata Filtering: Pre-filter by date, source, document type before vector search.
- Contextual Compression: Use an LLM to extract only the relevant portions from retrieved chunks, reducing noise.
Self-RAG (Self-Reflective RAG)¶
The model decides dynamically whether to retrieve, and evaluates the quality of retrieved information:
class SelfRAG:
llm: LanguageModel
retriever: Retriever
function query(question: str) -> str
// Step 1: Decide if retrieval is needed
needs_retrieval = self.llm.generate(
f"Do you need to search for information to answer: '{question}'? Yes/No"
)
if needs_retrieval == "Yes":
// Step 2: Retrieve
documents = self.retriever.search(question)
// Step 3: Evaluate relevance of each document
relevant_docs = []
for doc in documents:
is_relevant = self.llm.generate(
f"Is this document relevant to '{question}'?\nDocument: {doc}\nRelevant? Yes/No"
)
if is_relevant == "Yes":
relevant_docs.append(doc)
// Step 4: Generate with relevant documents
response = self.llm.generate(
f"Context: {relevant_docs}\nQuestion: {question}\nAnswer:"
)
// Step 5: Verify factual consistency
is_supported = self.llm.generate(
f"Is this answer supported by the context?\nContext: {relevant_docs}\nAnswer: {response}\nSupported? Yes/No"
)
if is_supported == "No":
response = self.llm.generate(
f"Revise this answer to be consistent with the context:\n{response}"
)
else:
response = self.llm.generate(f"Answer: {question}")
return response
Corrective RAG (CRAG)¶
Adds a correction step when retrieved documents are insufficient:
- Retrieve documents for the query.
- Evaluate retrieval quality using a lightweight evaluator.
- If quality is high: Use retrieved docs directly.
- If quality is medium: Refine the query and retrieve again.
- If quality is low: Fall back to web search or other knowledge sources.
Graph RAG¶
Combines vector retrieval with knowledge graphs:
- Build a knowledge graph from documents (extract entities and relationships using an LLM).
- Community detection: Group related entities into communities.
- Community summaries: Generate summaries for each community.
- Retrieval: Combine vector search with graph traversal:
- Vector search finds relevant chunks.
- Graph traversal finds related entities and their connections.
- Community summaries provide high-level context.
Benefits: Better at answering questions that require connecting information across multiple documents (multi-hop reasoning).
Pseudocode (Graph RAG):
class GraphRAG:
vector_store: VectorDatabase
knowledge_graph: KnowledgeGraph // Entities + relationships
community_summaries: dict[str, str]
function query(question: str) -> str
// 1. Vector retrieval
vector_results = self.vector_store.search(embed(question), top_k=10)
// 2. Entity extraction from query
entities = extract_entities(question)
// 3. Graph traversal: find related entities and connecting paths
paths = self.knowledge_graph.find_paths(entities) // paths between the query entities
graph_context = []
for entity in entities:
neighbors = self.knowledge_graph.get_neighbors(entity, depth=2)
graph_context.append(format_graph_info(neighbors, paths))
// 4. Get relevant community summaries
communities = self.knowledge_graph.get_communities(entities)
summaries = [self.community_summaries[c] for c in communities]
// 5. Combine all context
context = combine(vector_results, graph_context, summaries)
return self.llm.generate(build_prompt(question, context))
Agentic RAG¶
An AI agent orchestrates the RAG process, making dynamic decisions:
- Which knowledge bases to search.
- Whether to perform additional retrieval based on initial results.
- How to combine information from multiple sources.
- When to ask for clarification vs. attempt an answer.
class AgenticRAG:
agent: ReActAgent
knowledge_bases: dict[str, VectorDatabase] // Multiple knowledge bases
function query(question: str) -> str
tools = [
Tool("search_docs", "Search documentation", self.search_docs),
Tool("search_code", "Search code repository", self.search_code),
Tool("search_web", "Search the web", self.search_web),
Tool("ask_clarification", "Ask user for clarification", self.ask_user),
]
return self.agent.execute(
goal=f"Answer this question accurately: {question}",
tools=tools,
max_iterations=5
)
RAG Pattern Comparison¶
| Pattern | Complexity | Quality | Latency | Best For |
|---|---|---|---|---|
| Naive RAG | Low | Medium | Low | Prototyping, simple use cases |
| Advanced RAG | Medium | High | Medium | Production use cases |
| Self-RAG | High | Very High | High | Accuracy-critical applications |
| CRAG | Medium | High | Medium | Variable document quality |
| Graph RAG | High | Very High | High | Multi-hop reasoning, complex domains |
| Agentic RAG | Very High | Highest | High | Complex, multi-source queries |
8. RAG Evaluation¶
Evaluating RAG systems requires assessing both retrieval quality and generation quality.
Retrieval Metrics¶
| Metric | Formula | What It Measures |
|---|---|---|
| Precision@K | Relevant retrieved / K | Fraction of top-K results that are relevant |
| Recall@K | Relevant retrieved / Total relevant | Fraction of all relevant docs found in top-K |
| MRR (Mean Reciprocal Rank) | 1/rank of first relevant result | How quickly the first relevant result appears |
| NDCG@K | Normalized DCG | Quality of ranking (accounts for position and relevance grade) |
| Hit Rate | Queries with ≥1 relevant result / Total queries | Percentage of queries with any relevant result |
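The first three metrics in the table are a few lines each. This sketch computes them for a single query, using hypothetical document IDs and a labeled relevant set:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the top-k results that are relevant.
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of all relevant documents found in the top-k.
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    # 1/rank of the first relevant result (MRR averages this over queries).
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d4", "d1", "d8", "d2"]   # ranked retriever output
relevant = {"d1", "d2", "d5"}          # ground-truth labels
print(precision_at_k(retrieved, relevant, k=4),
      recall_at_k(retrieved, relevant, k=4),
      reciprocal_rank(retrieved, relevant))
```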
Generation Metrics (RAGAS Framework)¶
RAGAS (Retrieval Augmented Generation Assessment) provides automated metrics:
| Metric | What It Measures | How |
|---|---|---|
| Faithfulness | Is the answer supported by the context? | LLM checks each claim against retrieved docs |
| Answer Relevance | Does the answer address the question? | Generate questions from the answer, compare to original |
| Context Relevance | Is the retrieved context relevant? | LLM rates each retrieved chunk's relevance |
| Context Recall | Does the context contain the ground truth? | Compare retrieved context against reference answer |
| Answer Correctness | Is the answer factually correct? | Compare against ground truth (when available) |
Evaluation Pipeline¶
class RAGEvaluator:
evaluator_llm: LanguageModel // Separate LLM for evaluation
function evaluate(
questions: list[str],
ground_truths: list[str],
rag_system: RAGSystem
) -> dict
results = {"faithfulness": [], "relevance": [], "correctness": []}
for question, truth in zip(questions, ground_truths):
response = rag_system.query(question)
// Faithfulness: is the answer grounded in retrieved context?
faithfulness = self.check_faithfulness(
response.answer, response.sources
)
results["faithfulness"].append(faithfulness)
// Relevance: does the answer address the question?
relevance = self.check_relevance(question, response.answer)
results["relevance"].append(relevance)
// Correctness: does the answer match ground truth?
correctness = self.check_correctness(response.answer, truth)
results["correctness"].append(correctness)
return {k: mean(v) for k, v in results.items()}
End-to-End Evaluation Best Practices¶
- Create a golden dataset: Manually curate question-answer pairs from your actual documents. At least 50-100 pairs for meaningful evaluation.
- Test retrieval and generation separately: Poor generation might be caused by poor retrieval. Isolate the issue.
- Use LLM-as-judge: Use a strong LLM (GPT-4, Claude) to evaluate response quality at scale. Calibrate against human judgments.
- Monitor in production: Track user feedback (thumbs up/down), implicit signals (follow-up questions, session abandonment), and explicit corrections.
- A/B test changes: Any change to chunking, embedding model, retrieval strategy, or prompt should be A/B tested.
9. RAG vs. Fine-tuning¶
| Aspect | RAG | Fine-tuning | RAG + Fine-tuning |
|---|---|---|---|
| Data Requirements | Documents (no labels) | Labeled examples | Both |
| Cost | Low (embedding + inference) | High (training) | High |
| Update Frequency | Real-time (add/remove docs) | Requires retraining | Mixed |
| Knowledge Type | Factual, domain-specific | Behavioral, stylistic | Both |
| Hallucination | Lower (grounded) | Varies | Lowest |
| Best For | Knowledge bases, Q&A, docs | Task behavior, tone, format | Mission-critical applications |
When to use RAG: You have a knowledge base that changes frequently, need verifiable/citable answers, or want to keep the base model general.
When to fine-tune: You need specific output format, tone, or behavior; the task is specialized (medical, legal); or you need the model to "know" domain-specific patterns.
Best approach: Often combined — fine-tune for task behavior and output format, use RAG for factual knowledge.
10. Production Best Practices¶
- Chunking matters most: Invest time in finding the right chunking strategy for your documents. Test multiple approaches and evaluate.
- Embedding model selection: Test 2-3 embedding models on your actual data before committing. Domain-specific models often outperform general ones.
- Hybrid search by default: Combine vector and keyword search. It's almost always better than either alone.
- Re-rank for precision: A cross-encoder re-ranker significantly improves precision with modest latency cost.
- Metadata is powerful: Rich metadata enables filtering that dramatically improves relevance (date ranges, document types, access levels).
- Monitor retrieval quality: Track precision@K, recall@K, and user feedback continuously. Retrieval quality degrades as your corpus grows and changes.
- Handle "I don't know": Instruct the LLM to say when it lacks sufficient context rather than guessing. Measure and optimize the rate of correct "I don't know" responses.
- Citation and attribution: Always include source references in generated answers. This builds trust and enables verification.
- Evaluate end-to-end: Don't just evaluate retrieval or generation in isolation. The system quality is what matters to users.
- Plan for scale: Choose a vector database that handles your expected growth. Consider sharding, replication, and index rebuild strategies.
- Cache wisely: Cache embedding computations, frequently-asked queries, and LLM responses. Semantic caching (similar queries → cached answer) can dramatically reduce costs.
- Security: Implement access controls on documents (not all users should see all content), sanitize retrieved content, and guard against prompt injection through retrieved text.