Retrieval-Augmented Generation (RAG)¶
Retrieval-Augmented Generation (RAG) is an architecture that enhances LLMs by retrieving relevant information from external knowledge bases and incorporating it into the generation process. RAG addresses fundamental LLM limitations: knowledge cutoff dates, hallucination, lack of access to private/domain-specific data, and the inability to cite sources.
Rather than relying solely on knowledge encoded in model weights during pre-training, RAG systems dynamically fetch relevant context at inference time, grounding model responses in verifiable, up-to-date information. This chapter covers the complete RAG pipeline—from document ingestion and embedding to retrieval strategies, advanced patterns, and evaluation.
1. Why RAG?¶
The Problem with Parametric Knowledge¶
LLMs store knowledge in their parameters (weights) during pre-training. This "parametric knowledge" has fundamental limitations:
- Knowledge Cutoff: Models only know what was in their training data. A model trained in 2024 doesn't know about events in 2025.
- Hallucination: When asked about topics outside their training data (or at the boundary of their knowledge), models confabulate plausible-sounding but incorrect answers.
- No Private Data: Models cannot access proprietary databases, internal documents, or user-specific information.
- No Verifiability: You cannot trace a parametric answer back to a source document.
- Expensive Updates: Updating parametric knowledge requires retraining or fine-tuning (costly and slow).
RAG as a Solution¶
RAG combines the strengths of retrieval systems (precise, up-to-date, verifiable) with generative models (fluent, reasoning-capable, flexible):
| Aspect | Pure LLM | Pure Retrieval | RAG |
|---|---|---|---|
| Fluency | Excellent | N/A (returns documents) | Excellent |
| Accuracy | Varies (hallucination risk) | High (if relevant doc exists) | High (grounded in sources) |
| Up-to-date | No (knowledge cutoff) | Yes (live index) | Yes |
| Private Data | No | Yes | Yes |
| Citable | No | Yes | Yes |
| Reasoning | Yes | No | Yes |
2. RAG Architecture Overview¶
Core Workflow¶
┌─────────────────────────────────────────────┐
│ INDEXING (Offline) │
│ │
│ Documents → Chunking → Embedding → VectorDB │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ QUERYING (Online) │
│ │
│ User Query → Embed → Retrieve → Augment │
│ → Generate Response │
└─────────────────────────────────────────────┘
Phase 1: Indexing (Offline)
1. Load documents from various sources (PDFs, web pages, databases, APIs).
2. Chunk documents into manageable pieces.
3. Embed each chunk into a dense vector using an embedding model.
4. Store vectors + metadata in a vector database.
Phase 2: Querying (Online)
1. Embed the user query using the same embedding model.
2. Retrieve the top-K most similar chunks from the vector database.
3. Augment the LLM prompt with retrieved context.
4. Generate a response grounded in the retrieved information.
Pseudocode (Complete RAG System)¶
class RAGSystem:
embedding_model: EmbeddingModel
vector_store: VectorDatabase
llm: LanguageModel
chunker: DocumentChunker
reranker: Reranker = None
// ---- INDEXING ----
function ingest(documents: list[Document])
for doc in documents:
// 1. Chunk the document
chunks = self.chunker.chunk(doc.text, metadata=doc.metadata)
// 2. Generate embeddings
embeddings = self.embedding_model.encode_batch([c.text for c in chunks])
// 3. Store in vector database
self.vector_store.upsert(
ids=[c.id for c in chunks],
vectors=embeddings,
metadata=[c.metadata for c in chunks],
texts=[c.text for c in chunks]
)
// ---- QUERYING ----
function query(user_query: str, top_k: int = 5) -> RAGResponse
// 1. Embed the query
query_vector = self.embedding_model.encode(user_query)
// 2. Retrieve relevant chunks
results = self.vector_store.search(query_vector, top_k=top_k * 3)
// 3. Optional: re-rank for precision
if self.reranker:
results = self.reranker.rerank(user_query, results, top_k=top_k)
else:
results = results[:top_k]
// 4. Build augmented prompt
context = self.format_context(results)
prompt = self.build_prompt(user_query, context)
// 5. Generate response
response = self.llm.generate(prompt)
return RAGResponse(
answer=response,
sources=results,
confidence=self.estimate_confidence(results)
)
function build_prompt(query: str, context: str) -> str
return f"""Use the following context to answer the question.
If the context doesn't contain the answer, say "I don't have enough information."
Always cite your sources.
Context:
{context}
Question: {query}
Answer:"""
3. Document Ingestion Pipeline¶
Document Loading¶
RAG systems must handle diverse document formats:
| Format | Challenges | Tools |
|---|---|---|
| PDF | Layout extraction, tables, images, scanned docs | PyPDF2, pdfplumber, Unstructured, Adobe Extract API |
| HTML/Web | Boilerplate removal, dynamic content | Beautiful Soup, Trafilatura, Playwright |
| Markdown | Structure preservation, code blocks | Standard parsers |
| Word/PowerPoint | Complex formatting, embedded objects | python-docx, python-pptx, Unstructured |
| Code | Syntax awareness, function boundaries | Tree-sitter, language-specific parsers |
| Databases | Schema understanding, query generation | SQLAlchemy, custom connectors |
| APIs | Rate limiting, pagination, auth | Custom connectors |
Chunking Strategies¶
Chunking is one of the most impactful decisions in RAG. Poor chunking leads to poor retrieval.
Fixed-Size Chunking¶
Split text into chunks of a fixed number of tokens/characters:
class FixedSizeChunker:
chunk_size: int = 512 // tokens
chunk_overlap: int = 50 // tokens overlap between chunks
function chunk(text: str) -> list[Chunk]
tokens = tokenize(text)
chunks = []
start = 0
while start < len(tokens):
end = min(start + self.chunk_size, len(tokens))
chunk_text = detokenize(tokens[start:end])
chunks.append(Chunk(text=chunk_text, start=start, end=end))
if end == len(tokens): break // avoid emitting a redundant overlapping tail chunk
start += self.chunk_size - self.chunk_overlap
return chunks
Pros: Simple, predictable chunk sizes, easy to implement. Cons: May split sentences, paragraphs, or semantic units mid-thought.
Guidelines:
- Chunk size: 256-1024 tokens. Smaller chunks = more precise retrieval but less context. Larger chunks = more context but more noise.
- Overlap: 10-20% of chunk size. Preserves context at boundaries.
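A runnable sketch of the fixed-size strategy. For self-containment it approximates tokens as whitespace-delimited words (a real pipeline would use the embedding model's tokenizer), and the input text is synthetic:

```python
# Fixed-size chunking with overlap; "tokens" are whitespace words here.
def fixed_size_chunks(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    tokens = text.split()
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(" ".join(tokens[start:end]))
        if end == len(tokens):
            break  # stop: advancing again would emit a duplicate tail chunk
        start += chunk_size - overlap
    return chunks

text = " ".join(f"w{i}" for i in range(20))
chunks = fixed_size_chunks(text, chunk_size=8, overlap=2)
print(chunks)  # 3 chunks; consecutive chunks share the 2 overlap words
```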
Semantic Chunking¶
Split at natural boundaries (sentences, paragraphs, sections):
class SemanticChunker:
max_chunk_size: int = 1024
min_chunk_size: int = 100
function chunk(text: str) -> list[Chunk]
// Split into sentences
sentences = split_sentences(text)
chunks = []
current_chunk = []
current_size = 0
for sentence in sentences:
sentence_size = count_tokens(sentence)
if current_size + sentence_size > self.max_chunk_size and current_size > self.min_chunk_size:
chunks.append(Chunk(text=" ".join(current_chunk)))
current_chunk = []
current_size = 0
current_chunk.append(sentence)
current_size += sentence_size
if current_chunk:
chunks.append(Chunk(text=" ".join(current_chunk)))
return chunks
Pros: Preserves semantic units, better retrieval quality. Cons: Variable chunk sizes, more complex implementation.
Recursive Chunking¶
Hierarchical splitting: try large splits first (sections), then fall back to smaller splits (paragraphs, sentences):
class RecursiveChunker:
separators: list[str] = ["\n\n", "\n", ". ", " "]
max_chunk_size: int = 512
function chunk(text: str, separator_idx: int = 0) -> list[Chunk]
if count_tokens(text) <= self.max_chunk_size:
return [Chunk(text=text)]
if separator_idx >= len(self.separators):
// No separator left: hard-split into successive max-size slices instead of dropping the tail
return [Chunk(text=text[i : i + self.max_chunk_size]) for i in range(0, len(text), self.max_chunk_size)]
separator = self.separators[separator_idx]
splits = text.split(separator)
chunks = []
current = ""
for split in splits:
if count_tokens(current + separator + split) > self.max_chunk_size:
if current:
chunks.extend(self.chunk(current, separator_idx + 1))
current = split
else:
current = current + separator + split if current else split
if current:
chunks.extend(self.chunk(current, separator_idx + 1))
return chunks
Pros: Respects document structure, handles varied document types. Cons: More complex, may produce very uneven chunk sizes.
Agentic / Smart Chunking¶
Use an LLM to determine optimal chunk boundaries:
- Pass the document to an LLM with instructions to identify natural topic boundaries.
- The LLM returns split points based on semantic understanding.
- More expensive but highest quality chunking.
Code-Aware Chunking¶
For code repositories, split by structural units:
- Functions/Methods: Each function as a separate chunk.
- Classes: Chunk by class with method-level granularity.
- Files: Small files as single chunks.
- AST-based: Use abstract syntax tree to find semantic boundaries.
Tools: Tree-sitter parsers, language-specific AST tools.
Chunking Comparison¶
| Strategy | Semantic Quality | Implementation | Cost | Best For |
|---|---|---|---|---|
| Fixed-Size | Low | Simple | Low | Homogeneous text |
| Semantic (Sentence) | Medium | Moderate | Low | Articles, documentation |
| Recursive | Medium-High | Moderate | Low | Mixed-format documents |
| Document Structure | High | Complex | Low | Well-structured documents (Markdown, HTML) |
| LLM-based | Highest | Moderate | High | High-value, complex documents |
| Code-Aware | High (for code) | Complex | Low | Source code repositories |
4. Embedding Models¶
Embeddings convert text into dense vector representations where semantic similarity is captured by geometric proximity (cosine similarity, dot product).
How Embedding Models Work¶
Embedding models are typically encoder-only transformers (BERT-family) trained with contrastive learning objectives:
- Contrastive Learning: Train the model so that semantically similar texts have similar embeddings and dissimilar texts have different embeddings.
- Training Data: Pairs or triplets of (query, positive_passage, negative_passage).
- Loss Function: InfoNCE loss, triplet loss, or multiple negatives ranking loss.
Popular Embedding Models¶
| Model | Dimensions | Max Tokens | MTEB Score | Provider | Notes |
|---|---|---|---|---|---|
| text-embedding-3-large | 3072 | 8191 | ~64.6 | OpenAI | Best commercial, Matryoshka support |
| text-embedding-3-small | 1536 | 8191 | ~62.3 | OpenAI | Good balance of cost/quality |
| voyage-3 | 1024 | 32000 | ~67.1 | Voyage AI | Best for code and technical text |
| BGE-large-en-v1.5 | 1024 | 512 | ~64.2 | BAAI | Best open-source (English) |
| E5-mistral-7b-instruct | 4096 | 32768 | ~66.6 | Microsoft | Instruction-tuned, very high quality |
| all-MiniLM-L6-v2 | 384 | 256 | ~56.3 | SentenceTransformers | Fast, lightweight |
| Cohere embed-v3 | 1024 | 512 | ~64.5 | Cohere | Compression support, multilingual |
| nomic-embed-text-v1.5 | 768 | 8192 | ~62.3 | Nomic | Open-source, long context |
Matryoshka Embeddings¶
A technique where embeddings are trained so that truncated prefixes (e.g., first 256 dims of a 1536-dim vector) are still useful embeddings:
- Full 3072 dims: Maximum quality for critical applications.
- 1536 dims: 95%+ quality, half the storage.
- 512 dims: 90%+ quality, 1/6 the storage.
- 256 dims: 85%+ quality, fast approximate search.
Enables adaptive quality/cost trade-offs without retraining.
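The mechanics are simple to sketch: keep the first k dimensions, then re-normalize before comparing. The vectors below are toy values, not real Matryoshka embeddings; with a Matryoshka-trained model, rankings under the truncated vectors closely track those under the full vectors.

```python
import math

def truncate_and_normalize(vec: list[float], k: int) -> list[float]:
    # Keep the first k dimensions and re-normalize to unit length.
    head = vec[:k]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))  # inputs assumed unit-norm

doc = truncate_and_normalize([0.5, 0.5, 0.1, 0.0], k=2)
query = truncate_and_normalize([0.4, 0.6, 0.0, 0.2], k=2)
print(cosine(doc, query))
```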
Embedding Best Practices¶
- Match the embedding model to your data: Use code-specialized embeddings for code, multilingual for international content.
- Instruction-tuned embeddings: Models like E5 support different instructions for queries vs. documents, improving retrieval.
- Normalize embeddings: For cosine similarity, ensure vectors are L2-normalized.
- Test on your data: MTEB scores are useful but don't substitute for evaluation on your specific domain.
- Consider dimensions: Higher dimensions capture more nuance but cost more storage and compute.
5. Vector Databases¶
Vector databases are specialized systems for storing and querying high-dimensional vectors efficiently. They are the backbone of RAG systems, enabling fast approximate nearest neighbor (ANN) search.
Popular Vector Databases¶
| Database | Architecture | Key Features | Best For |
|---|---|---|---|
| Pinecone | Managed cloud | Auto-scaling, metadata filtering, serverless | Production, ease of use |
| Weaviate | Open-source + cloud | GraphQL API, hybrid search, modules | Complex queries, graph relationships |
| Qdrant | Open-source + cloud | Rich filtering, payload storage, gRPC | High performance, complex filtering |
| Chroma | Open-source | Simple Python API, local-first | Development, prototyping |
| FAISS | Library (Meta) | GPU acceleration, multiple index types | Research, large-scale, in-memory |
| Milvus | Open-source | Distributed, cloud-native (Zilliz Cloud) | Enterprise scale |
| pgvector | PostgreSQL extension | SQL integration, ACID compliance | Existing PostgreSQL users |
| LanceDB | Open-source | Serverless, disk-based, multi-modal | Cost-effective, serverless |
Vector Index Types¶
The key to fast vector search is the index structure. Different index types trade off between speed, accuracy, and memory:
HNSW (Hierarchical Navigable Small World)¶
The most popular index type. Builds a multi-layer graph where:
- Bottom layer contains all vectors, connected to their nearest neighbors.
- Upper layers are progressively sparser, enabling fast navigation.
- Search starts at the top layer and navigates down to find nearest neighbors.
// HNSW Search (simplified)
function hnsw_search(query: Vector, top_k: int) -> list[Result]
// Start at top layer with entry point
current = entry_point
for layer in range(top_layer, 0, -1):
// Greedily navigate to nearest neighbor in this layer
current = greedy_search(query, current, layer)
// Wider beam search in the bottom layer from the current entry point
return beam_search(query, current, layer=0, top_k=top_k)
Parameters:
- M: Number of connections per node (higher = better recall, more memory). Typical: 16-64.
- ef_construction: Search width during index building (higher = better quality, slower build). Typical: 128-512.
- ef_search: Search width during querying (higher = better recall, slower query). Typical: 64-256.
Complexity: O(log n) search time, O(n × M) memory.
IVF (Inverted File Index)¶
Clusters vectors into buckets using k-means, then searches only the nearest clusters:
- Build: Run k-means to create nlist cluster centroids.
- Search: Find the nearest nprobe centroids, then search only the vectors within those clusters.
Parameters:
- nlist: Number of clusters (typical: sqrt(n) to 4×sqrt(n)).
- nprobe: Number of clusters to search (higher = better recall, slower).
Complexity: O(nprobe × n/nlist) per query.
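A minimal sketch of the IVF search path. To keep it self-contained, the centroids are hand-picked rather than learned with k-means, and the toy vectors are 2-dimensional:

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hand-picked centroids standing in for the k-means build step.
centroids = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
vectors = {"a": (0.5, 0.2), "b": (9.8, 10.1), "c": (0.1, 9.7), "d": (1.0, 0.9)}

# Build: assign each vector to its nearest centroid (the inverted lists).
lists: dict[int, list[str]] = {i: [] for i in range(len(centroids))}
for vid, vec in vectors.items():
    nearest = min(range(len(centroids)), key=lambda i: l2(vec, centroids[i]))
    lists[nearest].append(vid)

def ivf_search(query, nprobe=1, top_k=2):
    # Probe only the nprobe clusters whose centroids are nearest the query,
    # then scan just the vectors assigned to those clusters.
    probed = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))[:nprobe]
    candidates = [vid for i in probed for vid in lists[i]]
    return sorted(candidates, key=lambda vid: l2(query, vectors[vid]))[:top_k]

print(ivf_search((0.4, 0.4), nprobe=1))
```

Raising nprobe widens the search toward an exact scan, trading latency for recall.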
LSH (Locality Sensitive Hashing)¶
Hash vectors such that similar vectors hash to the same bucket:
- Uses random hyperplane projections.
- Fast but lower recall than HNSW/IVF.
- Good for very large-scale, approximate search.
Similarity Metrics¶
| Metric | Formula | Properties | When to Use |
|---|---|---|---|
| Cosine Similarity | cos(θ) = (A·B) / (\|\|A\|\| × \|\|B\|\|) | Scale-invariant, range [-1, 1] | Most common, especially for text embeddings |
| Dot Product | A·B | Captures both angle and magnitude | When magnitude matters (popularity, relevance scores) |
| Euclidean (L2) | \|\|A - B\|\| | Distance metric (lower = more similar) | When absolute position matters |
Note: For L2-normalized vectors (unit vectors), cosine similarity and dot product give identical rankings.
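That equivalence is easy to verify: for unit vectors, dot product and cosine similarity are the same number, so they produce the same ranking. A quick check with toy vectors:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query = normalize([1.0, 2.0, 3.0])
docs = [normalize(v) for v in ([3.0, 2.0, 1.0], [1.0, 2.0, 2.9], [-1.0, 0.0, 1.0])]

# Rank documents by each metric; the orderings are identical.
by_dot = sorted(range(len(docs)), key=lambda i: dot(query, docs[i]), reverse=True)
by_cos = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)
print(by_dot, by_cos)
```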
Metadata Filtering¶
Vector databases support filtering by metadata (dates, categories, sources) before or after vector search:
// Search for similar vectors with metadata filter
results = vector_db.search(
query_vector=embed("What is RAG?"),
top_k=10,
filter={
"source": {"$in": ["documentation", "tutorials"]},
"date": {"$gte": "2024-01-01"},
"language": "en"
}
)
- Pre-filtering: Apply the metadata filter first, then run vector search on the filtered set. Efficient when the filter is selective.
- Post-filtering: Run vector search first, then filter the results. Better recall but may return fewer than top_k results.
Hybrid Search¶
Combine vector (semantic) search with keyword (lexical) search for better coverage:
class HybridSearch:
vector_store: VectorDatabase
bm25_index: BM25Index
function search(query: str, top_k: int = 10, alpha: float = 0.7) -> list[Result]
query_vector = embed(query)
// Semantic search
vector_results = self.vector_store.search(query_vector, top_k=top_k * 3)
// Keyword search (BM25)
keyword_results = self.bm25_index.search(query, top_k=top_k * 3)
// Combine with Reciprocal Rank Fusion (RRF)
combined = reciprocal_rank_fusion(vector_results, keyword_results, k=60)
return combined[:top_k]
function reciprocal_rank_fusion(*result_lists: list[Result], k: int = 60) -> list[Result]
scores = {}
by_id = {}
for result_list in result_lists:
for rank, result in enumerate(result_list):
by_id[result.id] = result
scores[result.id] = scores.get(result.id, 0) + 1.0 / (k + rank + 1)
ranked_ids = sorted(scores, key=scores.get, reverse=True)
return [by_id[id] for id in ranked_ids]
Why hybrid?
- Vector search finds semantically similar content ("car" matches "automobile").
- Keyword search finds exact matches ("API-KEY-12345" won't have a semantic match).
- Combining both covers more cases than either alone.
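The RRF fusion step itself is small enough to show runnable. This sketch fuses two ranked lists of document IDs (toy values) using the same k = 60 convention as above:

```python
def reciprocal_rank_fusion(*ranked_lists: list[str], k: int = 60) -> list[str]:
    # Each document scores 1/(k + rank + 1) per list it appears in;
    # documents ranked highly in multiple lists accumulate the most.
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7"]   # semantic ranking
keyword_hits = ["d1", "d9", "d3"]  # BM25 ranking
fused = reciprocal_rank_fusion(vector_hits, keyword_hits)
print(fused)
```

Note how d1, ranked well by both retrievers, outranks d3, which tops only one list.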
Pseudocode (Vector Database Operations)¶
class VectorDatabase:
index: HNSWIndex
metadata_store: dict[str, dict]
text_store: dict[str, str]
function upsert(ids: list[str], vectors: list[Tensor], metadata: list[dict], texts: list[str])
for id, vector, meta, text in zip(ids, vectors, metadata, texts):
self.index.add(vector, id=id)
self.metadata_store[id] = meta
self.text_store[id] = text
function search(
query_vector: Tensor,
top_k: int = 10,
filter: dict = None
) -> list[SearchResult]
// Get candidates from vector index
if filter:
candidate_ids = self.apply_metadata_filter(filter)
scores, ids = self.index.search(query_vector, top_k, candidates=candidate_ids)
else:
scores, ids = self.index.search(query_vector, top_k)
results = []
for score, id in zip(scores, ids):
results.append(SearchResult(
id=id,
score=score,
text=self.text_store[id],
metadata=self.metadata_store[id]
))
return results
function delete(ids: list[str])
for id in ids:
self.index.remove(id)
del self.metadata_store[id]
del self.text_store[id]
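For prototyping or small corpora, the same interface can be backed by exact brute-force search instead of an ANN index. A minimal sketch with toy 2-dimensional vectors (a real store would hold embedding-model outputs and swap in HNSW/IVF for sub-linear search):

```python
import math

class MiniVectorStore:
    def __init__(self):
        self.vectors: dict[str, list[float]] = {}
        self.texts: dict[str, str] = {}

    def upsert(self, id: str, vector: list[float], text: str):
        # Store L2-normalized vectors so dot product == cosine similarity.
        norm = math.sqrt(sum(x * x for x in vector)) or 1.0
        self.vectors[id] = [x / norm for x in vector]
        self.texts[id] = text

    def search(self, query: list[float], top_k: int = 3) -> list[tuple[str, float]]:
        norm = math.sqrt(sum(x * x for x in query)) or 1.0
        q = [x / norm for x in query]
        scored = [
            (id, sum(a * b for a, b in zip(q, v)))  # exact scan over all vectors
            for id, v in self.vectors.items()
        ]
        return sorted(scored, key=lambda t: t[1], reverse=True)[:top_k]

store = MiniVectorStore()
store.upsert("a", [1.0, 0.0], "doc about apples")
store.upsert("b", [0.0, 1.0], "doc about bananas")
store.upsert("c", [0.9, 0.1], "another apple doc")
print(store.search([1.0, 0.2], top_k=2))
```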
6. Retrieval Strategies¶
Basic Similarity Search¶
The simplest retrieval: embed the query, find the top-K nearest vectors.
- Pros: Fast, simple, works well for straightforward queries.
- Cons: May miss relevant documents with different wording. Single query may not capture user intent.
Re-ranking¶
After initial retrieval, re-rank results using a more expensive but accurate model:
Cross-Encoder Re-ranking¶
A cross-encoder takes the (query, document) pair as input and outputs a relevance score. Unlike bi-encoders (used for initial retrieval), cross-encoders can model fine-grained interactions between query and document.
class CrossEncoderReranker:
model: CrossEncoder // e.g., ms-marco-MiniLM-L-12-v2
function rerank(query: str, results: list[SearchResult], top_k: int) -> list[SearchResult]
// Score each (query, document) pair
pairs = [(query, result.text) for result in results]
scores = self.model.predict(pairs)
// Sort by cross-encoder score
scored_results = zip(results, scores)
sorted_results = sorted(scored_results, key=lambda x: x[1], reverse=True)
return [result for result, score in sorted_results[:top_k]]
Why re-rank?: Bi-encoders (used for initial retrieval) encode query and document independently — they can't model token-level interactions. Cross-encoders see both together, enabling much higher precision. The trade-off is speed: cross-encoders are ~100× slower, so they're only applied to the top-N candidates.
LLM Re-ranking¶
Use an LLM to assess relevance:
prompt = f"""Rate the relevance of this document to the query on a scale of 1-10.
Query: {query}
Document: {document}
Relevance score (1-10):"""
More expensive but can capture nuanced relevance that embedding similarity misses.
Query Processing Techniques¶
Query Expansion¶
Generate query variations to improve recall:
class QueryExpander:
llm: LanguageModel
function expand(query: str, num_variations: int = 3) -> list[str]
prompt = f"""Generate {num_variations} alternative phrasings of this search query
that might match relevant documents. Include synonyms and related concepts.
Original query: {query}
Alternative queries:"""
variations = self.llm.generate(prompt)
return [query] + parse_variations(variations)
Query Rewriting¶
Transform the user query into a better search query:
function rewrite_query(user_query: str, chat_history: list[dict]) -> str
prompt = f"""Given the conversation history and the latest user question,
formulate a standalone search query that captures the full intent.
Chat history:
{format_history(chat_history)}
Latest question: {user_query}
Standalone search query:"""
return llm.generate(prompt)
This is critical for conversational RAG where the user's query references previous messages (e.g., "Tell me more about that" — "that" must be resolved using chat history).
HyDE (Hypothetical Document Embeddings)¶
Instead of embedding the query directly, generate a hypothetical answer and embed that:
- User asks: "What causes aurora borealis?"
- LLM generates a hypothetical answer (may be imperfect).
- Embed the hypothetical answer (which is in the same "space" as actual documents).
- Search with this embedding.
Why it works: Queries and documents have different linguistic structures. A hypothetical answer is more similar to actual documents than a short query is.
Multi-Query Retrieval¶
Generate multiple search queries, retrieve for each, then merge results:
function multi_query_retrieve(user_query: str, top_k: int) -> list[SearchResult]
// Generate multiple perspectives
queries = generate_query_variations(user_query) // e.g., 3-5 queries
// Retrieve for each
all_results = []
for query in queries:
results = vector_store.search(embed(query), top_k=top_k)
all_results.extend(results)
// Deduplicate and merge (RRF or score aggregation)
merged = reciprocal_rank_fusion(all_results)
return merged[:top_k]
Retrieval Strategy Comparison¶
| Strategy | Recall | Precision | Latency | Cost | Best For |
|---|---|---|---|---|---|
| Basic similarity | Medium | Medium | Low | Low | Simple queries |
| Hybrid (vector + BM25) | High | Medium | Low | Low | General use |
| + Re-ranking | High | High | Medium | Medium | Precision-critical |
| + Query expansion | Very High | Medium | Medium | Medium | Recall-critical |
| + HyDE | High | Medium-High | High | High | Short/vague queries |
| + Multi-query | Very High | High | High | High | Complex queries |
7. Advanced RAG Patterns¶
Naive RAG¶
The simplest implementation:
Query → Embed → Retrieve Top-K → Stuff into Prompt → Generate
Limitations: No query processing, no re-ranking, no handling of conflicting sources, no iterative retrieval.
Advanced RAG¶
Adds pre-retrieval and post-retrieval processing:
Query → Rewrite → Expand → Retrieve → Re-rank → Filter → Augment Prompt → Generate
Specific techniques:
- Parent-Child Chunking: Store small chunks for retrieval precision, but include the parent (larger) chunk in the LLM context for completeness.
- Sentence Window Retrieval: Retrieve at the sentence level for precision, but expand to include surrounding sentences for context.
- Metadata Filtering: Pre-filter by date, source, document type before vector search.
- Contextual Compression: Use an LLM to extract only the relevant portions from retrieved chunks, reducing noise.
Self-RAG (Self-Reflective RAG)¶
The model decides dynamically whether to retrieve, and evaluates the quality of retrieved information:
class SelfRAG:
llm: LanguageModel
retriever: Retriever
function query(question: str) -> str
// Step 1: Decide if retrieval is needed
needs_retrieval = self.llm.generate(
f"Do you need to search for information to answer: '{question}'? Yes/No"
)
if needs_retrieval == "Yes":
// Step 2: Retrieve
documents = self.retriever.search(question)
// Step 3: Evaluate relevance of each document
relevant_docs = []
for doc in documents:
is_relevant = self.llm.generate(
f"Is this document relevant to '{question}'?\nDocument: {doc}\nRelevant? Yes/No"
)
if is_relevant == "Yes":
relevant_docs.append(doc)
// Step 4: Generate with relevant documents
response = self.llm.generate(
f"Context: {relevant_docs}\nQuestion: {question}\nAnswer:"
)
// Step 5: Verify factual consistency
is_supported = self.llm.generate(
f"Is this answer supported by the context?\nContext: {relevant_docs}\nAnswer: {response}\nSupported? Yes/No"
)
if is_supported == "No":
response = self.llm.generate(
f"Revise this answer to be consistent with the context:\n{response}"
)
else:
response = self.llm.generate(f"Answer: {question}")
return response
Corrective RAG (CRAG)¶
Adds a correction step when retrieved documents are insufficient:
- Retrieve documents for the query.
- Evaluate retrieval quality using a lightweight evaluator.
- If quality is high: Use retrieved docs directly.
- If quality is medium: Refine the query and retrieve again.
- If quality is low: Fall back to web search or other knowledge sources.
Graph RAG¶
Combines vector retrieval with knowledge graphs:
- Build a knowledge graph from documents (extract entities and relationships using an LLM).
- Community detection: Group related entities into communities.
- Community summaries: Generate summaries for each community.
- Retrieval: Combine vector search with graph traversal:
- Vector search finds relevant chunks.
- Graph traversal finds related entities and their connections.
- Community summaries provide high-level context.
Benefits: Better at answering questions that require connecting information across multiple documents (multi-hop reasoning).
Pseudocode (Graph RAG):
class GraphRAG:
vector_store: VectorDatabase
knowledge_graph: KnowledgeGraph // Entities + relationships
community_summaries: dict[str, str]
function query(question: str) -> str
// 1. Vector retrieval
vector_results = self.vector_store.search(embed(question), top_k=10)
// 2. Entity extraction from query
entities = extract_entities(question)
// 3. Graph traversal: find related entities and connecting paths
paths = self.knowledge_graph.find_paths(entities) // paths between the query entities
graph_context = []
for entity in entities:
neighbors = self.knowledge_graph.get_neighbors(entity, depth=2)
graph_context.append(format_graph_info(neighbors, paths))
// 4. Get relevant community summaries
communities = self.knowledge_graph.get_communities(entities)
summaries = [self.community_summaries[c] for c in communities]
// 5. Combine all context
context = combine(vector_results, graph_context, summaries)
return self.llm.generate(build_prompt(question, context))
Agentic RAG¶
An AI agent orchestrates the RAG process, making dynamic decisions:
- Which knowledge bases to search.
- Whether to perform additional retrieval based on initial results.
- How to combine information from multiple sources.
- When to ask for clarification vs. attempt an answer.
class AgenticRAG:
agent: ReActAgent
knowledge_bases: dict[str, VectorDatabase] // Multiple knowledge bases
function query(question: str) -> str
tools = [
Tool("search_docs", "Search documentation", self.search_docs),
Tool("search_code", "Search code repository", self.search_code),
Tool("search_web", "Search the web", self.search_web),
Tool("ask_clarification", "Ask user for clarification", self.ask_user),
]
return self.agent.execute(
goal=f"Answer this question accurately: {question}",
tools=tools,
max_iterations=5
)
RAG Pattern Comparison¶
| Pattern | Complexity | Quality | Latency | Best For |
|---|---|---|---|---|
| Naive RAG | Low | Medium | Low | Prototyping, simple use cases |
| Advanced RAG | Medium | High | Medium | Production use cases |
| Self-RAG | High | Very High | High | Accuracy-critical applications |
| CRAG | Medium | High | Medium | Variable document quality |
| Graph RAG | High | Very High | High | Multi-hop reasoning, complex domains |
| Agentic RAG | Very High | Highest | High | Complex, multi-source queries |
8. RAG Evaluation¶
Evaluating RAG systems requires assessing both retrieval quality and generation quality.
Retrieval Metrics¶
| Metric | Formula | What It Measures |
|---|---|---|
| Precision@K | Relevant retrieved / K | Fraction of top-K results that are relevant |
| Recall@K | Relevant retrieved / Total relevant | Fraction of all relevant docs found in top-K |
| MRR (Mean Reciprocal Rank) | 1/rank of first relevant result | How quickly the first relevant result appears |
| NDCG@K | Normalized DCG | Quality of ranking (accounts for position and relevance grade) |
| Hit Rate | Queries with ≥1 relevant result / Total queries | Percentage of queries with any relevant result |
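The first three metrics in the table are a few lines each. This sketch computes them for a single query, using hypothetical document IDs and a labeled relevant set:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the top-k results that are relevant.
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of all relevant documents found in the top-k.
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    # 1/rank of the first relevant result (MRR averages this over queries).
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d4", "d1", "d8", "d2"]   # ranked retriever output
relevant = {"d1", "d2", "d5"}          # ground-truth labels
print(precision_at_k(retrieved, relevant, k=4),
      recall_at_k(retrieved, relevant, k=4),
      reciprocal_rank(retrieved, relevant))
```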
Generation Metrics (RAGAS Framework)¶
RAGAS (Retrieval Augmented Generation Assessment) provides automated metrics:
| Metric | What It Measures | How |
|---|---|---|
| Faithfulness | Is the answer supported by the context? | LLM checks each claim against retrieved docs |
| Answer Relevance | Does the answer address the question? | Generate questions from the answer, compare to original |
| Context Relevance | Is the retrieved context relevant? | LLM rates each retrieved chunk's relevance |
| Context Recall | Does the context contain the ground truth? | Compare retrieved context against reference answer |
| Answer Correctness | Is the answer factually correct? | Compare against ground truth (when available) |
Evaluation Pipeline¶
class RAGEvaluator:
evaluator_llm: LanguageModel // Separate LLM for evaluation
function evaluate(
questions: list[str],
ground_truths: list[str],
rag_system: RAGSystem
) -> dict
results = {"faithfulness": [], "relevance": [], "correctness": []}
for question, truth in zip(questions, ground_truths):
response = rag_system.query(question)
// Faithfulness: is the answer grounded in retrieved context?
faithfulness = self.check_faithfulness(
response.answer, response.sources
)
results["faithfulness"].append(faithfulness)
// Relevance: does the answer address the question?
relevance = self.check_relevance(question, response.answer)
results["relevance"].append(relevance)
// Correctness: does the answer match ground truth?
correctness = self.check_correctness(response.answer, truth)
results["correctness"].append(correctness)
return {k: mean(v) for k, v in results.items()}
End-to-End Evaluation Best Practices¶
- Create a golden dataset: Manually curate question-answer pairs from your actual documents. At least 50-100 pairs for meaningful evaluation.
- Test retrieval and generation separately: Poor generation might be caused by poor retrieval. Isolate the issue.
- Use LLM-as-judge: Use a strong LLM (GPT-4, Claude) to evaluate response quality at scale. Calibrate against human judgments.
- Monitor in production: Track user feedback (thumbs up/down), implicit signals (follow-up questions, session abandonment), and explicit corrections.
- A/B test changes: Any change to chunking, embedding model, retrieval strategy, or prompt should be A/B tested.
9. RAG vs. Fine-tuning¶
| Aspect | RAG | Fine-tuning | RAG + Fine-tuning |
|---|---|---|---|
| Data Requirements | Documents (no labels) | Labeled examples | Both |
| Cost | Low (embedding + inference) | High (training) | High |
| Update Frequency | Real-time (add/remove docs) | Requires retraining | Mixed |
| Knowledge Type | Factual, domain-specific | Behavioral, stylistic | Both |
| Hallucination | Lower (grounded) | Varies | Lowest |
| Best For | Knowledge bases, Q&A, docs | Task behavior, tone, format | Mission-critical applications |
When to use RAG: You have a knowledge base that changes frequently, need verifiable/citable answers, or want to keep the base model general.
When to fine-tune: You need specific output format, tone, or behavior; the task is specialized (medical, legal); or you need the model to "know" domain-specific patterns.
Best approach: Often combined — fine-tune for task behavior and output format, use RAG for factual knowledge.
10. Production Best Practices¶
- Chunking matters most: Invest time in finding the right chunking strategy for your documents. Test multiple approaches and evaluate.
- Embedding model selection: Test 2-3 embedding models on your actual data before committing. Domain-specific models often outperform general ones.
- Hybrid search by default: Combine vector and keyword search. It's almost always better than either alone.
- Re-rank for precision: A cross-encoder re-ranker significantly improves precision with modest latency cost.
- Metadata is powerful: Rich metadata enables filtering that dramatically improves relevance (date ranges, document types, access levels).
- Monitor retrieval quality: Track precision@K, recall@K, and user feedback continuously. Retrieval quality degrades as your corpus grows and changes.
- Handle "I don't know": Instruct the LLM to say when it lacks sufficient context rather than guessing. Measure and optimize the rate of correct "I don't know" responses.
- Citation and attribution: Always include source references in generated answers. This builds trust and enables verification.
- Evaluate end-to-end: Don't just evaluate retrieval or generation in isolation. The system quality is what matters to users.
- Plan for scale: Choose a vector database that handles your expected growth. Consider sharding, replication, and index rebuild strategies.
- Cache wisely: Cache embedding computations, frequently-asked queries, and LLM responses. Semantic caching (similar queries → cached answer) can dramatically reduce costs.
- Security: Implement access controls on documents (not all users should see all content), sanitize retrieved content, and guard against prompt injection through retrieved text.