AI Agents

An AI Agent is an autonomous system that perceives its environment, reasons about its goals, makes decisions, and takes actions to achieve those goals. Unlike simple LLMs that generate text in response to a prompt, agents can interact with tools, APIs, databases, file systems, and other systems—enabling complex, multi-step workflows like researching topics, writing and executing code, managing infrastructure, or orchestrating business processes.

The emergence of capable LLMs as reasoning engines has transformed AI agents from academic curiosities (simple rule-based bots) into practical production systems. This chapter covers agent architectures, tool use, the Model Context Protocol (MCP), memory systems, planning strategies, multi-agent collaboration, and production best practices.


1. Agent Fundamentals

What Makes an Agent?

An agent is more than just an LLM. The key distinction is the action loop — an agent iteratively reasons and acts until a goal is achieved:

Component | LLM (Chat) | AI Agent
----------|------------|---------
Input | Single prompt | Goal + environment state
Processing | Single forward pass | Multiple reasoning + action cycles
Output | Text response | Actions + observations + final answer
Tools | None | Functions, APIs, code execution
Memory | Context window only | Short-term + long-term memory
Autonomy | Responds to each prompt | Operates independently until goal met

Agent Characteristics

  • Autonomy: Operates independently with minimal human intervention. Can decide what steps to take, which tools to use, and when to ask for help.
  • Reactivity: Responds to changes in the environment (new data, tool failures, user corrections).
  • Proactiveness: Takes initiative to achieve goals, anticipating what information or actions are needed.
  • Tool Use: Interacts with external systems through well-defined interfaces.
  • Memory: Maintains context across interactions, learns from past experiences.
  • Reflection: Evaluates its own performance and adjusts strategies.

The Agent Loop

Every agent follows a variation of this fundamental loop:

def agent_loop(goal: str, max_steps: int, memory, environment) -> str:
    state = initialize_state(goal)

    for step in range(max_steps):
        # 1. PERCEIVE: Gather current context
        context = gather_context(state, memory, environment)

        # 2. REASON: Decide what to do next
        thought, action = llm.reason(context)

        # 3. ACT: Execute the chosen action
        if action.type == "final_answer":
            return action.content

        observation = execute_action(action)

        # 4. OBSERVE: Process the result
        state.add(thought, action, observation)

        # 5. REFLECT: Evaluate progress (optional)
        if should_reflect(state):
            reflection = llm.reflect(state)
            state.add_reflection(reflection)

    return "Goal not achieved within step limit."

2. Agent Architectures

ReAct (Reasoning + Acting)

The most widely used agent pattern, ReAct interleaves reasoning traces with tool actions:

Thought: I need to find the current weather in Paris.
Action: search_web("current weather in Paris")
Observation: Temperature: 15°C, Partly cloudy, Humidity: 65%
Thought: I now have the weather information. Let me format a response.
Action: final_answer("The current weather in Paris is 15°C and partly cloudy with 65% humidity.")

Pseudocode (ReAct Agent):

class ReActAgent:
    llm: LanguageModel
    tools: dict[str, Tool]
    system_prompt: str

    def execute(self, goal: str, max_iterations: int = 10) -> str:
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": goal}
        ]

        for iteration in range(max_iterations):
            # LLM generates thought + action (or final answer)
            response = self.llm.generate(messages, tools=self.tools)

            if response.has_tool_call:
                # Execute the tool
                tool_name = response.tool_call.name
                tool_args = response.tool_call.arguments
                tool = self.tools[tool_name]

                try:
                    observation = tool.execute(**tool_args)
                except Exception as e:
                    observation = f"Error: {str(e)}"

                # Add action + observation to conversation
                messages.append({"role": "assistant", "content": response.text, "tool_call": response.tool_call})
                messages.append({"role": "tool", "content": observation, "tool_call_id": response.tool_call.id})
            else:
                # Final answer
                return response.text

        return "Maximum iterations reached."

Plan-and-Execute

Separates planning from execution. First create a plan, then execute each step:

class PlanAndExecuteAgent:
    planner_llm: LanguageModel    # Creates the plan
    executor_llm: LanguageModel   # Executes each step
    tools: dict[str, Tool]

    def execute(self, goal: str) -> str:
        # Step 1: Create a plan
        plan = self.planner_llm.generate(
            f"Create a step-by-step plan to achieve: {goal}\n"
            f"Available tools: {list(self.tools.keys())}\n"
            "Return a numbered list of steps."
        )
        steps = parse_plan(plan)

        # Step 2: Execute each step
        results = []
        for i, step in enumerate(steps):
            context = f"Goal: {goal}\nPlan: {plan}\n"
            context += f"Completed: {results}\n"
            context += f"Current step ({i+1}/{len(steps)}): {step}"

            result = self.execute_step(step, context)
            results.append(result)

            # Re-plan if needed (step failed or new information)
            if should_replan(result, steps[i+1:]):
                remaining_steps = self.replan(goal, results, steps[i+1:])
                steps = steps[:i+1] + remaining_steps

        # Step 3: Synthesize final answer
        return self.synthesize(goal, results)

Benefits: Better for complex, multi-step tasks. The plan provides a roadmap that can be adjusted. Drawbacks: Planning overhead, plan may become stale as new information emerges.

Reflexion

The agent reflects on past actions and failures to improve future performance:

class ReflexionAgent:
    llm: LanguageModel
    tools: dict[str, Tool]
    reflection_memory: list[str]  # Past reflections

    def execute(self, goal: str, max_trials: int = 3) -> str:
        for trial in range(max_trials):
            # Include past reflections in context
            context = f"Goal: {goal}\n"
            if self.reflection_memory:
                context += "Lessons from past attempts:\n"
                for r in self.reflection_memory:
                    context += f"- {r}\n"

            # Execute with ReAct
            result = self.react_execute(goal, context)

            # Evaluate result
            if self.evaluate(goal, result):
                return result

            # Reflect on failure
            reflection = self.llm.generate(
                f"Your attempt to achieve '{goal}' produced: {result}\n"
                f"This was not satisfactory. Reflect on what went wrong "
                f"and what you should do differently next time."
            )
            self.reflection_memory.append(reflection)

        return "Failed after maximum trials."

Architecture Comparison

Architecture | Planning | Reflection | Best For | Complexity
-------------|----------|------------|----------|-----------
ReAct | Implicit (step-by-step) | None | Simple to moderate tasks | Low
Plan-and-Execute | Explicit upfront plan | Re-planning | Complex multi-step tasks | Medium
Reflexion | Implicit + learned | Explicit reflection loop | Tasks requiring iteration | Medium
Tree of Thoughts | Explores multiple paths | Evaluates branches | Problems with multiple strategies | High
LATS (Language Agent Tree Search) | Monte Carlo tree search | Value function | Optimal action sequences | Very high

3. Tools and Function Calling

Agents interact with the world through tools — functions that can be invoked to perform actions or retrieve information.

Function Calling

Modern LLMs support structured function calling, where the model outputs a structured JSON object specifying which function to call and with what arguments.

Tool Definition (OpenAI/Anthropic format):

{
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "search_web",
                "description": "Search the web for current information on any topic. Use this when you need up-to-date information that might not be in your training data.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {
                            "type": "string",
                            "description": "The search query. Be specific and include relevant keywords."
                        },
                        "max_results": {
                            "type": "integer",
                            "description": "Maximum number of results to return",
                            "default": 5
                        }
                    },
                    "required": ["query"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "execute_python",
                "description": "Execute Python code in a sandboxed environment. Use for calculations, data processing, or generating visualizations.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "code": {
                            "type": "string",
                            "description": "Python code to execute"
                        }
                    },
                    "required": ["code"]
                }
            }
        }
    ]
}
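
To make the round-trip concrete, here is a hedged sketch using the OpenAI Python client (the model name is a placeholder and search_web is a stub you would implement; other providers follow the same pattern with slightly different message shapes):

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def search_web(query: str, max_results: int = 5) -> str:
    ...  # stub: call your search backend here

messages = [{"role": "user", "content": "What's the weather in Paris?"}]
response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)  # `tools` = the JSON above
msg = response.choices[0].message

if msg.tool_calls:
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
    result = search_web(**args)
    # Feed the observation back so the model can produce a grounded answer
    messages.append(msg)
    messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    print(final.choices[0].message.content)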

Common Tool Categories

Category | Examples | Use Cases
---------|----------|----------
Web Search | Tavily, Serper, Brave Search API | Current information, fact-checking
Code Execution | Python interpreter, Jupyter, sandboxed environments | Calculations, data processing, testing
File Operations | Read/write files, directory listing | Document processing, code modification
Database | SQL queries, vector search, graph queries | Data retrieval, knowledge base access
API Integration | REST/GraphQL calls, webhooks | External service interaction
Browser | Playwright, Puppeteer, browser automation | Web scraping, form filling, testing
Communication | Email, Slack, Teams APIs | Notifications, collaboration
Computation | Calculator, unit conversion, statistics | Mathematical operations

Tool Execution Safety

class SafeToolExecutor:
    allowed_tools: set[str]
    confirmation_required: set[str]  # Tools that need human approval
    rate_limiter: RateLimiter
    sandbox: Sandbox

    def execute(self, tool_call: ToolCall) -> str:
        # 1. Validate tool exists and is allowed
        if tool_call.name not in self.allowed_tools:
            return f"Error: Tool '{tool_call.name}' is not available."

        # 2. Rate limiting
        if not self.rate_limiter.allow(tool_call.name):
            return "Error: Rate limit exceeded. Try again later."

        # 3. Validate parameters
        if not validate_params(tool_call.name, tool_call.arguments):
            return f"Error: Invalid parameters for {tool_call.name}."

        # 4. Human confirmation for sensitive tools
        if tool_call.name in self.confirmation_required:
            approved = request_human_approval(tool_call)
            if not approved:
                return "Action cancelled by user."

        # 5. Execute in sandbox
        try:
            result = self.sandbox.execute(tool_call.name, tool_call.arguments)
            return truncate_if_needed(result, max_length=10000)
        except TimeoutError:
            return "Error: Tool execution timed out."
        except Exception as e:
            return f"Error: {str(e)}"

Tool Design Best Practices

  1. Clear descriptions: Tool descriptions are part of the LLM prompt. Write them as if explaining to a colleague. Include examples of when to use the tool.
  2. Well-defined schemas: Use precise JSON Schema with descriptions for each parameter. Constrain types and provide defaults.
  3. Error messages: Return clear, actionable error messages so the agent can recover.
  4. Idempotency: Where possible, make tools idempotent (safe to retry).
  5. Output format: Return structured data that's easy for the LLM to parse. Truncate large outputs.
  6. Granularity: Prefer specific, focused tools over monolithic ones. search_documentation is better than do_anything.
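
Practices 1 and 2 can be partially automated by deriving the schema from a typed function. A minimal sketch (the type-to-schema mapping is deliberately simplistic and covers only scalar types):

import inspect

def to_tool_schema(fn) -> dict:
    """Derive an OpenAI-style tool definition from a typed Python function."""
    type_map = {str: "string", int: "integer", float: "number", bool: "boolean"}
    sig = inspect.signature(fn)
    properties, required = {}, []
    for name, param in sig.parameters.items():
        properties[name] = {"type": type_map.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
        else:
            properties[name]["default"] = param.default
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": inspect.getdoc(fn) or "",
            "parameters": {"type": "object", "properties": properties, "required": required},
        },
    }

This keeps the description (the docstring) and the schema next to the code they describe, so the two are less likely to drift out of sync.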

4. Model Context Protocol (MCP)

Model Context Protocol (MCP) is an open standard (introduced by Anthropic in 2024) that standardizes how AI applications interact with external tools, data sources, and systems. MCP provides a universal interface that replaces the need for custom integrations with each tool or data source.

Why MCP?

Without MCP, every AI application must build custom integrations:

Before MCP:
  App A → Custom integration → Tool 1
  App A → Custom integration → Tool 2
  App B → Different integration → Tool 1
  App B → Different integration → Tool 2
  (N apps × M tools = N×M integrations)

With MCP:
  App A → MCP Client → MCP Server → Tool 1
  App B → MCP Client → MCP Server → Tool 2
  (N apps × 1 protocol + M servers = N + M integrations)

MCP Architecture

MCP uses a client-server model with three roles:

  • MCP Host: The AI application (e.g., Claude Desktop, Cursor, custom app) that serves as the user-facing interface.
  • MCP Client: Resides within the host, manages connections to MCP servers, translates between the host's needs and the MCP protocol.
  • MCP Server: A lightweight program that exposes capabilities through the standard MCP interface. Each server provides access to specific tools or data sources.

MCP Capabilities

MCP servers can expose three types of capabilities:

1. Resources

Read-only data sources that the client can access:

{
    "resources": [
        {
            "uri": "file:///path/to/codebase",
            "name": "Project Source Code",
            "description": "The current project's source code repository",
            "mimeType": "text/plain"
        },
        {
            "uri": "postgres://localhost/mydb/users",
            "name": "Users Table",
            "description": "Database table containing user information"
        }
    ]
}

Resources support:

  • Static resources: Known URIs listed upfront.
  • Dynamic resources: Resources discovered via templates (e.g., file:///{path}).
  • Subscriptions: Clients can subscribe to resource changes.

2. Tools

Functions that the server can execute:

{
    "tools": [
        {
            "name": "query_database",
            "description": "Execute a read-only SQL query against the database",
            "inputSchema": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "SQL SELECT query to execute"
                    }
                },
                "required": ["query"]
            }
        }
    ]
}

3. Prompts

Pre-defined prompt templates that guide LLM interactions:

{
    "prompts": [
        {
            "name": "code_review",
            "description": "Review code for bugs and improvements",
            "arguments": [
                {
                    "name": "code",
                    "description": "The code to review",
                    "required": true
                },
                {
                    "name": "language",
                    "description": "Programming language",
                    "required": false
                }
            ]
        }
    ]
}

MCP Transport

MCP supports two transport mechanisms:

Transport | Protocol | Use Case
----------|----------|---------
stdio | Standard input/output | Local servers, same machine
HTTP + SSE | HTTP for requests, Server-Sent Events for notifications | Remote servers, network access
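
On the wire, MCP messages are JSON-RPC 2.0 over either transport. A sketch of the exchange for invoking a tool, shown here as Python dicts with illustrative values:

# Client → server: invoke a tool
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "query_database", "arguments": {"query": "SELECT count(*) FROM users"}},
}

# Server → client: the matching result
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"content": [{"type": "text", "text": "42"}]},
}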

MCP Server Example

A simple MCP server that provides file system access:

class FileSystemMCPServer:
    name: str = "filesystem"
    version: str = "1.0.0"
    root_path: str

    # List available tools
    def list_tools(self) -> list[Tool]:
        return [
            Tool(
                name="read_file",
                description="Read the contents of a file",
                input_schema={"path": {"type": "string", "description": "File path relative to root"}}
            ),
            Tool(
                name="write_file",
                description="Write content to a file",
                input_schema={
                    "path": {"type": "string"},
                    "content": {"type": "string"}
                }
            ),
            Tool(
                name="list_directory",
                description="List files in a directory",
                input_schema={"path": {"type": "string"}}
            )
        ]

    # List available resources
    def list_resources(self) -> list[Resource]:
        return [
            Resource(
                uri=f"file://{self.root_path}",
                name="Project Root",
                description="Root directory of the project"
            )
        ]

    # Execute a tool
    def call_tool(self, name: str, arguments: dict) -> str:
        match name:
            case "read_file":
                path = resolve_path(self.root_path, arguments["path"])
                validate_path_within_root(path, self.root_path)  # Security!
                return read_file(path)
            case "write_file":
                path = resolve_path(self.root_path, arguments["path"])
                validate_path_within_root(path, self.root_path)
                write_file(path, arguments["content"])
                return "File written successfully."
            case "list_directory":
                path = resolve_path(self.root_path, arguments["path"])
                return list_files(path)
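
For comparison, the same server written against the official MCP Python SDK, which derives tool schemas from type hints and docstrings. This is a sketch using the SDK's FastMCP helper; exact details may vary across SDK versions:

from pathlib import Path
from mcp.server.fastmcp import FastMCP

ROOT = Path("/path/to/project").resolve()
mcp = FastMCP("filesystem")

def safe_path(path: str) -> Path:
    resolved = (ROOT / path).resolve()
    if not resolved.is_relative_to(ROOT):  # security: reject path traversal
        raise ValueError("Path escapes project root")
    return resolved

@mcp.tool()
def read_file(path: str) -> str:
    """Read the contents of a file relative to the project root."""
    return safe_path(path).read_text()

@mcp.tool()
def list_directory(path: str = ".") -> list[str]:
    """List files in a directory relative to the project root."""
    return [p.name for p in safe_path(path).iterdir()]

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default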

MCP vs. Traditional Function Calling

Aspect | Traditional Function Calling | MCP
-------|------------------------------|----
Scope | Single application | Cross-application, standardized
Discovery | Hardcoded tool definitions | Dynamic server/tool discovery
Security | App-specific implementation | Protocol-level security and permissions
Extensibility | Requires code changes | Add new servers without app changes
Ecosystem | Fragmented, vendor-specific | Open standard, growing ecosystem
State Management | Custom per tool | Standardized resource subscriptions

MCP Ecosystem

Growing ecosystem of MCP servers:

  • File systems: Local and remote file access.
  • Databases: PostgreSQL, SQLite, MongoDB connectors.
  • APIs: GitHub, Slack, Jira, Google Drive.
  • Search: Web search, documentation search.
  • Development: Git operations, code analysis, testing.
  • Knowledge: Wikipedia, Arxiv, documentation sites.

5. Memory Systems

Agents need memory to maintain context across interactions, learn from experience, and handle tasks that span multiple sessions.

Memory Types

Short-term Memory (Working Memory)

Information available during the current task:

  • Conversation History: Recent messages between user and agent.
  • Scratchpad: Intermediate results, thoughts, observations from current task.
  • Context Window: Limited by the LLM's context window (2K-200K tokens).

Challenge: Context windows are finite. As conversations grow, older information must be compressed or dropped.

Long-term Memory

Persistent information across sessions:

  • Episodic Memory: Records of past interactions and their outcomes. "Last time the user asked about X, they wanted Y format."
  • Semantic Memory: Factual knowledge and learned patterns. Stored in vector databases for retrieval.
  • Procedural Memory: Learned workflows and tool usage patterns. "To deploy code, first run tests, then create a PR, then merge."

Memory Implementation Patterns

class AgentMemory:
    # Short-term
    conversation: list[Message]
    scratchpad: list[str]

    # Long-term
    episodic_store: VectorDatabase       # Past interactions
    semantic_store: VectorDatabase       # Facts and knowledge
    entity_store: dict[str, EntityInfo]  # Tracked entities

    def add_interaction(self, user_msg: str, agent_response: str, outcome: str):
        # Add to conversation (short-term)
        self.conversation.append(Message("user", user_msg))
        self.conversation.append(Message("assistant", agent_response))

        # Store in episodic memory (long-term)
        episode = f"User asked: {user_msg}\nAgent did: {agent_response}\nOutcome: {outcome}"
        embedding = embed(episode)
        self.episodic_store.add(embedding, metadata={"timestamp": now(), "outcome": outcome})

        # Extract and update entities
        entities = extract_entities(user_msg + agent_response)
        for entity in entities:
            self.entity_store[entity.name] = entity

    def get_relevant_context(self, current_query: str, max_tokens: int) -> str:
        context_parts = []
        remaining_tokens = max_tokens

        # 1. Recent conversation (highest priority)
        recent = self.conversation[-10:]
        recent_text = format_messages(recent)
        context_parts.append(("Recent conversation", recent_text))
        remaining_tokens -= count_tokens(recent_text)

        # 2. Relevant past episodes
        query_embedding = embed(current_query)
        episodes = self.episodic_store.search(query_embedding, top_k=5)
        episode_text = format_episodes(episodes)
        if count_tokens(episode_text) < remaining_tokens * 0.3:
            context_parts.append(("Relevant past interactions", episode_text))
            remaining_tokens -= count_tokens(episode_text)

        # 3. Relevant entities
        mentioned_entities = extract_entities(current_query)
        entity_text = format_entities(mentioned_entities, self.entity_store)
        context_parts.append(("Known entities", entity_text))

        return compile_context(context_parts)

    def summarize_if_needed(self):
        # Compress old conversation history
        if count_tokens(self.conversation) > MAX_CONVERSATION_TOKENS:
            old_messages = self.conversation[:-10]
            summary = llm.generate(f"Summarize this conversation:\n{format_messages(old_messages)}")
            self.conversation = [Message("system", f"Previous conversation summary: {summary}")] + self.conversation[-10:]

Memory Management Strategies

Strategy | Approach | Pros | Cons
---------|----------|------|-----
Full History | Keep all messages in context | Complete context | Exceeds token limits quickly
Sliding Window | Keep last N messages | Simple, predictable | Loses old context
Summary + Recent | Summarize old, keep recent detailed | Balance of context and recency | Summary may lose details
RAG Memory | Embed and retrieve relevant history | Scales indefinitely | Retrieval may miss relevant context
Entity Memory | Track key entities and their states | Efficient, structured | Limited to entity-centric info
Hybrid | Combine multiple strategies | Best overall | Most complex

6. Planning and Reasoning

Chain-of-Thought (CoT)

Encourage the model to reason step-by-step before acting:

System: Think through problems step by step before taking action.

User: How many R's are in "strawberry"?

Agent:
Thought: Let me count the R's in "strawberry" letter by letter.
s-t-r-a-w-b-e-r-r-y
Position 3: r
Position 8: r
Position 9: r
There are 3 R's in "strawberry".

Tree of Thoughts (ToT)

Explore multiple reasoning paths and evaluate which is most promising:

class TreeOfThoughts:
    llm: LanguageModel
    evaluator: LanguageModel
    branching_factor: int = 3
    max_depth: int = 4

    def solve(self, problem: str) -> str:
        root = ThoughtNode(state=problem)
        best_solution = self.search(root, depth=0)
        return best_solution

    def search(self, node: ThoughtNode, depth: int) -> str:
        if depth >= self.max_depth:
            return node.state

        # Generate multiple next thoughts
        candidates = []
        for _ in range(self.branching_factor):
            next_thought = self.llm.generate(
                f"Given the current state:\n{node.state}\n"
                "What is a promising next step?"
            )
            candidates.append(ThoughtNode(state=node.state + "\n" + next_thought))

        # Evaluate candidates (assumes the evaluator replies with a bare number)
        scores = []
        for candidate in candidates:
            score = self.evaluator.generate(
                f"Rate the promise of this reasoning path (1-10):\n{candidate.state}"
            )
            scores.append(float(score))

        # Pursue the best candidate (greedy descent)
        best_idx = argmax(scores)
        return self.search(candidates[best_idx], depth + 1)

Task Decomposition

Break complex goals into manageable sub-tasks:

def decompose_task(goal: str) -> list[SubTask]:
    prompt = f"""Break down this goal into a sequence of concrete, actionable sub-tasks.
Each sub-task should be achievable with available tools.

Goal: {goal}

Sub-tasks (numbered list):"""

    plan = llm.generate(prompt)
    return parse_subtasks(plan)

Reasoning Patterns Comparison

Pattern | Description | Strengths | When to Use
--------|-------------|-----------|------------
Direct | Single LLM call | Fast, cheap | Simple tasks
CoT | Step-by-step reasoning | Better accuracy | Math, logic, analysis
ReAct | Reasoning + tool use | Grounded in real data | Tasks needing external info
ToT | Explore multiple paths | Finds optimal solutions | Creative problem solving
Reflexion | Learn from failures | Improves over trials | Complex, error-prone tasks
Plan-Execute | Upfront planning | Structured approach | Multi-step workflows

7. Multi-Agent Systems

Why Multiple Agents?

Complex tasks often benefit from specialization. Multiple agents can:

  • Divide labor: Each agent specializes in a domain (coding, research, review).
  • Parallel execution: Independent sub-tasks run simultaneously.
  • Debate and verify: Agents check each other's work.
  • Scale: Add agents as complexity grows.

Multi-Agent Patterns

Hierarchical (Manager-Worker)

A manager agent delegates tasks to specialized worker agents:

class ManagerAgent:
    workers: dict[str, Agent]  # role → agent
    llm: LanguageModel

    def execute(self, goal: str) -> str:
        # Plan and delegate
        plan = self.llm.generate(
            f"Break this goal into tasks and assign to available workers: "
            f"{list(self.workers.keys())}\nGoal: {goal}"
        )
        assignments = parse_assignments(plan)  # list of (role, task) pairs

        # Execute assignments
        results = {}
        for role, task in assignments:
            worker = self.workers[role]
            results[role] = worker.execute(task)

        # Synthesize results
        synthesis = self.llm.generate(
            f"Synthesize these results into a final answer:\n"
            f"Goal: {goal}\nResults: {results}"
        )
        return synthesis

Collaborative (Peer-to-Peer)

Agents communicate directly, sharing information and coordinating:

class CollaborativeSystem:
    agents: list[Agent]
    shared_memory: SharedMemory

    def execute(self, goal: str, max_rounds: int = 5) -> str:
        self.shared_memory.set("goal", goal)

        for round_num in range(max_rounds):
            for agent in self.agents:
                # Each agent reads shared state and contributes
                context = self.shared_memory.get_relevant(agent.role)
                contribution = agent.execute(context)
                self.shared_memory.add(agent.role, contribution)

            # Check if goal is achieved
            if self.is_goal_achieved():
                return self.shared_memory.get_final_answer()

        return self.shared_memory.get_best_answer()

Debate / Adversarial

Agents argue different positions, and a judge selects the best:

class DebateSystem:
    proposer: Agent
    critic: Agent
    judge: Agent

    def execute(self, question: str) -> str:
        # Proposer generates initial answer
        proposal = self.proposer.execute(f"Answer: {question}")

        # Critic challenges the answer
        critique = self.critic.execute(
            f"Critically evaluate this answer. Find flaws, missing info, or errors:\n"
            f"Question: {question}\nAnswer: {proposal}"
        )

        # Proposer revises based on critique
        revised = self.proposer.execute(
            f"Revise your answer based on this critique:\n"
            f"Original: {proposal}\nCritique: {critique}"
        )

        # Judge selects best answer
        final = self.judge.execute(
            f"Select the best answer and explain why:\n"
            f"Original: {proposal}\nRevised: {revised}\nCritique: {critique}"
        )

        return final

Agent Frameworks

Framework | Language | Architecture | Key Features | Best For
----------|----------|--------------|--------------|---------
LangChain | Python/JS | Modular chains + agents | Extensive tool library, memory, callbacks | General-purpose agents
LangGraph | Python/JS | Graph-based workflows | Stateful, cyclical agent graphs, checkpointing | Complex agent workflows
LlamaIndex | Python | Data framework + agents | RAG-focused, query engines, data connectors | RAG-centric agents
AutoGen | Python | Multi-agent conversations | Code execution, group chat, flexible patterns | Multi-agent collaboration
CrewAI | Python | Role-based agents | Tasks, roles, delegation, process management | Team-based automation
Semantic Kernel | C#/Python/Java | Enterprise SDK | Microsoft ecosystem, planners, plugins | Enterprise applications
Haystack | Python | Pipeline-based | Modular components, production-ready | Production RAG + agents
Pydantic AI | Python | Type-safe agents | Strong typing, validation, testing support | Type-safe agent development

Framework Selection Guide

  • Starting out: LangChain or CrewAI for rapid prototyping.
  • Complex workflows: LangGraph for stateful, graph-based agent flows.
  • RAG-focused: LlamaIndex for data-heavy applications.
  • Multi-agent: AutoGen or CrewAI for collaborative agent systems.
  • Enterprise: Semantic Kernel for Microsoft stack integration.
  • Production: LangGraph or Haystack for robust, scalable deployments.

8. Agent Evaluation

Metrics

Metric | What It Measures | How to Compute
-------|------------------|---------------
Task Success Rate | % of goals fully achieved | Binary: goal met or not
Partial Completion | How much of the goal was achieved | Rubric-based scoring
Steps to Completion | Efficiency (fewer steps = better) | Count reasoning + action steps
Tool Accuracy | Correct tool selection and parameters | Compare to gold-standard tool calls
Cost | Total token usage and API costs | Sum prompt + completion tokens
Latency | Time from query to final answer | Wall-clock time
Error Recovery | Ability to recover from tool failures | Count successful recoveries

Evaluation Approaches

  1. Unit tests: Test individual tool calls with expected inputs/outputs.
  2. Integration tests: Test complete agent workflows on predefined scenarios.
  3. Benchmarks: Use established benchmarks (SWE-bench for coding, WebArena for web tasks, GAIA for general agents).
  4. Human evaluation: Have humans rate agent outputs for quality, correctness, and helpfulness.
  5. LLM-as-judge: Use a strong LLM to evaluate agent performance at scale.
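
A minimal harness sketch combining approaches 1-2 with the metrics above (the agent interface and the per-scenario checks are hypothetical):

import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    goal: str
    check: Callable[[str], bool]  # returns True if the goal was met

def evaluate(agent, scenarios: list[Scenario]) -> dict:
    successes, latencies = 0, []
    for s in scenarios:
        start = time.perf_counter()
        answer = agent.execute(s.goal)  # hypothetical agent interface
        latencies.append(time.perf_counter() - start)
        successes += s.check(answer)
    return {
        "task_success_rate": successes / len(scenarios),
        "mean_latency_s": sum(latencies) / len(latencies),
    }

scenarios = [Scenario(goal="What is 17 * 23?", check=lambda a: "391" in a)]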

9. Production Best Practices

Reliability

  1. Max iterations: Always set a maximum number of agent steps to prevent infinite loops.
  2. Timeouts: Set timeouts on tool execution to prevent hanging.
  3. Retry logic: Implement retries with exponential backoff for transient failures (see the sketch after this list).
  4. Fallbacks: When the agent fails, fall back to simpler approaches or ask for human help.
  5. Checkpointing: Save agent state periodically so long tasks can resume after failures.
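
A minimal sketch of point 3, with jitter so retries don't synchronize across agents (which exceptions count as transient is an assumption; tune the set per tool):

import random
import time

def with_retries(fn, max_attempts: int = 4, base_delay: float = 0.5):
    """Call fn(), retrying assumed-transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except (TimeoutError, ConnectionError):  # assumed transient
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))

# usage: observation = with_retries(lambda: tool.execute(**tool_args))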

Safety

  1. Tool permissions: Whitelist allowed tools per agent. Read-only agents shouldn't have write tools.
  2. Human-in-the-loop: Require approval for high-impact actions (database writes, financial transactions, emails).
  3. Sandboxing: Run code execution in isolated containers with resource limits.
  4. Input validation: Validate all tool inputs before execution. Prevent path traversal, SQL injection, etc.
  5. Output filtering: Check agent outputs for sensitive information, harmful content, or PII leakage.
  6. Audit logging: Log all agent actions, tool calls, and decisions for debugging and compliance.

Cost Control

  1. Model selection: Use smaller models for simple reasoning steps, larger models for complex decisions. Implement model routing.
  2. Token budgets: Set per-task and per-session token limits.
  3. Caching: Cache tool results and LLM responses for repeated operations (see the sketch after this list).
  4. Batching: Batch similar tool calls when possible.
  5. Early termination: Stop when the agent has enough information to answer, don't over-retrieve.
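
A minimal sketch of point 3: key the cache on the tool name plus canonicalized arguments. This is only safe for deterministic, read-only tools; TTLs and invalidation are omitted:

import hashlib
import json

_cache: dict[str, str] = {}

def cached_execute(tool, name: str, args: dict) -> str:
    # Canonical JSON makes {"a": 1, "b": 2} and {"b": 2, "a": 1} hit the same entry
    key = hashlib.sha256(json.dumps([name, args], sort_keys=True).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = tool.execute(**args)
    return _cache[key]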

Observability

  1. Structured logging: Log each step (thought, action, observation) with timestamps and metadata (see the sketch after this list).
  2. Tracing: Use tools like LangSmith, Arize Phoenix, or OpenTelemetry for end-to-end traces.
  3. Metrics: Track success rate, latency, cost, and error rates per agent type.
  4. Dashboards: Visualize agent performance trends over time.
  5. Alerting: Alert on anomalies (sudden increase in failures, cost spikes, latency degradation).
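
A sketch of point 1 using only the standard library: one JSON object per line per step, so traces can be grepped locally or shipped to any log pipeline:

import json
import logging
import time

logger = logging.getLogger("agent")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_step(step: int, thought: str, action: str, observation: str, **meta):
    logger.info(json.dumps({
        "ts": time.time(),
        "step": step,
        "thought": thought,
        "action": action,
        "observation": observation[:500],  # truncate large tool outputs
        **meta,
    }))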

Scaling

  1. Async execution: Run tool calls asynchronously when they're independent (see the sketch after this list).
  2. Queue-based: Use message queues for long-running agent tasks.
  3. Stateless design: Store agent state externally (database, cache) so any worker can resume.
  4. Horizontal scaling: Run multiple agent instances behind a load balancer.
  5. Resource limits: Cap concurrent agent executions per user and globally.
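
A sketch of point 1 with asyncio: independent tool calls fan out concurrently and results return in input order (call_tool is a hypothetical async tool wrapper):

import asyncio

async def call_tool(name: str, args: dict) -> str:
    ...  # hypothetical: async tool invocation

async def gather_evidence(queries: list[str]) -> list[str]:
    # Independent searches run concurrently instead of sequentially
    tasks = [call_tool("search_web", {"query": q}) for q in queries]
    return await asyncio.gather(*tasks)

# usage: results = asyncio.run(gather_evidence(["topic A", "topic B"]))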